CN109274922A

CN109274922A - A video conference control system based on speech recognition

Info

Publication number: CN109274922A
Application number: CN201811380150.4A
Authority: CN
Inventors: 郑广宁; 魏永静; 田兵; 刘鸿雁; 车四四; 何子亨; 李宗皓; 孙小骏; 杨超
Original assignee: State Grid Corp of China SGCC; Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; Information and Telecommunication Branch of State Grid Shandong Electric Power Co Ltd
Priority date: 2018-11-19
Filing date: 2018-11-19
Publication date: 2019-01-25

Abstract

The invention belongs to the field of video conference control systems and discloses a video conference control system based on speech recognition. The system comprises a voice command input system, a voice analysis and processing system, and a venue control system. The voice command input system receives the voice commands from each venue and transmits them to the voice analysis and processing system; the voice analysis and processing system recognizes the voice commands and sends control signals to the venue control system; the venue control system controls the equipment in the venue after receiving a control signal. The voice command input system includes a plurality of voice receiving devices, with at least one voice receiving device provided in each participating venue. The video conference control system provided by the invention can accurately identify venues with poor order and issue warnings to them, which saves the cost of maintaining venue order and also reduces external interference with video conference control.

Description

A video conference control system based on speech recognition

Technical field

The present invention relates to the technical field of video conference control systems, and in particular to a video conference control system based on speech recognition.

Background art

With the development of real-time video technology, video conferencing has become very common in modern business activities. In the prior art, however, a considerable number of conference support staff are needed to run a conference. So many staff are difficult to coordinate, and any error in their handover can cause the conference support work to fail; the problem is more pronounced in multi-venue video conferences. Meanwhile, as the business needs of each department grow, participants' demand for self-service conferencing becomes increasingly urgent. In existing conference systems the agenda must be fixed in advance so that the conference runs as planned, and it is difficult to change it in response to decisions participants make on the spot, so the user experience is poor.

Therefore, how to provide a system that can intelligently control a video conference, so that participants can schedule the conference process themselves while the scheduling procedure is simplified, the efficiency of holding video conferences is improved, and the occurrence of errors is reduced, is a technical problem that those skilled in the art need to solve.

Summary of the invention

To address technical problems in the prior art such as the weak autonomous operation capability of video conference systems and the inability of participants to schedule the conference themselves, the present invention provides a video conference control system that has strong autonomous operation capability and allows participants to schedule the conference process themselves.

To solve the above technical problems, the present invention adopts the following technical solution:

A video conference control system based on speech recognition is designed, comprising a voice command input system, a voice analysis and processing system, a venue control system, and a control instruction recording system. The voice command input system is connected to the voice analysis and processing system, the voice analysis and processing system is connected to the venue control system, and the venue control system is connected to the control instruction recording system.

The voice command input system receives the voice commands from each venue and transmits them to the voice analysis and processing system. The voice analysis and processing system recognizes the voice commands and sends control signals to the venue control system. After receiving a control signal, the venue control system issues control instructions to the equipment in the venue according to that signal. The control instruction recording system, for each of a plurality of time points on the conference timeline, extracts the instruction information of each venue based on a configuration file, and performs interaction or editing according to the extracted instruction information.

The voice command input system includes voice receiving devices, a voice receiving device control subsystem, and a speaker focusing subsystem. The speaker focusing subsystem allows the speaker's control commands to be transmitted to the system more clearly, improving the accuracy of the whole conference control system. There are a plurality of voice receiving devices, with at least one provided in each participating venue. The voice receiving device control subsystem includes a first signal acquisition unit and a control unit. The speaker focusing subsystem includes a second signal acquisition unit, a signal generation unit, a feature computation unit, and a voice receiving device control unit. Because each venue has several voice receiving devices and the number of participants differs from conference to conference, the voice receiving device control subsystem keeps the voice receiving devices that do not correspond to any participant switched off, which saves electricity costs and improves the utilization of the voice receiving devices.

This technical solution is entirely different from having venue support staff operate the voice receiving devices to issue commands. In this solution, because a plurality of voice receiving devices are provided in each participating venue, the participants in every venue can take part in scheduling the conference. This removes the need for a dedicated scheduling body, makes the conference proceed more smoothly, and allows it to be adjusted to the actual situation rather than mechanically following the direction of a scheduling body.

Preferably, each venue is provided with three voice receiving devices. This avoids the problem that arises when a venue has only one voice receiving device, namely that the control commands of participants far from the device cannot be received clearly, and it increases participant involvement. The voice receiving device receives the participants' voice information and also serves as the speaking device. This solution unifies the voice receiving devices of the conference control system with the speaking equipment of the conference, which saves equipment and removes the need to switch between devices during the conference, making the whole video conference smoother.

Preferably, the first signal acquisition unit is used to obtain the position information of the participants, and the control unit is used to switch on the voice receiving devices within the set range corresponding to the participant position information obtained by the first signal acquisition unit.

Preferably, the second signal acquisition unit includes a video acquisition unit and a first voice acquisition unit. The video acquisition unit is used to obtain video information of a plurality of participants; the first voice acquisition unit is used to obtain the audio information of the conference. The signal generation unit separately detects, in the video information obtained by the video acquisition unit, the visual signals related to each participant's speech activity, and generates a visual activity detection signal matching each participant; it also detects the audio information obtained by the first voice acquisition unit to generate a voice activity detection signal. The feature computation unit compares each of the visual activity detection signals with the voice activity detection signal and determines the participant whose visual activity detection signal has the highest correlation with the voice activity detection signal to be the current speaker. The voice receiving device control unit receives the speaker determination result from the feature computation unit and controls the voice receiving devices in the venue so that the speaker's voice is transmitted to the system more clearly.

Preferably, the voice command input system further includes a warning subsystem, which includes a third signal acquisition unit, an abnormal venue determination unit, and a reminding unit. The third signal acquisition unit is used to obtain, within a preset time interval, at least one conference signal among the audio information and video information of each venue in the video conference; the audio information is obtained by the voice receiving devices. The abnormal venue determination unit analyzes the conference signals of each venue obtained by the third signal acquisition unit and determines the venues that affect conference order. The abnormal venue determination unit includes a signal acquisition module for obtaining the audio information of each venue within a preset time period, and a signal analysis module for analyzing that audio information and determining the abnormal venues that affect conference order. The reminding unit is used to remind the abnormal venues that affect conference order.

Preferably, the signal analysis module further includes a first processing subunit and a first determination subunit. The first processing subunit obtains the audio state of each venue from the audio information of that venue; the audio state is either a speaking state or a non-speaking state. The first determination subunit, on detecting that the audio state of two or more venues is the speaking state, determines those two or more venues to be abnormal venues that affect conference order.

Preferably, the voice command input system further includes an echo processing subsystem, which includes a second voice acquisition unit and an echo processing module. The second voice acquisition unit includes several voice acquisition modules, audio vibration modules, speech detection modules, and a session control center; each voice acquisition module is connected to one audio vibration module and one speech detection module. The speech detection module detects the audio information of its corresponding voice acquisition module and sends it to the session control center. The audio vibration module detects the audio vibration information of its corresponding voice acquisition module and sends it to the session control center. The session control center receives and processes the audio information from the speech detection modules and the audio vibration information from the audio vibration modules, and sends them to the echo processing module. The echo processing module receives the audio information, eliminates the echo, and sends the echo-free audio information to an adaptive filtering module. The adaptive filtering module receives the audio information from the echo processing module, filters it, and sends it to the voice analysis and processing system.

In some conferences few participants are involved, so the venue is largely empty and the speaker's voice forms echoes in it, which greatly affects the speech recognition of the conference control system. The echo processing subsystem reduces the influence of echo and improves the accuracy of venue control.

Preferably, the control instruction recording system includes an extraction unit, an index point generation unit, a completion unit, and an interaction and editing unit. The extraction unit, for each of a plurality of time points on the conference timeline, extracts the instruction information of each venue based on a configuration file, where the conference timeline is associated with the conference time and the configuration file holds the instruction information set for the conference. The index point generation unit combines the instruction information of each venue into a key index point, which serves as the index point for interacting with or editing the instruction record. The completion unit combines the key index points corresponding to the plurality of time points into the instruction record. The interaction and editing unit interacts with or edits the instruction record according to the key information in the instruction record.
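The recording flow described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the configuration layout (venue name mapped to time point mapped to instruction), the function names `build_instruction_record` and `find_index_point`, and the use of minutes as time points are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical configuration: venue -> {time point in minutes -> instruction}.
Config = Dict[str, Dict[int, str]]


def build_instruction_record(config: Config, time_points: List[int]) -> List[dict]:
    """Build an instruction record as an ordered list of key index points."""
    record = []
    for t in sorted(time_points):
        # Extraction step: instruction information of each venue at time t.
        per_venue = {venue: cmds[t] for venue, cmds in config.items() if t in cmds}
        # Index point generation step: one key index point per time point.
        record.append({"time": t, "instructions": per_venue})
    # Completion step: the ordered key index points form the instruction record.
    return record


def find_index_point(record: List[dict], t: int) -> Optional[dict]:
    """Look up the key index point for time t, as an interaction or editing step might."""
    for point in record:
        if point["time"] == t:
            return point
    return None


config = {
    "venue-1": {0: "start", 10: "switch screen"},
    "venue-2": {10: "mute"},
}
record = build_instruction_record(config, [10, 0])
```

Keeping one index point per time point means an editing step can address a moment in the conference directly instead of scanning every venue's instructions.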

The video conference control system based on speech recognition proposed by the present invention has the following beneficial effects:

(1) The video conference control system provided by the invention allows the participants in every venue to take part in scheduling the conference. This removes the need for a dedicated scheduling body, makes the conference proceed more smoothly, and allows it to be adjusted to the actual situation rather than mechanically following the direction of a scheduling body;

(2) The video conference control system provided by the invention can also determine the speaker more accurately, so that the speaker's control commands are transmitted to the system more clearly, improving the accuracy of the whole conference control system;

(3) The video conference control system provided by the invention can accurately identify venues with poor order and issue warnings to them, which saves the cost of maintaining venue order and also reduces external interference with video conference control.

Brief description of the drawings

The present invention is described in further detail below with reference to the embodiments shown in the accompanying drawings, which do not limit the invention in any way.

Fig. 1 is a structural schematic diagram of an embodiment of the video conference control system of the present invention;

Fig. 2 is a structural schematic diagram of the first embodiment of the voice command input system of the present invention;

Fig. 3 is a structural schematic diagram of an embodiment of the voice command input system of the present invention;

Fig. 4 is a structural schematic diagram of an embodiment of the voice command input system of the present invention;

Fig. 5 is a structural schematic diagram of an embodiment of the second signal acquisition unit of the present invention;

Fig. 6 is a structural schematic diagram of an embodiment of the voice command input system of the present invention;

Fig. 7 is a structural schematic diagram of an embodiment of the signal analysis module of the present invention;

Fig. 8 is a structural schematic diagram of an embodiment of the signal analysis module of the present invention;

Fig. 9 is a structural schematic diagram of an embodiment of the signal analysis module of the present invention;

Fig. 10 is a structural schematic diagram of an embodiment of the voice command input system of the present invention;

Fig. 11 is a structural schematic diagram of an embodiment of the second voice acquisition unit;

Fig. 12 is a structural schematic diagram of an embodiment of the control instruction recording system of the present invention;

Fig. 13 is a structural schematic diagram of an embodiment of the video conference system of the present invention;

Fig. 14 is a structural schematic diagram of an embodiment of the voice analysis and processing system of the present invention.

In the figures: voice command input system 1, voice receiving device 11, voice receiving device control subsystem 12, first signal acquisition unit 121, control unit 122, speaker focusing subsystem 13, second signal acquisition unit 131, video acquisition unit 1311, first voice acquisition unit 1312, signal generation unit 132, feature computation unit 133, voice receiving device control unit 134, warning subsystem 14, third signal acquisition unit 141, abnormal venue determination unit 142, signal acquisition module 1421, signal analysis module 1422, first processing subunit 14221, first determination subunit 14222, second processing subunit 14223, second statistics subunit 14224, second determination subunit 14225, speech recognition subunit 14226, third determination subunit 14227, third processing subunit 14228, fourth determination subunit 14229, reminding unit 143, echo processing subsystem 15, second voice acquisition unit 151, voice acquisition module 1511, audio vibration module 1512, speech detection module 1513, session control center 1514, echo processing module 152, voice analysis and processing system 2, venue control system 3, control instruction recording system 4, extraction unit 41, index point generation unit 42, completion unit 43, interaction and editing unit 44.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Fig. 1 is a structural schematic diagram of an embodiment of the video conference control system of the present invention. Referring to Fig. 1, the video conference control system includes a voice command input system 1, a voice analysis and processing system 2, a venue control system 3, and a control instruction recording system 4. The voice command input system 1 receives the voice commands from each venue and transmits them to the voice analysis and processing system 2. The voice analysis and processing system 2 recognizes the voice commands and sends control signals to the venue control system 3. The venue control system 3 controls the equipment in the venue after receiving a control signal. The control instruction recording system 4, for each of a plurality of time points on the conference timeline, extracts the instruction information of each venue based on a configuration file, and performs interaction or editing according to the extracted instruction information.

Fig. 2 is a structural schematic diagram of the first embodiment of the voice command input system 1 of the present invention. The voice command input system 1 includes a plurality of voice receiving devices 11, and a plurality of voice receiving devices 11 are provided in each participating venue. The voice receiving devices 11 may be fixed equipment installed in each venue, movable equipment, or a combination of the two. In this solution, because a plurality of voice receiving devices 11 are provided in each participating venue, the participants in every venue can take part in scheduling the conference. This removes the need for a dedicated scheduling body, makes the conference proceed more smoothly, and allows it to be adjusted to the actual situation rather than mechanically following the direction of a scheduling body. The plurality of voice receiving devices 11 in each venue also ensure that the control commands of participants at different positions are all received clearly. Further, in this embodiment the voice receiving device 11 also serves as the speaking device. This unifies the voice receiving devices 11 of the conference control system with the speaking equipment of the conference, which saves equipment and removes the need to switch between devices during the conference, making the whole video conference smoother.

Fig. 3 is a structural schematic diagram of another embodiment of the voice command input system 1 of the present invention. The voice command input system 1 further includes a voice receiving device control subsystem 12, which includes a first signal acquisition unit 121 and a control unit 122; the first signal acquisition unit 121 is installed in each venue. In a venue, a voice receiving device 11 can be provided in every area where a participant may be seated. A conventional arrangement is for each seat to correspond to at least one voice receiving device 11, with the correspondence between seats and voice receiving devices 11 stored by default in the control unit 122. The first signal acquisition unit 121 obtains the position information of the participants when the conference starts and transmits it to the control unit 122. The control unit 122 switches on the voice receiving devices 11 within the corresponding set range according to the participants' position information. With the voice receiving device control subsystem 12, voice receiving devices 11 in the venue that are not needed for the time being are not switched on at the start of the conference, which saves electricity costs and improves the utilization of the voice receiving devices.
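As an illustration of the seat-to-device correspondence held by the control unit 122, the following Python sketch selects which devices to switch on from the occupied seats. The seat and device identifiers, the `SEAT_TO_DEVICES` table, and the function name `devices_to_open` are hypothetical; the source does not specify how positions are encoded.

```python
from typing import Dict, Iterable, Set

# Hypothetical default mapping stored in the control unit: seat id -> device ids.
SEAT_TO_DEVICES: Dict[str, Set[str]] = {
    "A1": {"mic-1"},
    "A2": {"mic-1", "mic-2"},
    "B1": {"mic-3"},
}


def devices_to_open(occupied_seats: Iterable[str],
                    seat_map: Dict[str, Set[str]] = SEAT_TO_DEVICES) -> Set[str]:
    """Return the devices that serve at least one occupied seat.

    Devices that serve no occupied seat are left switched off.
    """
    opened: Set[str] = set()
    for seat in occupied_seats:
        # Unknown seats contribute no devices rather than raising an error.
        opened |= seat_map.get(seat, set())
    return opened
```

With seats A1 and A2 occupied, only `mic-1` and `mic-2` are switched on, and `mic-3` stays off until someone sits at B1.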

Fig. 4 is a structural schematic diagram of a further embodiment of the voice command input system 1 of the present invention. The voice command input system 1 further includes a speaker focusing subsystem 13, which includes a second signal acquisition unit 131, a signal generation unit 132, a feature computation unit 133, and a voice receiving device control unit 134.

As shown in Fig. 5, the second signal acquisition unit 131 includes a video acquisition unit 1311 for obtaining video information of a plurality of participants. The second signal acquisition unit 131 further includes a first voice acquisition unit 1312 for obtaining the audio information of the conference; preferably, the first voice acquisition unit 1312 is the voice receiving device 11.

The signal generation unit 132 separately detects, in the video information obtained by the video acquisition unit 1311, the visual signals related to each participant's speech activity, and generates a visual activity detection signal matching each participant, such as VVAD1, VVAD2, VVAD3, and so on. A speaker's speaking state is usually accompanied by rapid, continuous movement of the mouth, which causes the area of the gap between the lips to change continuously. In one solution, therefore, the visual activity is preferably the lip activity pattern of the participant. The video acquisition unit 1311 performs independent visual activity detection on each of the participants: it obtains the lip contour from the color difference between the lips and the face, and determines the area of the lip gap from the differences in brightness and color of the gap between the upper and lower lips. When the difference in this area between consecutive video frames exceeds a preset threshold, the participant's visual activity detection signal is output as "1"; otherwise it is output as "0".

At the same time, the audio information obtained by the first voice acquisition unit 1312 is detected to generate the voice activity detection signal AVAD; the first voice acquisition unit 1312 obtains the voice activity detection signal by detecting the audio information. When speech is present in the audio information, the voice activity detection signal is output as "1"; otherwise it is output as "0".

The feature computation unit 133 compares each of the visual activity detection signals with the voice activity detection signal and determines the participant whose visual activity detection signal has the highest correlation with the voice activity detection signal to be the current speaker. In one solution, the feature computation unit 133 uses components such as comparison circuits and comparators to obtain the correlation between each visual activity detection signal VVAD1, VVAD2, VVAD3, etc. and the voice activity detection signal AVAD, and determines the participant with the greatest correlation to be the speaker.
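The thresholding and comparison described above can be sketched in software as follows. This is only an illustration: the source describes comparison circuits and does not define the correlation measure, so the fraction of frames on which the two binary signals agree is used here as an assumed stand-in. The function names and the sample values are hypothetical.

```python
from typing import Dict, List


def vvad_from_areas(lip_gap_areas: List[float], threshold: float) -> List[int]:
    """Output 1 when the lip-gap area changes by more than the threshold
    between consecutive frames, otherwise 0."""
    return [
        1 if abs(curr - prev) > threshold else 0
        for prev, curr in zip(lip_gap_areas, lip_gap_areas[1:])
    ]


def agreement(vvad: List[int], avad: List[int]) -> float:
    """Assumed correlation measure: share of frames where both signals match."""
    if len(vvad) != len(avad) or not avad:
        raise ValueError("signals must be non-empty and of equal length")
    return sum(1 for v, a in zip(vvad, avad) if v == a) / len(avad)


def pick_speaker(vvads: Dict[str, List[int]], avad: List[int]) -> str:
    """Return the participant whose visual signal best matches the audio signal."""
    return max(vvads, key=lambda name: agreement(vvads[name], avad))


avad = [0, 1, 1, 1, 0, 0]
vvads = {
    "P1": [0, 0, 0, 0, 0, 0],
    "P2": [0, 1, 1, 1, 0, 1],
    "P3": [1, 0, 0, 1, 1, 1],
}
speaker = pick_speaker(vvads, avad)  # "P2": it agrees with AVAD on 5 of 6 frames
```

A plain agreement ratio rewards matching silence as well as matching speech; a deployment with long silent stretches might prefer a measure restricted to the frames where AVAD is 1.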

The voice receiving device control unit 134 receives the speaker determination result from the feature computation unit 133 and controls the voice receiving devices 11 in the venue so that the speaker's voice is transmitted to the system more clearly. The control method may be to switch off the voice receiving devices 11 that are unrelated to the speaker, or to control all voice receiving devices 11 in the venue so that they face the speaker.

The speaker focusing subsystem 13 allows the speaker's control commands to be transmitted to the system more clearly, improving the accuracy of the whole conference control system.

Fig. 6 is a structural schematic diagram of the third embodiment of the voice command input system 1 of the present invention. The voice command input system 1 includes a warning subsystem 14, which includes a third signal acquisition unit 141, an abnormal venue determination unit 142, and a reminding unit 143.

The third signal acquisition unit 141 is used to obtain, within a preset time interval, at least one conference signal among the audio information and video information of each venue in the video conference; in this embodiment the audio information can be obtained by the voice receiving devices 11. The abnormal venue determination unit 142 analyzes the conference signals of each venue obtained by the third signal acquisition unit 141 and determines the venues that affect conference order. The abnormal venue determination unit 142 includes a signal acquisition module 1421 for obtaining the audio information of each venue within a preset time period, and a signal analysis module 1422 for analyzing that audio information and determining the abnormal venues that affect conference order. The reminding unit 143 is used to remind the abnormal venues that affect conference order. The reminder may take the form of voice, text, or similar; the voice receiving devices 11 in an abnormal venue can also be temporarily switched off to prevent them from affecting command recognition.

One embodiment of the signal analysis module 1422 is shown in Fig. 8, and its control flow is as follows:

S101: the first processing subunit 14221 obtains the audio state of each venue from the audio information of that venue, the audio state being either a speaking state or a non-speaking state;

S102: the first determination subunit 14222, on detecting that the audio state of two or more venues is the speaking state, determines those two or more venues to be abnormal venues that affect conference order.

Specifically, in S101 the voice state of each venue is obtained by determining from the audio information of each venue whether that venue is in a speaking state. For a given venue at a given moment, if the audio information is determined to be speech, the voice activity level of the venue at that moment is 1, indicating that the venue is in the speaking state; otherwise the voice activity level is 0, indicating that nobody in the venue is speaking and the venue is in the non-speaking state. For S102, take a conference with three venues as an example. If during a certain period venue 1 and venue 2 speak alternately, the people in the two venues can be considered to be taking turns to speak, and the command control of the whole conference is in a normal state. If during a certain period venue 1 and venue 3 speak at the same time, then in that period venue 1 and venue 3 are considered to be affecting conference order.
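Steps S101 and S102 can be sketched in Python as follows, assuming the per-venue voice activity levels (0 or 1) have already been sampled at common moments. The function name `simultaneous_speakers` and the venue labels are illustrative only.

```python
from typing import Dict, List, Set


def simultaneous_speakers(activity: Dict[str, List[int]]) -> Set[str]:
    """Return the venues that were in the speaking state at the same moment
    as at least one other venue.

    `activity` maps each venue to its voice activity level (1 = speaking,
    0 = not speaking) sampled at the same moments for every venue.
    """
    abnormal: Set[str] = set()
    if not activity:
        return abnormal
    # Only compare moments for which every venue has a sample.
    samples = min(len(levels) for levels in activity.values())
    for t in range(samples):
        talking = [venue for venue, levels in activity.items() if levels[t] == 1]
        if len(talking) >= 2:
            abnormal.update(talking)
    return abnormal


# Venues 1 and 2 take turns: normal.
alternating = {"venue-1": [1, 0, 1, 0], "venue-2": [0, 1, 0, 1], "venue-3": [0, 0, 0, 0]}
# Venues 1 and 3 overlap at the second moment: both are flagged.
overlapping = {"venue-1": [1, 1, 0], "venue-2": [0, 0, 0], "venue-3": [0, 1, 1]}
```

The two sample inputs mirror the three-venue example in the text: alternating speech flags nothing, while overlapping speech flags exactly the venues that overlapped.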

A second embodiment of the signal analysis module 1422 is shown in Fig. 9, and its control flow is as follows:

S201: the second processing subunit 14223 obtains the audio state of each venue from the audio information of that venue, the audio state being either a speaking state or a non-speaking state;

S202: the second statistics subunit 14224 counts the speaking duration of the venues whose audio state is the speaking state;

S203: the second determination subunit 14225 calculates the ratio of each such venue's speaking duration to the preset time interval, and when the ratio is greater than a preset ratio threshold, determines the venue to be a candidate abnormal venue;

S204: the speech recognition subunit 14226 converts the speech in the audio information of each candidate abnormal venue to text;

S205: the third determination subunit 14227 compares the recognized text of each candidate abnormal venue with preset keywords, and judges the candidate abnormal venues in which no keyword appears to be abnormal venues that affect conference order.

Specifically, in S203 a time interval can be preset, namely the usual length of time a participant takes to issue a control command. When the ratio of a venue's speaking duration to this usual length exceeds a set threshold, the venue has been speaking for too long, and speech other than control commands, such as participants chatting, may be occurring. In S205, keywords for the conference discussion content can be preset. After S204 has recognized the text corresponding to the speech in each venue, the text can be compared with the keywords. When the content discussed by the people in a venue does not involve, that is, does not contain, the keywords, it can be determined that the venue is discussing content unrelated to control commands, and the venue can be judged to be an abnormal venue that affects conference order. For example, suppose the topic of a conference is a discussion of energy saving in the power transmission process. Keywords for some control commands can be determined in advance from the conference topic, such as speaker information, agenda, discussion topic, screen switching, and tea break. After the conference starts, the speech in each venue is recognized and semantically analyzed. When the participants' speech is found not to contain the preset keywords, the topic under discussion in that venue is considered unrelated to the conference control commands, the venue is an abnormal venue that affects conference order, and it can be reminded.
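Steps S203 and S205 can be sketched as below. The speech-to-text step S204 is not implemented; its output is assumed to be available as a plain transcript per venue. The function name, the parameter names, the substring-based keyword match, and the sample values are all assumptions for illustration.

```python
from typing import Dict, List


def abnormal_venues(speech_seconds: Dict[str, float],
                    transcripts: Dict[str, str],
                    window_seconds: float,
                    ratio_threshold: float,
                    keywords: List[str]) -> List[str]:
    """Flag venues that speak for too large a share of the window (S203)
    and whose transcript contains none of the preset keywords (S205)."""
    abnormal = []
    for venue, seconds in speech_seconds.items():
        if seconds / window_seconds <= ratio_threshold:
            continue  # not a candidate abnormal venue
        # Assumed to be the output of the speech recognition step (S204).
        text = transcripts.get(venue, "").lower()
        if not any(keyword.lower() in text for keyword in keywords):
            abnormal.append(venue)
    return abnormal


flagged = abnormal_venues(
    speech_seconds={"venue-1": 50, "venue-2": 55, "venue-3": 5},
    transcripts={
        "venue-1": "Please switch the screen to venue two",
        "venue-2": "Did you watch the game last night",
        "venue-3": "Lunch?",
    },
    window_seconds=60,
    ratio_threshold=0.5,
    keywords=["screen", "agenda"],
)
```

In this sample, venue-1 speaks for most of the window but mentions a keyword, venue-3 speaks too briefly to become a candidate, and only venue-2 is flagged.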

The third embodiment of the signal analysis module 1422 is shown in Figure 10; its control flow is as follows:

S301: the third processing subelement 14228 obtains the audio volume of each meeting-place from that meeting-place's audio information; S302: the fourth judgement subelement 14229 determines any meeting-place whose audio volume is greater than a preset volume threshold to be an abnormal meeting-place affecting meeting order.

Specifically, in S302, the volume of each meeting-place can be used to determine whether its speech is normal. If the volume is too high, the speech is considered not to be a normal control instruction and may be a quarrel or disorderly noise. A meeting-place whose volume is too high can therefore be determined to be a meeting-place affecting meeting order, and such meeting-places can be reminded. A volume threshold can be preset, for example 90 decibels or 100 decibels; when the volume of a meeting-place exceeds the preset volume threshold, the volume of that meeting-place is determined to be excessive.
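The S302 comparison reduces to a threshold test per meeting-place. The sketch below assumes the volume of each meeting-place has already been measured in decibels; the function name and the sample readings are hypothetical.

```python
def venues_over_volume(volume_db, threshold_db=90.0):
    """Return the meeting-places whose measured volume exceeds the threshold."""
    return [venue for venue, level in volume_db.items() if level > threshold_db]


# Hypothetical readings; 90 dB is one of the example thresholds in the text.
levels = {"venue_a": 62.5, "venue_b": 95.0, "venue_c": 90.0}
loud = venues_over_volume(levels)
```

A reading exactly at the threshold is not flagged here, because the text requires the volume to exceed the threshold.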

The warning subsystems 14 can accurately determine meeting-places with poor order and issue warnings to them, or even temporarily switch off their pronunciation receivers 11. This not only saves the cost of maintaining meeting-place order but also reduces external interference with video conference control.

Figure 10 is a structural schematic diagram of another embodiment of the phonetic order input system 1 of the present invention. The phonetic order input system 1 includes an echo processing subsystem 15, which includes a second voice acquisition unit 151 and an echo processing module 152.

The structure of the second voice acquisition unit 151 is shown in Figure 12. It includes multiple voice obtaining modules 1511, audio frequency vibration modules 1512, speech detection modules 1513 and a session control center 1514. Each voice obtaining module 1511 is connected to one audio frequency vibration module 1512 and one speech detection module 1513. In this embodiment, the second voice acquisition unit 151 is the pronunciation receiver 11.

The speech detection module 1513 detects the audio information of the corresponding voice obtaining module 1511 and sends it to the session control center 1514. The audio frequency vibration module 1512 detects the audio vibration information of the corresponding voice obtaining module 1511 and sends it to the session control center 1514. The session control center 1514 receives the audio information from the speech detection module 1513 and compares it with a database to check whether the audio information contains a preset audio. When it contains the preset audio, the session control center sends a close-microphone instruction to the corresponding voice obtaining switch module, which receives the instruction and closes the corresponding voice obtaining module 1511; the session control center then continues to receive the vibration information sent by the audio frequency vibration module and sends an open-microphone instruction to the voice obtaining switch module, which receives the instruction and opens the corresponding voice obtaining module 1511. When the audio information does not contain the preset audio, the session control center continues to receive information from the speech detection module and does not receive the vibration information of the audio frequency vibration module. The session control center 1514 receives the voice information from the speech detection module and sends it to the echo processing module 152.

The echo processing module 152 receives the audio information, eliminates the echo, and sends the echo-cancelled audio information to the adaptive-filtering module. The adaptive-filtering module receives the audio information from the echo processing module and, after filtering, sends it to the speech analysis processing system 2.
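The text does not specify which echo elimination or adaptive filtering algorithm is used. As an illustration only, the sketch below uses a normalized least-mean-squares (NLMS) adaptive filter, which is one common way to cancel echo when a reference copy of the echoed signal is available. The function name, the parameters, the synthetic signals and the three-tap echo path are all assumptions made for this example.

```python
import random


def nlms_echo_cancel(reference, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`."""
    weights = [0.0] * taps   # running estimate of the echo path
    recent = [0.0] * taps    # latest reference samples, newest first
    cleaned = []
    for x, d in zip(reference, mic):
        recent = [x] + recent[:-1]
        predicted = sum(w * r for w, r in zip(weights, recent))
        error = d - predicted                      # echo-reduced sample
        energy = sum(r * r for r in recent) + eps  # normalization term
        weights = [w + mu * error * r / energy
                   for w, r in zip(weights, recent)]
        cleaned.append(error)
    return cleaned


# Synthetic check: the microphone hears only an echo of the reference signal,
# produced by a hypothetical three-tap echo path.
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
echo_path = [0.6, 0.3, 0.1]
mic = [sum(echo_path[k] * reference[n - k]
           for k in range(len(echo_path)) if n - k >= 0)
       for n in range(len(reference))]
residual = nlms_echo_cancel(reference, mic)
```

Once the filter has converged, the residual signal carries far less energy than the original echo. A real meeting-place would also contain near-end speech and noise, which this sketch does not model.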

In some meetings, few participants attend, so the meeting-place is largely empty and the speaker's voice forms an echo in it. This echo greatly affects the speech recognition of the conference control system. The echo processing subsystem 15 reduces the influence of the echo and improves the accuracy of meeting-place control.

Figure 14 shows a specific embodiment of the speech analysis processing system 2 provided by the present invention. Unless otherwise specified, the other embodiments of the present invention all use this method for speech analysis processing.

In speech recognition, as the number of words in the vocabulary increases, the likelihood of selecting the wrong word may also increase. To improve the accuracy of speech-to-text conversion, the speech recognition system must become more intelligent by reducing the size of the vocabulary. One way to reduce the vocabulary is to personalize the vocabulary of the system. For example, the system can preload the vocabulary of the field to which the meeting relates, such as the petroleum, electric power or intellectual property industry. Another way to reduce the vocabulary size is to personalize the vocabulary for a particular individual, for example by intelligently harvesting network data from the terminal devices used by participants to create a personal vocabulary.

Speech or audio is received at one or more endpoints. The decoder receives inputs from the acoustic model, the dictionary model and the language model 107 in order to decode the speech. The decoder 10 converts the speech 101 into text, which is output as a word lattice. The decoder can also calculate a confidence score, which may be a confidence interval.

The speech can be an analog signal. The analog signal can be encoded at different sampling rates (that is, the number of samples per second; the most common are 8 kHz, 16 kHz, 32 kHz, 44.1 kHz, 48 kHz and 96 kHz) and/or with different numbers of bits per sample (the most common are 8 bits, 16 bits or 32 bits).
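The effect of these two parameters on data volume can be shown with a short calculation. The sketch assumes uncompressed PCM audio, which the text does not state, and the function name is hypothetical.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample, channels=1):
    """Data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8


# 16 kHz, 16-bit, single-channel audio from one pronunciation receiver.
rate = pcm_bytes_per_second(16_000, 16)
one_minute = rate * 60
```

Under these assumptions one minute of 16 kHz, 16-bit mono audio occupies 1,920,000 bytes, so the choice of sampling rate and bit depth directly sets the bandwidth each meeting-place needs.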

One or more of the acoustic model, the dictionary model and the language model can be stored in the decoder or received from an external database.

The acoustic model can be created from a statistical analysis of speech and of transcripts produced by humans. The statistical analysis concerns the sounds that make up each word. The acoustic model can be created by a procedure called "training", in which the user speaks specified words to the speech recognition system. The dictionary model 105 is a pronunciation vocabulary. For example, the same word can be pronounced in different ways: the word "electric power" is pronounced differently in the Shandong region than in Fujian Province. The speech recognition system uses the dictionary model to identify the various pronunciations. The acoustic model and the language model are optional.

The language model defines the probability of a word appearing in a sentence. For example, the speech recognition system may recognize a piece of speech as either "conveying" or "comfortable", each with equal likelihood. However, if the subsequent word is recognized as "electric power", the language model indicates that the earlier word is very likely "conveying" rather than "comfortable". The language model can be built from text data. It may include a probability distribution over word sequences. The probability distribution can be a conditional probability (that is, the probability of the next word given that another word has occurred).
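The conditional probability described here can be illustrated with a bigram model estimated from counts. This is a minimal sketch of one simple kind of language model and not necessarily the one used by the system; the function names and the tiny training corpus are hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Count how often each word is followed by each next word."""
    starts, pairs = Counter(), Counter()
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            starts[prev] += 1
            pairs[(prev, nxt)] += 1
    return starts, pairs


def next_word_probability(starts, pairs, prev, nxt):
    """Estimate P(nxt | prev); 0.0 when prev was never followed by any word."""
    if starts[prev] == 0:
        return 0.0
    return pairs[(prev, nxt)] / starts[prev]


# Hypothetical training text, already split into words.
corpus = [
    ["conveying", "electric", "power"],
    ["conveying", "electric", "power"],
    ["conveying", "water"],
    ["comfortable", "chair"],
]
starts, pairs = train_bigram(corpus)

# Two equally likely acoustic candidates, followed by the word "electric".
candidates = ["conveying", "comfortable"]
best = max(candidates,
           key=lambda w: next_word_probability(starts, pairs, w, "electric"))
```

With this corpus the word following the ambiguous one settles the choice in favor of "conveying", as in the example above.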

The decoder can convert the audio or speech of an ongoing meeting. In this way, specific viewpoints or text arising during the session are recorded quickly.

The decoder can be a network device, such as a cloud computing center. The decoder includes a controller, a memory, a database and a communication interface, the communication interface including an input interface and an output interface. The input interface receives the speech from the endpoints. The output interface can provide the decoded text to an external database or a search engine. Alternatively, the decoded text can be stored in the database.

One or more of the acoustic model, the dictionary model and the language model can be stored in the memory or the database. The memory can be any known type of volatile or non-volatile memory. The memory may include one or more of read-only memory (ROM), dynamic random access memory (DRAM), static random access memory (SRAM), programmable read-only memory (PROM), flash memory, electrically erasable programmable read-only memory (EEPROM), random access memory (RAM) or other types of memory. The memory may include an optical, magnetic (hard disk drive) or any other form of data storage device. The memory can also be removable from the device, such as a secure digital (SD) memory card.

The database can be located outside the decoder or contained in the decoder. The database can be stored in the memory or stored separately. The database can take the form of hardware or software.

The memory can store computer-executable instructions, and the controller can execute them. The computer-executable instructions can be included in computer code, which can be stored in the memory. The computer code can be written in any computer language, such as C, C++, C#, Java, Pascal, Visual Basic, Perl, hypertext markup language (HTML), JavaScript, assembly language, extensible markup language (XML), or any combination thereof.

The computer code can be logic encoded in one or more tangible media, or one or more non-transitory tangible media, for execution by the controller. Logic encoded in one or more tangible media for execution can be defined as instructions that are executable by the controller and that are provided on a computer-readable storage medium, a memory, or a combination thereof. Instructions for commanding the network device can be stored in any logic. As used herein, "logic" includes, but is not limited to, hardware, firmware, software executed on a machine, and/or combinations of each, used to perform one or more functions or actions and/or to cause a function or action from another logic, method and/or system. Logic can include, for example, a software-controlled microprocessor, an ASIC, an analog circuit, a digital circuit, a programmed logic device, and a memory device containing instructions.

Instructions can be stored on any computer-readable medium. Computer-readable media can include, but are not limited to, floppy disks, hard disks, application-specific integrated circuits (ASICs), compact discs (CDs), other optical media, random access memory (RAM), read-only memory (ROM), memory chips or cards, memory sticks, and other media from which a computer, processor or other electronic device can read.

The controller may include a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array, an analog circuit, a digital circuit, a server processor, a combination of the above, or other processors now known or later developed. The controller can be a single device or a combination of multiple devices, for example devices associated with a network or with distributed processing. In addition, those skilled in the art will appreciate that the controller can implement the Viterbi decoding algorithm for speech recognition. Any of various processing strategies can be used, such as multiprocessing, multitasking, parallel processing, remote processing and centralized processing. The controller can be responsive to, or operable to execute, instructions stored as software, hardware, integrated circuits, firmware, microcode and the like. The functions, actions, methods or tasks shown in the accompanying drawings or described herein can be performed by the controller executing instructions stored in the memory. These functions, actions, methods or tasks are independent of the particular type of instruction set, storage medium, processor or processing strategy, and can be performed by software, hardware, integrated circuits, firmware, microcode and the like, operating alone or in combination. These instructions implement the processes, techniques, methods or actions described herein.

Those skilled in the art will understand that the pronunciation receiver control subsystem 12, the spokesman focusing subsystem 13, the warning subsystems 14 and the echo processing subsystem 15 can be selected according to actual needs and can be used in combination, and that the present invention places no particular limit on how these subsystems are combined. Those skilled in the art should also understand that using these subsystems in combination gives rise to further embodiments, which do not depart from the principle of the present invention and also fall within the scope of protection of the present invention.

Further, the Video Conference Controlling System also includes a control instruction record system 4, which includes an extraction unit 41, an index point generation unit 42, a completion unit 43, and an interaction and editing unit 44. S401: at each of multiple time points on the meeting timeline, the extraction unit 41 extracts the key information of each meeting-place based on a configuration file, where the meeting timeline is associated with the meeting time and the configuration file is used to define the instruction information of the meeting; S402: the index point generation unit 42 combines the instruction information of each meeting-place into a key index point, which serves as an index point for interacting with or editing the instruction record, that is, the key information of each meeting-place is combined into a key index point corresponding to all the information contained in that key information; S403: the completion unit 43 generates the instruction record from the multiple key index points corresponding to the multiple time points; S404: the interaction and editing unit 44 interacts with or edits the instruction record according to the key information in the instruction record.

In S401, the configuration file generally includes a voice and video detection and recognition module, a key information extraction module, and an event determination and analysis module. The key information includes one or more of the following: faces, body movements, voice, key frames and custom events. A custom event can be a special event in instruction control, for example a scene such as giving an instruction, refusing or arguing, and can also include other user-defined matters. The format of the instruction record is a text file, audio file, video file, Flash file or PPT file.

In S402, for example, if the key information defined by the configuration file includes faces and voice, the face key information and the voice key information corresponding to a given time point on the meeting timeline are extracted for each meeting-place and then combined into one key index point.

In S403, the key index points generated at the multiple time points are combined to produce the instruction record of the video conference. Specifically, the instruction record is generated by linking the multiple key index points in series according to a certain motion pattern.
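Steps S401 to S403 can be pictured as grouping extracted items by time point and then ordering the groups along the meeting timeline. The following minimal sketch assumes the key information has already been extracted as simple tuples; the class name, the field names and the sample data are hypothetical, and the chronological ordering is only one possible way of linking the points.

```python
from dataclasses import dataclass


@dataclass
class KeyIndexPoint:
    time_s: float  # position on the meeting timeline, in seconds
    info: dict     # meeting-place -> {kind of key information: value}


def build_instruction_record(extracted):
    """Group (time_s, venue, kind, value) items into ordered key index points."""
    points = {}
    for time_s, venue, kind, value in extracted:
        # S402: combine all key information for one time point into one point.
        point = points.setdefault(time_s, KeyIndexPoint(time_s, {}))
        point.info.setdefault(venue, {})[kind] = value
    # S403: link the key index points in time order to form the record.
    return [points[t] for t in sorted(points)]


# Hypothetical output of the extraction step S401.
extracted = [
    (60.0, "A", "voice", "switch screen"),
    (0.0, "A", "face", "speaker 1"),
    (0.0, "A", "voice", "start meeting"),
    (0.0, "B", "voice", "ok"),
]
record = build_instruction_record(extracted)
```

Each entry of the resulting list can then serve as the index point that a participant selects when interacting with or editing the record.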

In S404, to obtain a more complete instruction record, participants can interact with or edit the instruction record according to the key information in it. For example, when a participant clicks a name in the instruction record, brief information about that person can be displayed in real time, or a further reference index can be provided, so that the participant can verify the instruction.

The control instruction record system 4 helps participants record the key instructions issued throughout the meeting and summarize the meeting process. By interpreting the key instructions, participants can also analyze matters beyond the meeting content itself, such as which meeting-places keep good order and which meeting-places issue more effective instructions.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, devices or units, and can be electrical, mechanical or of another form.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they can be located in one place or distributed across multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the various embodiments of the present invention can be integrated into one processing unit, each unit can exist physically on its own, or two or more units can be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as an independent product, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which can be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims

1. A Video Conference Controlling System based on speech recognition, characterized in that it includes a phonetic order input system (1), a speech analysis processing system (2), a meeting station control system (3) and a control instruction record system (4); the phonetic order input system (1) is connected to the speech analysis processing system (2), the speech analysis processing system (2) is connected to the meeting station control system (3), and the meeting station control system (3) is connected to the control instruction record system (4);

the phonetic order input system (1) is used to receive the phonetic orders of each meeting-place and transfer them to the speech analysis processing system (2); the speech analysis processing system (2) identifies the phonetic orders and issues a control signal to the meeting station control system (3); the meeting station control system (3), after receiving the control signal, issues control instructions to the equipment in the meeting-place according to the control signal; the control instruction record system (4) is used to extract, at each of multiple time points on the meeting timeline, the instruction information of each meeting-place based on a configuration file, and to interact or edit according to the extracted instruction information;

the phonetic order input system (1) includes pronunciation receivers (11), a pronunciation receiver control subsystem (12) and a spokesman focusing subsystem (13); multiple pronunciation receivers (11) are provided, with at least one pronunciation receiver (11) in each participating meeting-place; the pronunciation receiver control subsystem (12) includes a first signal acquiring unit (121) and a control unit (122); the spokesman focusing subsystem (13) includes a second signal acquiring unit (131), a signal generation unit (132), a signature computation unit (133) and a pronunciation receiver control unit (134).

2. The Video Conference Controlling System based on speech recognition according to claim 1, characterized in that no fewer than three pronunciation receivers (11) are provided in each meeting-place, for receiving the voice information of participants.

3. The Video Conference Controlling System based on speech recognition according to claim 1, characterized in that the first signal acquiring unit (121) is used to obtain the location information of participants, and the control unit (122) is used to switch on the pronunciation receivers (11) within a set range corresponding to the participant location information obtained by the first signal acquiring unit (121).

4. The Video Conference Controlling System based on speech recognition according to claim 1, characterized in that the second signal acquiring unit (131) includes a video acquisition unit (1311) and a first voice acquisition unit (1312); the video acquisition unit (1311) is used to obtain the video information of multiple participants; the first voice acquisition unit (1312) is used to obtain the audio information of the meeting; the signal generation unit (132) separately detects, in the video information obtained by the video acquisition unit, the visual signal related to each participant's speech activity and generates a visual activity detection signal matching each participant, and at the same time detects the audio information obtained by the first voice acquisition unit (1312) to generate a voice activity detection signal; the signature computation unit (133) is used to compare each of the multiple visual activity detection signals with the voice activity detection signal and to determine the participant corresponding to the visual activity detection signal with the highest correlation with the voice activity detection signal as the current spokesman; the pronunciation receiver control unit (134) receives the spokesman determination result from the signature computation unit and controls the pronunciation receivers (11) in the meeting-place so that the spokesman's voice is transmitted into the system more clearly.

5. The Video Conference Controlling System based on speech recognition according to claim 1, characterized in that the phonetic order input system (1) further includes warning subsystems (14), which include a third signal acquiring unit (141), an abnormal meeting-place determination unit (142) and a reminding unit (143); the third signal acquiring unit (141) is used to obtain, within a preset time section, at least one conference signal among the audio information and the video information of each meeting-place in the video conference, the audio information being obtained by the pronunciation receivers (11); the abnormal meeting-place determination unit (142) is used to analyze the conference signal of each meeting-place obtained by the third signal acquiring unit (141) and to determine the meeting-places affecting meeting order; the abnormal meeting-place determination unit (142) includes a signal acquisition module (1421) for obtaining the audio information of each meeting-place within the preset time period; the abnormal meeting-place determination unit (142) further includes a signal analysis module (1422) for analyzing the audio information of each meeting-place within the preset time period and determining the abnormal meeting-places affecting meeting order; the reminding unit (143) is used to remind the abnormal meeting-places affecting meeting order.

6. The Video Conference Controlling System based on speech recognition according to claim 5, characterized in that the signal analysis module (1422) further includes a first processing subelement (14221) and a first determination subelement (14222); the first processing subelement (14221) is used to obtain the audio state of each meeting-place from that meeting-place's audio information, the audio state being either a speaking state or a non-speaking state; the first determination subelement (14222), on detecting that the audio states of two or more meeting-places are the speaking state, determines those two or more meeting-places to be abnormal meeting-places affecting meeting order.

7. The Video Conference Controlling System based on speech recognition according to claim 1, characterized in that the phonetic order input system (1) further includes an echo processing subsystem (15); the echo processing subsystem (15) includes a second voice acquisition unit (151) and an echo processing module (152); the second voice acquisition unit (151) includes several voice obtaining modules (1511), audio frequency vibration modules (1512), speech detection modules (1513) and a session control center (1514), each voice obtaining module (1511) being connected to one audio frequency vibration module (1512) and one speech detection module (1513); the speech detection module (1513) is used to detect the audio information of the corresponding voice obtaining module (1511) and send it to the session control center (1514); the audio frequency vibration module (1512) is used to detect the audio vibration information of the corresponding voice obtaining module (1511) and send it to the session control center (1514); the session control center (1514) receives and processes the audio information of the speech detection module (1513) and the audio vibration information of the audio frequency vibration module (1512), and sends the result to the echo processing module (152); the echo processing module (152) receives the audio information, eliminates the echo, and sends the echo-cancelled audio information to the adaptive-filtering module; the adaptive-filtering module receives the audio information of the echo processing module (152) and, after filtering, sends it to the speech analysis processing system (2).

8. The Video Conference Controlling System based on speech recognition according to claim 1, characterized in that the control instruction record system (4) includes an extraction unit (41), an index point generation unit (42), a completion unit (43), and an interaction and editing unit (44); the extraction unit (41) is used to extract, at each of multiple time points on the meeting timeline, the instruction information of each meeting-place based on a configuration file, where the meeting timeline is associated with the meeting time and the configuration file is used to define the instruction information of the meeting; the index point generation unit (42) is used to combine the instruction information of each meeting-place into a key index point, which serves as an index point for interacting with or editing the instruction record; the completion unit (43) is used to combine the multiple key index points corresponding to the multiple time points into the instruction record; the interaction and editing unit (44) is used to interact with or edit the instruction record according to the key information in the instruction record.