
1 Introduction

In an increasingly connected world, people's search for information and content producers' concern with reaching all audiences grow every day. In such a world, the need to make information and content accessible to everyone is also growing. For a long time, researchers have sought to develop methods and applications that make the lives of people with some type of disability simpler and more integrated with the rest of society.

A 2018 survey showed that, even today, the main source of news in the US is television [1]. Figure 1 shows the statistics of the main sources of news for US viewers, where television is the main source, followed by news on the internet. Even though people use the internet more and more in their daily lives, for the purpose of keeping up with the news they still choose television stations as their main source. This shows that virtual sources of information still need to earn the trust of a large part of the content-consuming public and, in the meantime, television broadcasters and the television sets themselves need to evolve to become increasingly accessible to all audiences.

Fig. 1. Statistics of the main source of news for the US viewers [1]

Nowadays, several TV broadcasters transmit content with a sign language interpreter embedded in the main video. The benefit of this approach is that the interpreter is already synchronized with the video, so no synchronization tool is necessary on the receiver side. On the other hand, as the interpreter is embedded in the media content, the user does not have the option to enable or disable this element. Besides that, there is no regulation about the position and size of the interpreter to be used, at least in Brazil. Thereby, each broadcaster uses its own standards, which compromises the experience of the viewer and, in some cases, when the interpreter is too small, prevents access to the content.

TV manufacturers, together with the broadcasters, are working on different approaches, which often run into the difficulty of adding content on the broadcaster side and, in other cases, the difficulty for the receiver manufacturer of providing more memory to store the set of images needed for each sign language gesture. In addition, it is worth remembering that each language has its own sign language version, so storing the gestures for several languages seems impracticable.

One of the technologies that has evolved the most and has shown the most promise for solving problems in the most diverse areas is machine learning. The ability of certain algorithms to extract knowledge from data in ways unimaginable for humans has enabled their use in several areas of knowledge to solve problems that were previously extremely difficult to handle. Practically every industry sector has been trying to embed concepts and solutions based on artificial intelligence and machine learning in its products in order to add value and bring a better experience to users.

With that in mind, this paper presents a machine learning-based solution for the detection, segmentation and magnification of a sign language interpreter in live broadcast content. This article also seeks to encourage the scientific community to research and develop solutions that improve the lives of people with disabilities. In addition, an attempt is made to present a solution that integrates simply with TV equipment and can be applied to the most diverse models.

The paper is organized as follows. Section 2 presents relevant background information. Section 3 presents the most relevant related works. Section 4 presents the proposal itself, explaining how it works. Section 5 shows the results achieved. Section 6 concludes this paper and proposes some topics for future development.

2 Background

This section presents some background information relevant to this paper. The following subsections cover aspects of digital television, sign languages and machine learning.

2.1 Digital TV

Digital television is a broadcasting system capable of transmitting digital signals, which provide better signal reception, higher resolution and better sound quality when compared to the previous analogue signal. Furthermore, digital TV has tools that allow the integration of the broadcast signal with data transmission and internet access [2].

In the past, televisions were dedicated devices developed exclusively for the reception and processing of broadcast signals. Nowadays, televisions have become real computers, with processors, an operating system, input and output interfaces, and so on. This change allows today's televisions to perform tasks never before imagined, such as internet access, real-time interactive systems [3,4,5,6] and access to streaming platforms, and at the same time it enables the development of applications aimed at people with disabilities [4, 7].

There are three major digital TV standards in the world, namely ATSC, DVB and ISDB; others were developed based on these three. A standard defines some fundamental characteristics for the functioning of the TV system [8], such as the transmission mode, the transport mode, the compression system and the middleware (interactive software) features [3].

Figure 2 shows the schematic of a digital TV broadcasting system.

Fig. 2. Schematic of a digital TV broadcast system [9]

The multimedia content to be transmitted is divided into video and audio packets. Both are then encoded using the method defined by the standard (some MPEG-2 or MPEG-4 variant). After that, these packets are broken into pieces and multiplexed with data packets into a single transport stream, where each part is identified and synchronization data is also inserted. The next step is the modulation of the signal, where the transport stream is converted into an electrical signal whose frequency respects the range assigned to the broadcaster that will transmit it. In this step, error correction codes are used to improve the reliability of the information.

On the viewer side, the signal is received by the antenna and demodulated back into the transport stream. Each data packet is analysed and separated by type, reassembling the audio, video and data content. The audio and video are decoded and displayed on the TV according to the synchronization data. Some of the data content may be part of the interactive system and will be processed by the middleware. The same system can also receive or send additional data through the internet, in a process called broadband.

2.2 Sign Language

Sign language is a non-verbal language that uses hand and body movements, including postures and facial expressions, used by and with people who are deaf or unable to speak [10, 11]. Contrary to what many people think, there is no universal sign language. Just like spoken languages, each country or region has its own sign language with its own peculiarities and characteristics. Furthermore, sign languages are not based on oral languages. Besides having their own structure and grammar, sign languages are capable of conveying even the most subtle and complex meanings, just as oral languages are [12].

Just like spoken languages, sign languages have different dialects within the same language, based on geographic region, social group and even gender. In addition to that, situational dialects are also found in sign languages; for example, the signing style used on a formal occasion differs from the one used in a casual/informal situation [12].

Because of such complexity, it is very hard to reproduce the behaviour of a real person using sign language. Several works [13,14,15,16] have already tried to develop an artificial sign language interpreter. Such proposals are valid; however, none has yet been well received by the deaf public, since most applications are limited to simulating gestures and are not able to perform the movements and facial expressions necessary for correct communication with deaf people.

2.3 Machine Learning

Machine learning consists of using computational models that have the ability to learn to identify and classify certain characteristics or scenarios from large amounts of data or from a specific environment. There are three main categories of learning algorithms: supervised, unsupervised and reinforcement learning. In supervised learning, the main objective is to learn a mapping between the input and output data given a set of input-output pairs [17]. After the model is executed on the provided data, its internal weights are updated to correct its errors of judgment. In unsupervised learning, only the input information is provided to the model, and it must be grouped according to similar characteristics, in a process known as clustering. The main goal here is to identify interesting patterns in the provided data [17]. In reinforcement learning, the system freely interacts with the environment using some predefined moves and, by receiving evaluations of its results through rewards or punishments [17], adjusts its internal weights and learns how to behave in each situation.

The solution proposed here works with image and video content with the objective of detecting and segmenting an area of the image. Thereby, the best-fitting algorithm category is supervised learning. Next, some algorithms within this scope will be presented.

Haar Cascades. The Haar Cascade algorithm is a supervised machine learning model that uses classical computer vision concepts. Several small image filters, called haar features, must be defined by the developer in order to detect the desired forms and shapes in an image, such as lines, contours and so forth. Once these features are defined, the classifier section of the model is trained using positive and negative samples of the desired element. Figure 3 shows some examples of haar features and their use on a real image.

Fig. 3. Examples of haar features and its use on images

A Haar Cascade algorithm is composed of hundreds or thousands of these haar features. If all of them were applied to an image at once, the process would be highly time consuming. To save time, the algorithm is implemented in a cascaded way, where in each step only a few of these features are tested. If the image fails a step, it is immediately discarded and does not run through the remaining steps. If an image passes all the steps, a match has been found [18]. Figure 4 shows an image to which a face detector and an eye detector haar cascade algorithm were applied.

Fig. 4. Example using Face and Eye Haar Cascade detector
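To make the cascade idea concrete, the sketch below (illustrative C++, not OpenCV's actual implementation) shows how a candidate image window is rejected as soon as any stage of the cascade fails, so that only promising windows pay the full evaluation cost.

#include <functional>
#include <vector>

// A candidate image window (the integral-image data used by the real
// algorithm is omitted for brevity).
struct Window { int x, y, width, height; };

// Each stage applies a small subset of haar features and returns false
// when the window clearly does not contain the target object.
using Stage = std::function<bool(const Window&)>;

bool cascadeMatch(const Window& window, const std::vector<Stage>& stages)
{
    for (const auto& stage : stages) {
        if (!stage(window))
            return false;   // early rejection: the remaining stages never run
    }
    return true;            // the window passed every stage, so a match was found
}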

The main advantage of this algorithm is its capability of performing the training process with few data samples. As the feature extraction is defined manually, only the classifier is trained, so less data is needed. Beyond that, once trained, a haar cascade algorithm generates a lightweight XML file containing all the data needed to run. On the other hand, as the extraction of characteristics is limited, its ability to generalize is compromised.

For a long time, this was the main algorithm for computer vision systems. With the improvement of processing technology and the availability of huge amounts of data, deep learning models took its place as the state of the art in the computer vision scenario.

Deep Learning. Deep learning algorithms are currently the state of the art when it comes to systems for vision, classification and detection of characteristics in images or videos. Deep learning is a sub-area of machine learning, which in turn is a branch of artificial intelligence. These computational models are characterized by the ability to extract by themselves the attributes necessary for the classification and detection of the objects to be identified, unlike traditional machine learning systems, where the classifiable properties need to be extracted and presented to the classifying model by the developer.

The techniques that allow deep learning models to function are not new [19, 20]; however, due to several impediments, among them computational capacity and the amount of available data, they only became popular and viable from 2012, when the AlexNet network [21] won the ILSVRC, a famous computer vision competition. Since then, increasingly sophisticated models have been developed, allowing these systems to surpass human capacity for image classification.

The deep learning models used in computer vision are mostly convolutional neural networks (CNNs), networks basically composed of trainable filters, also known as kernels, and feature-condensing (pooling) layers, in addition to one or more classification layers at the output (dense layers), composed of fully connected artificial neurons [22, 23]. Figure 5 shows the described structure for a basic CNN. Among the models that employ deep learning for computer vision solutions, four major areas of application can be identified, namely recognition/classification, detection, semantic segmentation and detection with segmentation. All of them are supervised learning models.

Fig. 5. Convolutional neural network model [22]
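As an illustration only (the proposed system does not use a CNN), the following sketch shows the basic convolution-pooling-dense structure described above, written with the LibTorch C++ API; the layer sizes and the assumed 64x64 RGB input resolution are arbitrary choices for the example.

#include <torch/torch.h>

// Minimal convolutional network: trainable filters (kernels), pooling layers
// that condense the extracted features, and a fully connected output layer.
struct SimpleCnn : torch::nn::Module {
  torch::nn::Conv2d conv1{nullptr}, conv2{nullptr};
  torch::nn::Linear fc{nullptr};

  SimpleCnn() {
    conv1 = register_module("conv1",
        torch::nn::Conv2d(torch::nn::Conv2dOptions(3, 16, 3).padding(1)));
    conv2 = register_module("conv2",
        torch::nn::Conv2d(torch::nn::Conv2dOptions(16, 32, 3).padding(1)));
    // Assumes 64x64 RGB inputs, pooled twice down to 16x16 feature maps.
    fc = register_module("fc", torch::nn::Linear(32 * 16 * 16, 2));
  }

  torch::Tensor forward(torch::Tensor x) {
    x = torch::max_pool2d(torch::relu(conv1->forward(x)), 2);  // feature extraction + pooling
    x = torch::max_pool2d(torch::relu(conv2->forward(x)), 2);
    x = x.flatten(1);       // condense the feature maps into a vector
    return fc->forward(x);  // dense classification output
  }
};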

Deep learning algorithms demand an immense amount of data for satisfactory training results, since both the feature extraction and the classification steps are trained. In addition, this process requires a lot of time and computational processing power. Beyond that, a trained CNN model is normally a very heavy file, sometimes reaching hundreds of megabytes. On the other hand, this type of solution achieves high accuracy and generalization capacity.

Table 1. Related works table

3 Related Works

This section presents some works related to this paper. Table 1 shows an overview of the selected related works.

The paper [24] presents a solution using the Microsoft Kinect, CNNs and GPU acceleration to recognize 20 Italian sign language gestures. The dataset used was ChaLearn Looking at People 2014: Gesture Spotting, and the system achieved 91.7% accuracy. In [25], a Faster R-CNN model was used to perform the segmentation and identification of Bangladeshi sign language gestures in real time, with an average accuracy of 98.2%. The work presented in [26] shows an approach based on Principal Component Analysis (PCA) and K-Nearest Neighbors (KNN) to recognize Pakistani sign language in images, achieving an accuracy of 85%. The paper [27] proposed a deep learning CNN + RNN model to recognize Chinese sign language gestures in videos, with results showing an accuracy between 86.9% and 89.9%. Finally, [28] presents a CNN-based approach to identify 100 different Indian sign language static signs in images, achieving accuracies of up to 99%.

4 Methodology

Based on the aforementioned background and discussions, this section presents our proposal for sign language interpreter detection and segmentation in live broadcast content. The parts that make up this work will be presented and explained, and some of the choices made will be justified.

First of all, it is necessary to define the technique used to detect the sign language interpreter on the screen. The techniques presented in the previous section were the Haar Cascade algorithm and a convolutional neural network model. A CNN achieves higher accuracy and generalization capability; on the other hand, it needs a dedicated dataset to be trained and a large amount of space in the device's data storage. Firstly, no dataset that met the application's needs was found and, secondly, TV devices have very limited flash memory available, making it difficult to add content that requires a large volume of space. Because of that, in a first approach, the Haar Cascade algorithm was chosen for this system.

The modern television set, whose internal structure and functioning are practically identical to those of a computer, has support for most of the available multi-platform libraries. For this reason, the availability of a haar cascade face detector was verified within the OpenCV library. The library provides a frontal face haar cascade detector, able to return the positions of all frontal face matches in an image.
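A minimal sketch of this detection step is shown below, assuming OpenCV's stock frontal face cascade file and illustrative detection parameters; it is not the exact production code of the system.

#include <opencv2/imgproc.hpp>
#include <opencv2/objdetect.hpp>
#include <vector>

// Runs the OpenCV frontal face haar cascade on a captured video frame and
// returns the bounding boxes of every frontal face found in it.
std::vector<cv::Rect> detectFrontalFaces(const cv::Mat& frame)
{
    static cv::CascadeClassifier faceCascade("haarcascade_frontalface_default.xml");

    cv::Mat gray;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::equalizeHist(gray, gray);   // improves robustness to uneven lighting

    std::vector<cv::Rect> faces;
    // scaleFactor, minNeighbors and minSize are assumed, empirically tuned values.
    faceCascade.detectMultiScale(gray, faces, 1.1, 4, 0, cv::Size(40, 40));
    return faces;
}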

The frontal face detector by itself is not enough to identify a sign language interpreter in the image, since it will return all the frontal faces in it. This result needs to be refined in order to identify the correct matching face. For this, a more in-depth study of the patterns of use of interpreters needed to be carried out.

After analyzing a large amount of television content, the most common patterns of use of sign language interpreters in the transmitted content were identified. Figure 6 shows these patterns. The vast majority insert the interpreter in one of the corners, with the bottom right corner being the most used. An interesting detail perceived in this analysis was the use, by some broadcasters, of the interpreter on the left side, in a more central position on the screen, as shown in the bottom right image of Fig. 6.

Fig. 6. Most common sign language interpreter positioning on screen

Beyond the most commonly used patterns, some interpreters that are quite different from the usual standard were identified. Figure 7 shows an example of such a pattern, which uses a more prominent interpreter, occupying a considerable part of the image and having the full height of the television screen. As this type of interpreter already has considerable size and prominence and much greater visibility than the others, its treatment was disregarded in the development of the system.

Having identified the most used positioning patterns for sign language interpreters, these positions were mapped in order to enable the identification of the frontal faces in an image with the greatest potential of effectively being an interpreter.

With all this information at hand, the processing flowchart for the method was prepared. Figure 8 shows the flowchart with all the steps needed to execute the solution.

Fig. 7. Out of standard sign language interpreter

Fig. 8. Flowchart for the proposed solution

The processing begins when the system receives a new video content stream. This content is the already decoded data (see Fig. 2), where the video is ready to be displayed. At this point, a sample frame of the video is taken for analysis. The frontal face detector is executed on this screenshot image, returning a matrix containing all the frontal faces identified in it. As previously explained, identifying a frontal face is not a guarantee of the presence of a sign language interpreter. Because of that, each area identified by the face detector is checked against the mapped areas most likely to contain an interpreter.

After that, only the matches are kept. Each mapped area received a probability value, with the bottom right area having the greatest one. The remaining areas are then verified and the one with the greatest probability is chosen. At this point, if no area remains, it is considered that no interpreter is present in the image. Otherwise, the selected area is passed to the image controller system of the television, together with the desired zoom for the interpreter.
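The sketch below illustrates this selection step under assumed region definitions and probability values; the real mapped areas came from the content analysis described earlier and may differ.

#include <opencv2/core.hpp>
#include <optional>
#include <vector>

// A mapped screen area where an interpreter is likely to appear, in
// normalized coordinates, together with its assigned probability.
struct MappedArea {
    cv::Rect2f region;
    float probability;
};

// Illustrative values only (bottom right, bottom left, central left side).
static const std::vector<MappedArea> kMappedAreas = {
    {{0.78f, 0.60f, 0.22f, 0.40f}, 0.9f},
    {{0.00f, 0.60f, 0.22f, 0.40f}, 0.6f},
    {{0.00f, 0.25f, 0.22f, 0.45f}, 0.4f},
};

// Returns the detected face most likely to be an interpreter, or nothing
// when no face falls inside a mapped area.
std::optional<cv::Rect> selectInterpreter(const std::vector<cv::Rect>& faces,
                                          const cv::Size& frameSize)
{
    std::optional<cv::Rect> best;
    float bestProb = 0.0f;
    for (const auto& face : faces) {
        for (const auto& area : kMappedAreas) {
            const cv::Rect mapped(static_cast<int>(area.region.x * frameSize.width),
                                  static_cast<int>(area.region.y * frameSize.height),
                                  static_cast<int>(area.region.width * frameSize.width),
                                  static_cast<int>(area.region.height * frameSize.height));
            // A face counts as a match when it lies entirely inside the mapped area.
            if ((mapped & face) == face && area.probability > bestProb) {
                bestProb = area.probability;
                best = face;
            }
        }
    }
    return best;
}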

The image controller is the software on the TV that controls everything that is displayed on the screen. Each manufacturer has its own solution for it, and it is not the purpose of this paper to detail it. After that, the content to be displayed is built, positioning the original content together with the magnified interpreter. This step is also not the focus of this paper, because each manufacturer can solve it in its own way and with its own desired layout. The final step is to show the content on the screen, containing or not the magnified sign language interpreter, depending on the result of the processing.

5 Results

This section presents the results achieved in the development of this work.

The system described in the previous section was implemented on a Linux-based operating system running on a dedicated hardware platform. A platform-native C++ application was developed to implement the described method. Figure 9 shows an example of the analysed video content, where the sign language interpreter was identified and later magnified.

After implementation, the system was tested using live broadcast content from different parts of the world, to verify its ability to detect the most diverse sign language interpreters. The system had a requirement to achieve a detection success rate of at least 80%. The tests showed that the proposed method achieved the required accuracy.

Regarding the execution time, the method was tested on several hardware platform models, with lesser or greater computational capacity. The tests showed that the system took 2 to 3 s, depending on the platform used, to perform the verification and extraction of the sign language interpreter. As the developed application does not demand real-time verification and updating of the interpreter position, the time obtained was within the required limits.

Fig. 9. Resulting video content generated by the method

The method proved to be easy to integrate and adapt regardless of the platform used. On all the tested platforms the system behaved in a similar way and with good repeatability.

6 Conclusion

People's search for information and content producers' concern with reaching all audiences grow every day, and with them the need to make information and content accessible to everyone. The sign language interpreters available in some TV content are sometimes not adequate for a proper appreciation of the displayed content.

This article presented a machine learning-based method intended to identify the presence and the position of a sign language interpreter on the screen, take an instance of it and make it available for magnification, providing greater accessibility and an improved TV watching experience for people with special needs.

The obtained results show that the method achieved the sought accuracy, which was greater than 80%, with an execution time between 2 and 3 s. As previously stated, for the application in question the execution time was not a critical factor. However, as a future improvement, an upgrade of the system's performance in relation to execution time would be interesting, since time and processing demand are directly linked.

Future works could include the creation of a dedicated sign language interpreter database and the development of a deep learning convolutional neural network based solution to improve accuracy and reduce time consumption.