1 Introduction

Communication is a crucial life skill (Light and McNaughton 2012). It provides people with the ability to express their needs and feelings, and to establish bonds with others (Light 1989). However, people with various conditions, such as cerebral palsy (CP) and Autism Spectrum Disorders (ASD), can follow different communication development trajectories (Light and McNaughton 2012; Pennington 2008; Mitchell et al. 2006). CP is a common neurodevelopmental disorder with an incidence of 1.5 per 1,000 in Australia (Galea et al. 2019), and about one-third of people with CP are non-verbal (Nordberg et al. 2013). The population with ASD is larger than that with CP: about one in 160 children worldwide is diagnosed with ASD (GBD 2015), and an estimated 30% of people with ASD have limited communication abilities even after age five (DiStefano et al. 2016). People with CP, ASD and other intellectual disabilities may have complex communication needs (CCN) and can experience difficulties and challenges in communicating with other people (Light and Drager 2007; Light and McNaughton 2012). Augmentative and Alternative Communication (AAC) can therefore enhance and leverage the existing communication capabilities of many people with CCN (Light and Drager 2007; Light and McNaughton 2012).

AAC is a set of approaches that aims to assist or replace conventional communication methods for people with CCN (Murray and Goldbart 2009). It spans a wide range of methods, from no-tech approaches such as sign language to high-tech approaches such as computer-based speech synthesis systems (Murray and Goldbart 2009). In recent AAC research, eye-tracking and Brain-Computer Interfaces (BCI) are two popular and emerging access methods and interfaces to AAC devices (Higginbotham et al. 2007). Eye-tracking technologies utilise eye features, such as movement, orientation, fixation and gaze, which are interpreted and turned into control signals to instruct AAC devices (Lai et al. 2013). This approach is motivated by the natural human action of gazing at an Area of Interest (AOI) (Lai et al. 2013); such a fixation can be interpreted as a directed intention to control the AAC system (Lariviere 2015). The minimal physical movement required for eye-tracking makes it a reliable method of communication, especially for people with severe disabilities (Karlsson et al. 2018). A large body of research has demonstrated the effectiveness, usefulness and usability of this technology (Mele and Federici 2012). For people who, due to nystagmus or other conditions, cannot control their eye movements, BCI with direct brain access may be an alternative access method (Allison et al. 2007). Rather than using eye movements, BCIs use brain signals as inputs to closed-loop feedback control systems (Bulárka and Gontean 2016). BCIs can be classified by sensor modality and control signal type (Ramadan and Vasilakos 2017). Amongst the different combinations, the P300-based BCI is one of the most commonly used (Wu et al. 2019). The P300 BCI is based on a physiological brain response recorded in the electroencephalogram (EEG) as an amplitude shift approximately 300 milliseconds after an external stimulus (Spencer and Polich 1999). This stimulus can be auditory or visual, such as an oddball sound or oddball flash, respectively (Başar et al. 1984; Yagi et al. 1999). The conventional use of P300 is the BCI speller (Rezeika et al. 2018): the speller flashes the alphabetic keys in random order, and the algorithm matches the flash times with the P300 appearance times to select the corresponding letter (Lotte et al. 2018). The P300-based BCI can also be utilised in AAC by following the same procedure, the difference being that words are selected from a communication board instead of a speller (Scherer et al. 2015).

State-of-the-art AAC devices are increasingly based on portable computers such as mobile phones, tablets, laptops and game consoles (Clarke and Price 2012). A common feature of these devices is an external screen display that can easily be seen by others, which puts the privacy of users in a vulnerable position. Additionally, even though the devices are somewhat portable, they still need to be mounted on wheelchairs (Karlsson et al. 2018). The portability and adaptability of current designs are thus questionable (Cooper et al. 2009; McNaughton and Bryen 2007). Nevertheless, our previous study indicated that Mixed Reality (MR) may have merits in resolving the above-mentioned issues (Zhao et al. 2021). MR is sometimes treated as an extension of Virtual Reality (VR) (Rokhsaritalemi et al. 2020). VR is a technology that places users in an immersive environment where they can experience virtual worlds as realities (Sherman and Craig 2003), whilst MR is a hybrid environment that mixes reality with virtual objects (Tang et al. 2020). The other reality technology is Augmented Reality (AR) (Lungu et al. 2021). The taxonomy of these three technologies is controversial, but they are all immersive environments (Liberatore and Wagner 2021). Although the three technologies share similar concepts, they differ substantially in presentation. The major difference is how they incorporate the real-life environment. VR provides a completely immersive environment in which the user cannot see the real-life background, whilst AR glasses are usually transparent and can overlay information or virtual objects on the real-life background (Carmigniani and Furht 2011). Meanwhile, in MR, the virtual objects have inherent physical properties and can be interacted with as if they were real objects in real life (Lungu et al. 2021).

Several studies have investigated VR- and AR-based AAC. An AR-based alphabet and number selection game was introduced by Antão et al. (2020): a laptop with a front-facing camera captured the participant's body movement, the system presented random alphabet and number keys around the participant, and the device analysed the body movement as an input signal. The key that the participant pointed to was selected and displayed on the laptop screen. Weaver et al. (2005) designed a wearable eye-tracking-based typing system that enabled people to select letters of the alphabet with their gaze. Similarly, Wołk (2019) proposed an AR-based AAC system for medical communication, aimed at helping people with a speech impairment to request medical support or describe their status; 50 participants were involved in the trials, and all of them reportedly completed their assigned tasks. For BCI-embedded VR-based AAC, Käthner et al. (2015) compared BCI performance across display options, including a conventional desktop display and a VR display. A comparable level of accuracy and Information Transfer Rate (ITR) was achieved by both types of displays, and the authors pointed out that the system was feasible for a person in a Locked-in State (LIS). A few studies have also investigated BCI-based VR/AR for purposes other than communication. Ortner et al. (2012) detected participants' Motor Imagery (MI) and reflected it in VR glasses; this device was found to be useful for movement rehabilitation. Besides movement rehabilitation, BCI-based VR/AR can also put participants in control of virtual objects as well as smart home devices. Another example comes from Amaral et al. (2017) and Amaral et al. (2018), who explored the social attention effects of a BCI-based VR intervention among people with ASD. The mechanism of their device was very similar to an AAC device: participants were asked to select one target virtual object out of eight candidates using their P300 BCI signals. Had the virtual objects been replaced by communication board pictures/words, the device could have been used for AAC purposes. In addition, a performance increase with long-term use of a BCI-based VR system was reported by Hashimoto et al. (2010), suggesting that practice helps people adapt to BCI controls in VR systems.

None of the above-mentioned studies considered an MR-based AAC system. In this paper, we hypothesise that MR-based AAC holds promise for breaking down the boundary between real life and the virtual world. We therefore propose a novel design combining MR, eye-gaze tracking and BCI into a wearable AAC (wAAC) system. Moreover, as advocated by Boster et al. (2017), AAC designs should be driven by language development theory, and Holyfield et al. (2017) indicated that AAC devices for adolescents and adults should have capacity beyond basic requests. Hence, rather than a static communication board, our design also integrates a real-time object detection algorithm that automatically recognises daily-life objects and maps them to words on the communication board. Although the proposed device is intended as a universal solution for people with CCN, people with CP were selected as the target group in this study.

2 Methods

This study used a multiple case study design. Several trials were performed with two groups of participants: neuro-typical participants and people with CP. The multiple case study aims to provide insights into the usability and acceptability of using Eye-Gaze (EG) and BCI technology in a wearable MR environment.

2.1 Device setup

2.1.1 Overview

Current AAC devices are often based on a desktop environment and mounted on a wheelchair (Karlsson et al. 2018), which limits privacy and portability. This study proposes a new wAAC interface that utilises EG and BCI technology in a wearable MR environment; the user wears the devices on their head. This study uses pupil movement as the EG input and access method. Rather than eye movement, BCI utilises electrical signals on the scalp surface: it can predict the user's attention by detecting EEG signals. The EEG feature used in this study is the P300, a shift in the EEG at around 300 ms after a visual stimulus, here a gentle flash from the MR display. The proposed device provides these two alternative access methods for the user to communicate.

Before the user was introduced to the tasks, a setup task to either configure the network connection or select a session was performed (Fig. 1). There was no seamless transition between the EG and BCI sessions: for example, once the user entered the EG session, they could not switch to the BCI session without exiting and restarting the app, and vice versa.

Fig. 1

The setup page is the first page shown when the app starts. This picture was captured from the Unity development environment, so there is no real background scene

2.1.2 Communication board

The HoloLens 2 from Microsoft, the second generation of Microsoft's mixed reality products, was adopted as the MR environment provider. The display has a \(52^\circ\) Field of View (FOV) (Xiong et al. 2020) and refreshes at 60 Frames Per Second (FPS). It provided an MR environment (Fig. 2) in which a communication board could always be displayed in front of the user's eyes as they looked around the real environment. The basic communication board had essential words for daily-life purposes, organised as a circle. An additional advantage of the HoloLens 2 was its several cameras, which could detect the surroundings. When daily-life objects were detected inside the word-list circle, a complementary vocabulary of the objects' corresponding words, such as the words apple and cup in the scene below, would be presented. The interface was built with the Unity engine. The main camera, mounted in the central front of the HoloLens 2, was utilised to capture real-time images. In the EG system, the real-time capture was one frame every 10 s, while in the BCI system, the capture happened between word selections.

Fig. 2

A real-scene capture from the trials showing the mixed reality communication board hovering over a background of an office, with an apple and a cup in the scene that were detected by the machine learning/computer vision server and automatically labelled with text descriptions for context-based communication

Once the environment was captured, the image was sent to the machine learning/computer vision server, which ran an object detection system called You Only Look Once version 5 (YOLOv5). For a quicker response, the small model (YOLOv5s) was adopted; its maximum FPS of 455 (Jocher et al. 2020) indicated sufficient capability to respond quickly. Once an object was detected, the word describing it was presented along the eye-ray direction towards the object. Although the HoloLens 2 offered 3D gaze prediction, this feature was not used because, in practice, it lacked robustness in precision. Instead, all words were listed at a constant distance of one metre in front of the participant.
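For illustration, a minimal Python sketch of this detection step is given below, using the public YOLOv5 hub interface; the function name detect_objects and the confidence threshold are illustrative assumptions, not the project's actual server code.

```python
import torch

# Load the small YOLOv5 model once at server start-up
# (the weights are downloaded on first use).
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

def detect_objects(image, min_confidence=0.5):
    """Return (label, confidence) pairs for objects found in one HoloLens frame."""
    results = model(image)                  # run inference on the captured image
    detections = results.pandas().xyxy[0]   # one row per detected object
    return [
        (row["name"], float(row["confidence"]))
        for _, row in detections.iterrows()
        if row["confidence"] >= min_confidence
    ]
```

Each returned label would then be mapped to a word cube on the communication board.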

Each word was contained in a cube (see Fig. 2). Apart from the word cubes, there were longer cubes: a long grey cube displayed the selected words, while the green and red cubes were function cubes. If the participant selected anything wrong, the red Delete button could be used to delete one word from the sentence. When the participant finished the sentence, the green Speak button could be used; once selected, the whole sentence was sent to the text-to-speech server and the speech was generated.

2.1.3 Eye-Gaze system

The HoloLens 2's embedded eye-tracking and EG sensors were used. The original EG data were in a spherical coordinate system. To map the gaze point onto the communication board at its display depth, the following coordinate transformation was applied (Fig. 3):

$$\begin{aligned} K=\frac{\vec {OI}^2}{\vec {OI}\cdot \vec {OA}} \end{aligned}$$
(1)
$$\begin{aligned} \vec {OB}=K\cdot \vec {OA} \end{aligned}$$
(2)
$$\begin{aligned} B=\vec {OB}+O \end{aligned}$$
(3)

where \(K\) is the scaling factor from \(\vec{OA}\) to \(\vec{OB}\).
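For concreteness, a minimal Python sketch of this transformation follows, under the assumption that all positions are given as 3D vectors in the same world coordinate frame:

```python
import numpy as np

def project_gaze(o, i, a):
    """Project the raw gaze point A onto the communication-board plane.

    o: eye position O, i: board centre I, a: raw gaze point A (3-vectors).
    Implements Eqs. (1)-(3).
    """
    oi = i - o
    oa = a - o
    k = np.dot(oi, oi) / np.dot(oi, oa)  # scaling factor K (Eq. 1)
    ob = k * oa                          # scaled gaze vector OB (Eq. 2)
    return o + ob                        # gaze point B on the board (Eq. 3)
```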

Fig. 3

Eye-gaze position transformation process. O is the user’s eye position. I is the centre of the communication board. A is the original gaze position from HoloLens 2 library. B is the user’s real gaze position on the communication board

After this transformation, the gaze was presented as a red dot on the communication board. If the gaze was held over a cube for more than three seconds, the cube was selected, and either the word of the cube was chosen or the function of the cube was triggered.
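The dwell-based selection logic can be summarised with the following Python sketch; the three-second threshold comes from the text, while the class and method names are illustrative (the actual interface was implemented in Unity):

```python
import time

DWELL_SECONDS = 3.0  # gaze must rest on a cube this long to select it

class DwellSelector:
    """Fires a selection when the gaze stays on the same cube long enough."""

    def __init__(self):
        self.current_cube = None
        self.enter_time = None

    def update(self, gazed_cube):
        now = time.monotonic()
        if gazed_cube != self.current_cube:      # gaze moved: restart the timer
            self.current_cube = gazed_cube
            self.enter_time = now
            return None
        if gazed_cube is not None and now - self.enter_time >= DWELL_SECONDS:
            self.enter_time = now                # avoid repeated triggering
            return gazed_cube                    # selection event
        return None
```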

A simple grammar correction was applied to compensate for the limitations of the communication board. The GingerIt software was utilised for grammar correction in this project, which had two major advantages in this specific situation. First, the grammar correction could add missing function words; for example, "I want eat" could be corrected to "I want to eat". Second, the limited word list could be reused; for example, "She do not like TV" could be corrected to "She does not like TV". The simplest grammar correction was adopted because of concerns that complicated corrections might misinterpret the user's intention. The pyttsx3 software was used as the speech generation engine in the system. It had several accent options; given our circumstances, an Australian female voice was selected as the speech generation preference.
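A minimal Python sketch of this post-processing pipeline, assuming the gingerit and pyttsx3 packages named above, is shown below; the voice-matching heuristic is an assumption for illustration, not the project's actual configuration code.

```python
import pyttsx3
from gingerit.gingerit import GingerIt

def speak_sentence(raw_sentence):
    # 1. Light-touch grammar correction, e.g. "I want eat" -> "I want to eat".
    corrected = GingerIt().parse(raw_sentence)["result"]

    # 2. Text-to-speech, preferring an Australian English voice if available
    #    (the substring match below is a hypothetical heuristic).
    engine = pyttsx3.init()
    for voice in engine.getProperty("voices"):
        if "en_AU" in voice.id or "Australian" in voice.name:
            engine.setProperty("voice", voice.id)
            break
    engine.say(corrected)
    engine.runAndWait()
    return corrected
```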

2.1.4 Brain–computer interfaces system

g.tec's g.Nautilus SAHARA was used as the EEG capture system. Comparing the performance of BCI systems in VR environments, Amaral et al. (2017) found that the g.Nautilus (g.tec, Austria) with active electrodes outperformed two other EEG caps, the g.Mobilab+ (g.tec, Austria) with g.SAHARA active dry electrodes and the V-Amp with wired actiCAP Xpress dry electrodes (BrainProducts, Germany). The electrodes along and around the midline of the cap were employed (FP1, FP2, AF3, AF4, Fz, F3, F4, FC1, FC2, CZ, C3, C4, CP1, CP2, PZ, P3, P4, P7, P8, OZ, PO3, PO4). The channel selection was based on comfort considerations while also wearing the HoloLens. The original EEG cap had 32 removable dry electrodes; however, in the developer's experience, 22 selected channels was the maximum number that could be worn without discomfort.

The sampling rate was set to 250 Hz. A notch filter at 48–52 Hz and a band-pass filter of 0.2–60 Hz were applied. The server of the EEG capture system was coded in Python 3.6.5. The scheme of the BCI interface followed Amaral et al. (2017, 2018). The BCI session was separated into two sections: training and classification. In the training section, the participant was asked to focus on 12 circled cubes, one at a time. From the 'back-end' perspective, one flash was called an event. There were two types of events: a target event, when the circled cube was flashing, and a non-target event, when any other cube was flashing. In the training section, ten cubes were randomly selected, nine of them producing non-target events and one the target event. These ten events flashed in random order, which was called an epoch. Eight epochs made up a block, and since there were 12 target cubes, there were 12 blocks in total. Once the training section was finished, the data were transferred to the server, which ran an EEG classification using Linear Discriminant Analysis (LDA) as the classifier; this was found to have a quick response and an acceptable accuracy level in our previous study (Zhao et al. 2019). The eye-tracking data were also recorded for later offline analysis.
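A minimal scikit-learn sketch of this training step is shown below; flattening each channels-by-samples fragment into a feature vector is an illustrative layout assumption rather than the project's documented feature extraction.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_p300_classifier(fragments, labels):
    """fragments: array of shape (n_events, n_channels, n_samples),
    one EEG fragment per flash; labels: 1 for target events, 0 otherwise."""
    X = fragments.reshape(len(fragments), -1)  # flatten each event into a feature vector
    lda = LinearDiscriminantAnalysis()
    lda.fit(X, labels)                         # learn target vs non-target responses
    return lda
```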

The server computer hosted the EEG capture system. As soon as the server received the command from the HoloLens 2, it retrieved the EEG data according to the recorded flashing timestamps. Because timestamps were obtained from both the HoloLens 2 and the server computer, there was a clock difference between the two systems. In many situations this difference could be ignored, as it was less than one second, the limit for the user's flow of thought to stay uninterrupted (Nielsen 1994). In such a time-sensitive task, however, it would invalidate the whole system. Hence, a time synchronisation mechanism was developed. The HoloLens 2 sent its current time to the server, and the server immediately recorded its own current time. Meanwhile, the server sent an acknowledgement signal to the HoloLens 2. As soon as the HoloLens 2 received this acknowledgement, it sent another current time to the server; this was for network delay estimation. The difference between the two timestamps from the HoloLens 2 gave the round-trip time of the network connection. The server then calculated the time difference as:

$$\begin{aligned} \Delta T = T_\mathrm{s} - \frac{T_{\mathrm{h}1} + T_{\mathrm{h}2}}{2} \end{aligned}$$
(4)

where \(\Delta T\) is the time difference, \(T_\mathrm{s}\) is the server's local time, \(T_{\mathrm{h}1}\) is the first timestamp received from the HoloLens 2, and \(T_{\mathrm{h}2}\) is the second timestamp received from the HoloLens 2.
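This is the classic NTP-style offset estimate; a minimal Python sketch (variable names are illustrative) is:

```python
def clock_offset(t_h1, t_s, t_h2):
    """Estimate the clock offset between server and HoloLens 2 (Eq. 4).

    t_h1: HoloLens time when the sync request was sent,
    t_s:  server time when that request was received,
    t_h2: HoloLens time when the acknowledgement arrived.
    """
    round_trip = t_h2 - t_h1          # network round trip seen by the HoloLens
    offset = t_s - (t_h1 + t_h2) / 2  # Eq. (4): server clock minus round-trip midpoint
    return offset, round_trip / 2     # offset and one-way delay ("ping") estimate
```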

After offsetting the time difference, the EEG data were split into 1.4 s fragments and combined with the P300 event information into a BCI data set, which was then fitted to the LDA model as training data. The server sent an acknowledgement message to the HoloLens 2 once training was finished. The HoloLens 2 then started BCI sessions in which all words flashed eight times in random order; synchronisation was performed again before each session. Whilst the word/function cubes were flashing, the server ran the LDA classifier in the backend to detect whether a P300 was present. The classifier analysed each 1.4 s fragment and calculated its probability of being a P300 signal. The probabilities from the eight flashes were averaged; the cube with the highest probability was selected, and the server responded with this prediction result.
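Under the same assumptions as the training sketch above, the online selection rule can be summarised as:

```python
import numpy as np

def select_cube(lda, fragments_by_cube):
    """fragments_by_cube: dict mapping cube_id to an (8, n_channels, n_samples)
    array, one 1.4 s EEG fragment per flash of that cube."""
    scores = {}
    for cube_id, fragments in fragments_by_cube.items():
        X = fragments.reshape(len(fragments), -1)
        p300_prob = lda.predict_proba(X)[:, 1]  # probability each fragment contains a P300
        scores[cube_id] = p300_prob.mean()      # average over the eight flashes
    return max(scores, key=scores.get)          # cube with the highest mean probability
```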

2.1.5 System integration

The whole system contained three devices: the HoloLens 2, a computing server and an EEG cap (see Fig. 4). The EEG cap was used only for EEG data capture: it collected EEG data and sent it to the computing server.

Fig. 4

The primary component, the communication board on the client side, contains four elements: basic words, words generated by real-time object detection, function buttons and an eye-gaze indicator. The user can use their eye movement (reflected by the eye indicator) to select words on the communication board. Alternatively, the communication board will flash over the target range; while it is flashing, the server receives EEG signals simultaneously. Once attention is detected on the target word, the word is selected and displayed in the sentence-composing area in the MR glasses. If the "Speak" function button is selected, the whole composed sentence is forwarded to the central server through WiFi, and the server then generates the audio using its processor

The computing server had five functions. First, it received data from the EEG cap and interpreted them into BCI commands with the LDA algorithm. Second, it received camera captures from the HoloLens 2 and labelled the detected objects in the picture; once labelling finished, it sent the label information back to the HoloLens 2. Third, it received the composed sentence from the HoloLens 2 and generated the corresponding audio for playback. Additionally, the server was responsible for the simple grammar correction and the time synchronisation (see Fig. 5 for details).

Fig. 5

The centralised computing system integrates several functions to couple all components together. It provides a series of eye-movement and EEG processors to analyse the user's attention and intention. The captured image is also processed on this server. Additionally, a time-sync algorithm is implemented on the server to smooth communication between the subsystems

The HoloLens 2 provided the primary MR environment that the user directly operated. It hosted the communication board and offered the different interaction options to the user. From a backend perspective, the HoloLens 2 was also the eyes of the system: its camera was in charge of capturing the environment, and its eye-movement sensor was the data source for the EG session.

2.1.6 Key specifications of core components

The HoloLens 2 weighed about 566 g, roughly equivalent to a 2019 iPad Pro. It also had an extendable strap which allowed it to fit most head sizes. In the EG system, only the HoloLens 2 was needed; in the BCI system, the EEG cap was put on first and the HoloLens 2 was placed over it (Fig. 6). All servers ran on a Surface Book 2 with an Intel Core i7-8650U CPU and an NVIDIA GeForce GTX 1060 discrete GPU with 6 GB of GDDR5 graphics memory. Windows 10 Pro was the operating system, with some essential Mixed Reality development services installed. To improve the performance of YOLOv5 and the EEG classification, CUDA v11.2 for Windows was deployed on the laptop.

Fig. 6

A photograph of the developer wearing both the EEG cap with 22 electrodes and the HoloLens 2 placed over the top. Note that both systems are wireless and together create the combined AR-and-BCI wAAC system

2.2 Experiments

2.2.1 Recruitment

Recruitment was conducted in two steps. First, neuro-typical participants were recruited through student portals and social networks at The University of Sydney using flyers. Second, participants with CP were recruited through the Australian CP Register and through the Cerebral Palsy Alliance.

Interested participants were screened over phone/e-mail. Upon making contact, the researchers ensured that participants met the requirements, and meeting details were then arranged. The participant information sheet and consent form were then sent out. A quick pre-screening survey was also prepared, with the participant information sheet linked to it so that participants could learn more about the study. If participants used this survey to check their suitability for the study, the screening questions were not asked again during the email/phone communication; instead, the meeting time and place were discussed. Additionally, the researchers answered participants' questions and provided any details they wanted to know.

There were two types of inclusion criteria, one for neuro-typical people and one for people with cerebral palsy. For neuro-typical people, the inclusion criteria were: 1. Aged 18+ years; 2. The literacy level to understand the words on the communication board. For people with cerebral palsy, the inclusion criteria were: 1. Aged 18+ years; 2. Gross Motor Function Classification System Expanded and Revised (GMFCS E&R) I–V; 3. Manual Ability Classification System (MACS) I–V; 4. Communication Function Classification System (CFCS) I–III; 5. Eye-Pointing Classification Scale I–III; 6. Being able to indicate yes or no on the dichotomous choice screen; 7. The literacy level to understand the words on the communication board. The exclusion criteria were: (1) History of psychosis; (2) History of photosensitive epilepsy.

2.2.2 Experiment preparation

Prior to the participants' arrival, the investigators prepared and set up the experiment environment and devices. The server was started on a laptop, and a wireless hotspot was established to receive signals from the HoloLens 2. The HoloLens 2 was checked for a full charge and connected to the laptop's WiFi. The g.Nautilus was wirelessly connected to the laptop, and a demo app named g.NEEDAccess was run prior to the session to ensure the headset was functioning. Although the EEG cap's instructions stated that it could be used dry, a conductive solution (saline) was found to be helpful in achieving a signal from the EEG cap. In the first 15 sessions, saline was used sparingly; the amount was increased over time to 10 ml to ensure that the EEG signals were optimised, and 10 ml of saline became compulsory in the last six sessions. The saline was a contact-lens solution; no particular brand was specified.

A few daily objects were placed on or beside a plain table as a simulation of a real-life scenario. These items were used in some or all of the sessions: a water cup, an apple, a television and a chair. They would be mapped onto the communication board as vocabulary candidates.

When the participants arrived, an introduction to the project and the experiment procedure was given. The participant information sheets were then provided to the participants and/or their guardians, and consent was sought. Although a guardians' version of the consent form was prepared, it was never used, as all participants were able to give consent themselves. Participants were informed that the session could be terminated at any time. For hygiene purposes, all equipment was cleaned before and after the experiments.

2.2.3 Tasks

The goal of the tasks was for the participant to produce three sentences from these templates: (a) I want [object]; (b) Do you like [object]; (c) I do not like [object]. Nonetheless, these sentences were just examples and not restrictive: if a participant proposed different sentences within the capability of the communication board, those were used as target sentences. Once the sentences were agreed upon by both participant and investigators, the tasks would start. The same sentences were to be repeated twice more (three tries per sentence) to assess the impact of practice. The number of repeats was decided by the participants, who could indicate whether they would like to be invited back for more repeats if there was not enough time on the day.

The same procedures were carried out in both the EG session and the BCI session. Before the EG session, the instructions were demonstrated to the participants. In the EG session, the participants were presented with the communication board. They were asked to initiate by negotiating a sentence with the investigators before they started. If they forgot a word they had selected, they could ask the investigators for assistance; if they preferred to change the word instead, they were asked to specify the words they intentionally changed after the session. As the real-time object detection refreshed every ten seconds and each refresh took up to one second, the participant was directed to wait if the communication board disappeared and refreshed itself within one second. If the object was not detected in that refresh, the participant was asked to wait for the next one. When the device selected the wrong word, the participant was told that they could change it by selecting the "Delete" button. Once they finished the sentence, they could select the "Speak" button for the device to read the sentence out audibly.

In the EG session, the eye orientation and gaze-target position were recorded and analysed. The pointer indicated the user's intention. If the user felt that the pointer did not reflect their focus accurately enough, they could re-calibrate the EG with assistance from the investigator: the HoloLens 2 had a calibration routine integrated with the system, so the investigators would navigate to the calibration page and return the HoloLens 2 to the user to finish the calibration.

Before the BCI session, the instructions were explained to the participants. They were told that the communication board would look similar, with one difference: instead of a pointer, the words would flash in random order. The participants were told that there would be a training session before the selection, whose purpose was explained as the machine needing to learn the participants' brain wave patterns. In the training session, an instruction box directed the participants to look at specific words. The participants were advised to silently count the flashes of the target word; it was explained that this would help them concentrate. When the training session finished, the instruction box guided the participants to wait for a couple of seconds while the server ran the training algorithm. After the algorithm finished running, the instruction box appeared again and notified the participants to start word selection. If the communication board selected a wrong word, the participants were asked to skip it instead of correcting it. Similar to the EG session, the participants were told that they could change a word if they preferred to do so; if they changed a word on purpose, they were asked to specify it after the session.

Participants NT01, NT02 and NT03 reported that they had misunderstood the initial instructions given in their first BCI sessions. Therefore, a demo video was made by recording NT02's successful session. This video was recorded from a first-person perspective, with the participant's profile concealed, and was shown to later participants for demonstration (Fig. 7).

Fig. 7

A screenshot of the first-person-perspective demo video recorded from participant NT02. This is a BCI training-session screenshot; the blue box in the centre is the instruction box

In the EG sessions, if the participants selected the wrong word, they could use the "Delete" function on the communication board to correct their selection. In the BCI sessions, however, this was not required, owing to the low accuracy of the online classification: even if a participant noticed that the system had responded with a wrong selection, their intention to delete might itself be misinterpreted. Therefore, the participants were asked to ignore a wrong selection and skip to the next selection instead.

2.2.4 Survey

Once all sessions were completed, two short surveys were given to the participants to capture their subjective experience of the session. The first survey was designed based on QUEST 2.0 (Demers et al. 2000). The marking scale ran from 1 to 5, where one meant not satisfied and five meant very satisfied. This survey was used to understand whether the device was an acceptable wAAC device for the user. Questions 1–8 and a multiple-selection question were included in this study's survey. Questions 1–8 evaluated the device regarding its dimensions, weight, ease of adjustment, safety and security, durability, ease of use, comfort and effectiveness. The multiple-selection question asked for the three most important factors for a wAAC device. Questions 9–12 were excluded, as they concerned service experience and were not relevant to this study.

The second survey was the NASA-TLX (Hart 1986), which evaluated the task load of using the device. It asked three questions about the tasks regarding their mental, physical and temporal demands. Another three questions concerned the participants' perspective on their self-experienced performance, effort and frustration in completing the tasks. The survey was in a ruler format with 21 gradations on the scale; in our analysis, each gradation counted as 0.5 points. The minimum score was zero, indicating the least demand, least effort and best performance. The maximum score was ten, indicating the most demand, most effort and worst performance. In summary, a lower score meant a lower workload.

2.3 Data analysis

Descriptive statistics, such as sums and means, were applied to the data. The information transfer rate was calculated as the number of selections divided by the selection time, regardless of whether the selections were correct. The accuracy was defined as the number of correct selections divided by the number of target selections, including function button selections such as "Speak" and "Delete". For the EG data, specificity and sensitivity were also calculated. Since this trial was not a clinical trial, the counting definitions were amended to fit our specific situation. True-Positives were correct selections of words and the "Speak" button; False-Positives were wrong selections of words and the "Speak" button; True-Negatives were correct selections of the "Delete" button; and False-Negatives were wrong selections of the "Delete" button. Specificity was calculated as true negatives over the sum of true negatives and false positives, and sensitivity as true positives over the sum of true positives and false negatives.
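These amended definitions reduce to the following small Python sketch, where the counts are tallies of selections as defined above:

```python
def eg_metrics(tp, fp, tn, fn, correct, target):
    """Accuracy, sensitivity and specificity under the amended definitions."""
    accuracy = correct / target   # correct selections over target selections
    sensitivity = tp / (tp + fn)  # word/"Speak" selections made correctly
    specificity = tn / (tn + fp)  # "Delete" selections made correctly
    return accuracy, sensitivity, specificity
```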

However, as there was no delete-and-correct step in the BCI session, specificity and sensitivity were not analysed there. Instead, an offline analysis was performed to explore whether the online classification method was one of the factors that decreased the accuracy level. EEGNet (Lawhern et al. 2018) was found to be the state-of-the-art offline BCI analysis tool in our previous study (Zhao et al. 2020). During the development phase, using all 22 channels was found to give the highest accuracy. In the offline analysis, only the central channels Fz, C3, Cz and C4 and the back channels P3, Pz, P4, PO3, PO4 and OZ were used, and a 30 Hz low-pass filter was applied, referencing the setup protocol of Amaral et al. (2017). A pilot comparison between all-channel and central-channel classification showed that the central channels had higher accuracy. The selection procedure and the comparison of different offline data processing methods are out of this paper's scope, so their details are not included in the results. The Wilcoxon rank-sum test was used to analyse the usability and acceptability differences between EG and BCI, and the Wilcoxon signed-rank test was used to analyse the effects of practice.
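For reference, both tests are available in SciPy; a minimal sketch of how they could be applied here (the function and argument names are illustrative) is:

```python
from scipy.stats import mannwhitneyu, wilcoxon

def compare_sessions(eg_scores, bci_scores, repeat1, repeat2):
    """Run the two tests named above on lists of per-participant scores."""
    # Independent groups (EG vs BCI): the Wilcoxon rank-sum test, exposed in
    # SciPy as the equivalent Mann-Whitney U test.
    u_stat, p_between = mannwhitneyu(eg_scores, bci_scores, alternative="two-sided")
    # Paired repeats by the same participants (practice effect):
    # the Wilcoxon signed-rank test.
    w_stat, p_within = wilcoxon(repeat1, repeat2)
    return (u_stat, p_between), (w_stat, p_within)
```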

3 Results

3.1 Participants and sessions

3.1.1 Demographics

There were eight neurotypical participants and two participants with cerebral palsy in the study. The age range for the neurotypical participants was 24–44 years (mean = 29.13, SD = 6.31), and the age range for the participants with cerebral palsy was 23–32 years (mean = 27.5, SD = 4.5). Six of the neurotypical participants were male and two were female; of the participants with CP, one was male and one was female. Six of the neurotypical participants stated that they had never used any type of AAC before; one had tried both EG- and BCI-based AAC, and another had tried only BCI-based AAC. Of the participants with CP, one had never used any AAC device, while the other reported having tried BCI-based AAC before through involvement in another BCI study. Both participants with CP were Level I for GMFCS and CFCS, indicating that they were independent of guardians/parents/support workers regarding mobility and speech. For the MACS, both were Level II, suggesting that they could handle most objects but with reduced quality and/or speed. Both were able to communicate verbally, and they volunteered to trial the system and give feedback from the general perspective of people with CP.

3.1.2 Session information

All participants finished at least two repeats of both the EG and BCI sessions. Both participants with CP had three repeats, whilst five neurotypical people repeated their sessions three times (Table 1). The repeats of the EG sessions were all finished on the same day with a short break in between. The average time for a single EG session was 40.11 s.

Table 1 EG session information

Twenty-two sessions were excluded from the results. Participants NT01, NT02 and NT03 reported that they had misunderstood the instructions in their first BCI sessions, so these first four sessions were excluded. NT01 also reported being distracted in the training part of their second session, which was likewise excluded. Additionally, sessions affected by bugs were excluded from the analysis. The system bugs were: (A) the cubes generated by real-time recognition did not always flash during BCI selections; (B) when a cube generated by real-time recognition was selected, the app might crash and exit. Bug A occurred in two sessions; when it presented, the participants selected other, flashing cubes. Bug B occurred in 18 sessions, including the two sessions with Bug A; when it presented, the participants thought it was the end of the session and stopped operating. These sessions were excluded from the result analysis but, for completeness of reporting, the conditions are documented for future reference. Repeated BCI sessions were performed on different days, since a BCI session was much more time-consuming than an EG session; most participants were asked to pause and continue the repeats on another day.

Although the EEG cap's manual stated that no conductive medium was needed, the investigators found otherwise during the trials. Despite the high accuracy experienced in the testing phase, significant accuracy drops were detected in the real trials. The use and amount of saline spray were initially determined by the participants in the earlier BCI sessions: in the first five sessions, the participants chose not to use saline spray at all; in the next 10 sessions, saline spray was used but the amount was determined by the participants; and in the last seven sessions, the amount was set to at least 10 ml for better conductivity. See Table 2 for details. These conditions were grouped into six subsets: Conditions O-I, O-II, O-III, A, B and C (Table 2). Conditions O-I to O-III are not included in the later results because system bugs were detected in them.

Table 2 BCI session information

3.2 Accuracy and information transfer rate

3.2.1 Overview

The average accuracy for the EG system was 93.30% (Fig. 8), with a corresponding sensitivity of 98.78% and specificity of 50%. The average accuracy for the BCI system was 5.91%, which could be attributed to the rough time-difference estimation and the nature of the linear classification algorithm; this is discussed further in the next section. The offline analysis result for BCI was 9.28% (Fig. 8). The chance level varied with the number of cubes generated by real-time recognition: the minimum chance level was 3.85% and the maximum was 4.35%. The Wilcoxon rank-sum test of accuracy between EG and BCI (online) gave U = 0, p-value = 1.621e−4 (at a 95% confidence level and sample size of 10, the U critical value is 23). The test showed a significant difference between the EG and BCI sessions.

Regarding the BCI analysis methods, the Wilcoxon signed-rank test of accuracy between online BCI and offline BCI gave Z = 1, p-value = 7.96e−2 (at a 95% confidence level and sample size of 10, the Z critical value is 8). Hence, the different algorithms used in the online and offline classification tasks were considered to contribute to the accuracy differences.

Fig. 8

Average accuracies of the Eye-Gaze (EG) system and the online and offline Brain-Computer Interface (BCI) analyses

The average ITR for the EG system was 8.55 sels/min (6.52 sels/min for the CP group and 9.42 sels/min for the NT group). Meanwhile, the BCI system reached only 1.13 sels/min (1.14 sels/min for CP and 1.13 sels/min for NT). The Wilcoxon rank-sum test of ITR between EG and BCI (online) gave U = 0, p-value = 1.827e−4 (at a 95% confidence level and sample size of 10, the U critical value is 23). The test showed a significant difference between the EG and BCI sessions.

During the trials, all participants said they were impressed by the accuracy and efficiency of the EG system. In contrast, they reported frustration with the low accuracy they experienced in the BCI session.

3.2.2 Practice effects

All ten participants repeated the EG session at least twice, with seven of them repeating three times. The average accuracy and ITR increased with repeats (Fig. 9).

Fig. 9

Both accuracy and Information Transfer Rate (ITR) increased after repeating more times in Eye-Gaze (EG) session

With a sample size of 10 and a 95% confidence level, the critical value of the Wilcoxon signed-rank test is 8. The results showed that the performance improvement between Repeat 1 and Repeat 2 was not significant. However, the p-value decreased with more repeats, indicating that the differences became significant after further practice. With a sample size of 7 and a 95% confidence level, the critical value is 2; therefore, the third repeat showed a significant improvement over the second. The comparison between Repeat 1 and Repeat 3 also confirmed that repeated practice enhanced the user's performance with the EG system (Table 3).

Table 3 Wilcoxon signed rank test (EG Session)

Only four participants repeated the BCI session three times, because the BCI tasks were time-consuming. The selection time of a BCI session depended chiefly on the number of cubes and the classification algorithm; therefore, practice could not affect the ITR. Although the accuracy might have differed, due to the time synchronisation and algorithm issues the average accuracy was sometimes below the chance level (Fig. 10). As a result, statistical analysis was not applied to the repeated BCI sessions.

Fig. 10

The accuracies of the Brain-Computer Interface (BCI) sessions were all only slightly higher than chance level. Additionally, only four participants finished three repeats, so statistical analysis was not performed here. Repeat 1 had 10 participants, Repeat 2 had 7, and Repeat 3 had 4

3.2.3 Bugs and protocol changes effects

Similar to the performance comparison among repeats in the BCI session, the comparison of BCI accuracy under different conditions was also moot due to the low overall accuracy level (Fig. 11). The accuracy under each condition was only slightly higher than the chance level (3.85–4.35%). The reasons behind the low accuracy of the BCI system are discussed further in the next section.

Fig. 11

Both online and offline Brain-Computer Interfaces had different accuracies under different conditions. However, due to the overall low accuracy level, the correlation between accuracy and conditions became moot

3.3 QUEST 2.0 and NASA-TLX

The average QUEST 2.0 result was 3.69, higher than the midpoint satisfaction score of 3.0. The average QUEST 2.0 score of the neurotypical group was 1.25 points higher than that of the group with CP, and the neurotypical group gave higher scores than the group with CP for both the EG and BCI systems (Table 4). The Wilcoxon rank-sum test for the average QUEST 2.0 score between EG and BCI gave U = 16, p-value = 1.121e−2 (at a 95% confidence level and sample size of 10, the critical value is 23).

The average NASA-TLX was 3.85, lower than the midpoint NASA-TLX score of 5.0. As with QUEST 2.0, the neurotypical group had a better experience than the group with CP. The average NASA-TLX score for the EG system was 1.79 for the neurotypical group and 3.32 for the group with CP; both groups' NASA-TLX scores for the EG system were below 5.0. In contrast, both groups scored the BCI system above 5.0: the neurotypical group marked 5.20 whilst the group with CP marked 7.00 (Table 5). The Wilcoxon rank-sum test for the average NASA-TLX score between EG and BCI gave U = 3, p-value = 4.396e−4 (at a 95% confidence level and sample size of 10, the critical value is 23).

Table 4 Average QUEST
Table 5 Average NASA-TLX

The three highest-rated factors of importance for AAC devices reported by the participants were effectiveness, comfort and ease of use (scores of eight, six and six, respectively). Besides these three, weight was also considered an important factor (score of three). The remaining factors scored: dimensions (2), adjustments (2) and durability (2).

In addition to the survey data, verbal feedback was given by the participants. Three participants (NT04, CP01, CP02) reported that the dry EEG cap was acceptable when worn without the HoloLens 2. However, the weight of the HoloLens 2 increased the pressure of the EEG cap, and the spikes of the EEG cap electrodes made them feel uncomfortable and even in pain after prolonged wear. These three participants also rated weight negatively in the QUEST 2.0 survey.

4 Discussion

4.1 Usability and acceptability

The results indicated that an MR-based AAC device is feasible for communication purposes. The EG system showed a high accuracy level and an acceptable ITR. In addition, the EG system demonstrated high sensitivity, meaning that although the system may make errors, they can be amended using the delete function button. All participants were able to finish composing three sentences with the EG interaction at least twice. For the EG system, the results also demonstrated that both accuracy and ITR increased after practice, and the statistical analysis revealed that practice had a significant impact on this improvement. The QUEST 2.0 and NASA-TLX results implied a high satisfaction rate and low physical effort to complete the tasks. The important factors revealed by QUEST 2.0 showed that the device still has room for improvement regarding its safety, durability, size and adjustability. Notwithstanding, the results suggested that the EG-based AR AAC design was acceptable to most of the participants. This wAAC design provided privacy, portability and acceptable accuracy for people with CCN, such as the people with cerebral palsy in this study. The results showed that the EG-based wAAC design was novel and useful for people with cerebral palsy and could potentially improve their communication experience. Although more people with cerebral palsy and more severe speech impairments would be needed to further validate the system's robustness, the current EG-wAAC design exhibited adequate usability and acceptability for people with cerebral palsy and CCN.

On the other hand, the results from the BCI sessions were debatable. The QUEST 2.0 and NASA-TLX results showed that the participants experienced disappointment and frustration, and the statistical analysis showed that the EG system had overwhelmingly better acceptability than the BCI system. Although both the online results and the traditional offline analysis provided strong evidence of the BCI system's low usability, it was still worth investigating. One reason is that BCI may be the only option for people with severe movement disabilities who cannot control their eye movements. Another is that BCI has the potential to be developed into a more user-friendly system. In addition to the traditional offline analysis, we also performed a dedicated time-synchronisation investigation of selected cases.

4.2 Time synchronisation: a case study

One session, the last session of participant NT01, was selected for a case study. This session was under Condition C, where the data were included for analysis and the saline spray was more than 10 ml. The online BCI classification accuracy for this session was 9.10% and the offline result was 36.36%. As mentioned before, time synchronisation was performed between the server laptop and the HoloLens 2. The server laptop was synchronised to Windows' official time server at the beginning of the session. The HoloLens 2 had its own synchronisation mechanism: it was synchronised to the same time server as the laptop, but the sync time was unknown. At the start of the session, the time synchronisation information showed that the HoloLens 2 sent the time-sync request at 2022-02-09 14:33:23.792 (HoloLens 2's Sydney time). The recorded receiving time on the server laptop was 2022-02-09 14:32:46.584 (the laptop's Sydney time), giving a time difference of −37.208 s. The second response time recorded from the HoloLens 2 was 2022-02-09 14:33:23.894; therefore, the estimated ping was 0.051 s.

Nonetheless, this estimated ping did not yield the expected accuracy: the performance was just 36%, far from the results obtained during the development phase. Hence, a further exploration of the time difference was performed using a brute-force search. The network ping was tested from 0.4 s down to 0 s, with a decreasing step of 0.004 s, the minimum step at a sampling rate of 250 Hz. The data processing and classification remained the same as in the conventional offline analysis introduced in the previous sections: a 30 Hz low-pass filter, central-channel selection, and EEGNet. The highest accuracy, 64%, appeared at ping = 0.376 s; the lowest, 0%, appeared at ping = 0.264 s. The estimated ping of 0.051 s rounds to 0.052 s, at which the result remained 36%. See Fig. 12 for details. This analysis implied that the current network delay estimation method could be improved, and that it might be the major contributor to the low accuracy of the classification results.
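A minimal sketch of this brute-force search is given below; the extract_fragments helper and the classifier's evaluate method are hypothetical stand-ins for the epoching and EEGNet evaluation steps.

```python
import numpy as np

def search_best_ping(eeg, flash_times, labels, classifier, extract_fragments):
    """Re-epoch the EEG under candidate network delays and keep the best one."""
    best_ping, best_acc = 0.0, -1.0
    for ping in np.arange(0.4, -0.001, -0.004):  # 0.4 s down to 0 s in 4 ms steps
        # Shift every flash onset by the candidate delay, then re-extract
        # the 1.4 s fragments and score them with the offline classifier.
        fragments = extract_fragments(eeg, flash_times + ping)
        acc = classifier.evaluate(fragments, labels)
        if acc > best_acc:
            best_ping, best_acc = ping, acc
    return best_ping, best_acc
```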

Fig. 12

The highest BCI accuracy of the NT01 Session 08 appeared at 0.376 s with 64% whilst the lowest appeared at 0.264 s with 0%

Although the time synchronisation issue was detected, it could not be resolved at the time of the experiments. Additionally, the further offline analysis could only help us understand the issue, and with the current design it requires a very lengthy computation. This will not be addressed in full in this paper, as it is out of scope, but will be described in a separate publication.

4.3 Review of the BCI system used in this study

A few issues prevented the BCI-based wAAC from reaching its best performance. Firstly, a few bugs existed in the system. Although these bugs did not interrupt the trials, there was still a probability that they influenced the results, so the affected data were excluded from the analysis. The conductive saline was another factor: despite the overall low accuracy, larger amounts of saline spray increased the accuracy level, and the data for the NT01 case study were also collected with more than 10 ml of saline. For future reference, conductive saline is recommended regardless of the electrode type, as it alleviates the high impedance of dry electrodes.

The major analysis challenge for this BCI system was time synchronisation. As the case study indicated, proper time synchronisation would significantly improve the accuracy of a session. Time synchronisation is not only a challenge for this study but also a common problem in distributed computing systems with wireless sensor networks (Römer et al. 2005).

In summary, the issues in the system were bugs, a lack of conductive saline, and time synchronisation. The bugs can be fixed easily, and conductive saline should be used even with dry-type electrodes; time synchronisation, however, was the major source of errors in the system, and it raises a more general topic for future studies.

4.4 Limitations

With respect to the time synchronisation issue, there is at present no perfect online solution. Although an offline analysis can compensate for the accuracy loss, it cannot currently be used in the online system; however, the learnings can inform future designs. The software side has limited scope for performance enhancement, so hardware modification should be a better solution. In other words, the second limitation of this work is hardware development: the hardware components are mostly commercial products that lack the flexibility to be modified, and the BCI system is therefore not robust with the current design. Moreover, participants complained that the spikes of the EEG cap were painful, especially with the HoloLens 2 on top. It was hypothesised that dry electrodes would offer benefits, requiring no gel and less calibration waiting time. In reality, however, purely dry electrodes yielded an accuracy level only slightly above the chance level, as Amaral et al. (2017) also confirmed. Saline becomes necessary to achieve a higher accuracy level; nevertheless, even with saline sprayed, the classification results were lower than with traditional wet electrodes.

According to conversations between the participants and investigators, some people with CP may have involuntary head movements which may decrease the accuracy of the EG system. This was not taken into consideration during the design phase; one of the purposes of this study is to collect vital feedback like this. Additionally, one of the participants with CP mentioned that the eye-tracking calibration integrated into the HoloLens 2 still needs one click, which requires hand movement. This task is straightforward for most neuro-typical people, but for people with CP who have severe movement disabilities, the hand-movement requirement would prevent them from completing it. If a back-end control system could be implemented, it would be easier for a support person to help them start the session.

Besides the software and hardware limitations, the study also had shortcomings. Firstly, the sample size of participants with CP was small. One reason was that the worldwide pandemic made recruitment challenging: people were reluctant to participate in person in these types of trials, especially when close contact was needed. Extra safety protocols were set up and approved by the Human Ethics Committee of the University of Sydney to protect the participants, in particular regarding COVID-19. Despite these extra protocols, the severity and high infection rate of COVID-19, and the accompanying restrictions, hindered the recruitment of potential participants. The low number of participants meant it was not possible to compare effects between experimental conditions. Secondly, the group with CP did not include participants who were unable to speak, as all participants had some degree of speech ability; future studies can involve people with a more extensive range of severe speech disabilities. Thirdly, although the systems were extensively tested before participants were enrolled in the study, some bugs were still evident and, unfortunately, not fully preventable. Fortunately, most participants volunteered to take part in additional tests. Nevertheless, only a limited amount of data collected under the best condition could be used in the analysis.

4.5 Future study suggestions

As a solution to the time synchronisation problem is out of the scope of this study and this paper, a further investigation will be carried out and reported in our future publications. Besides the methods used in the case study, visualisation is also planned. Deep learning models have been found to be powerful tools for EEG analysis (Roy et al. 2019) and have significantly improved BCI utilisation. Nevertheless, since BCI is still an emerging assistive technology, its use for AAC purposes is still under development. The challenge of synchronising time precisely between different computing devices was identified in this study.

The authors believe that deep learning methods could also help resolve the time synchronisation problem in EEG classification. Hence, a deep learning model for the automatic detection of time differences is also planned. If this proves possible, a software solution may be the quickest way to resolve the BCI system's usability problem. Another possible software solution is to develop a g.Nautilus driver for the HoloLens. Applications on the HoloLens 2 are built on the Universal Windows Platform (UWP), so the possibility exists of migrating Windows desktop software to the HoloLens 2. Further inspection of the HoloLens 2 and the g.Nautilus is necessary before such a driver can be developed.

From the hardware perspective, it would be easier if the EEG cap could be integrated with the glasses. An existing example of this approach is Cognixion (2022). Cognixion's design shares some ideas with ours; however, it is an AR-based AAC, whilst ours is MR-based. Cognixion's glasses also lack the environment cameras essential for detecting objects in real life, and its EEG does not have full coverage of all channels. Notwithstanding, the merit of Cognixion's design is its integration of BCI and AR glasses: in such a hybrid device, time synchronisation is not an issue. Moreover, if the whole system can be developed from the ground up, a simpler eye-tracking calibration procedure can be realised. This would completely eliminate the requirement for body movement from the users and make the device more usable and inclusive.

Once these problems are resolved, it will be necessary to conduct further trials with a modified system and an increased number of participants with CP. Moreover, the application of this technology to other conditions that may lead to complex communication needs, such as the rare SATB2-associated syndrome, needs to be investigated. This study is the first step in paving the way towards an MR-based wAAC. A more comprehensive solution, developed, validated and verified with a larger number of participants, is the focus of our ongoing research.

5 Conclusion

This study investigated the usability and acceptability of an MR-based AAC design. The participants were most satisfied with the effectiveness, comfort and ease of use of the designed device, while the main aspects to improve were durability, dimensions, ease of adjustment and safety. The EG interaction system was found to be feasible in such a design. The BCI interaction was not usable due to the limited precision of time synchronisation between the two systems, and remains challenging without further algorithm enhancement or hardware modification. On the one hand, the system needs a more dedicated design; on the other, the core components, which come from commercial products, restrict adjustments to the hardware. The EEG analysis and BCI interaction need more exploration, especially when combined with an MR presence. Nevertheless, the acceptability of the EG system is evident: despite the instability of embedding BCI into the system at this early stage, the EG system showed its strengths in the trials and could potentially be a better AAC method for people with CCN in real life. Future research should investigate time synchronisation and related issues further. To address the problem, we are currently developing an ambitious solution based on major planned modifications to the technology, including the MR glasses and the EEG capture system.