Do skills developed through musical expertise transfer to speech processing? The existing literature suggests that musical expertise is linked to better speech perception in noise (Du & Zatorre, 2017), slower age-related declines in speech perception (Bidelman & Alain, 2015), and enhanced preattentive processing of speech sounds in children (Chobert et al., 2012). Such positive associations have also been reported for different levels of speech processing, ranging from low-level processes such as acoustic perception (Chobert et al., 2012; Schön et al., 2004) to higher levels, such as metric/syntactic analyses (Dittinger et al., 2016; Jentschke & Koelsch, 2009; Marie et al., 2011; also see Magne et al., 2007, for related data from nonmusicians).

Existing research on the effect of musical expertise on speech processing has mainly focused on levels of speech (syllables, words, phrases) smaller than the clause or sentence. In daily life, however, comprehending speech typically requires processing entire clauses or sentences rather than just fragments of them. The effects of musical training on speech units of larger scope, such as clauses and sentences, remain to be better understood. To shed light on this issue, we investigated the effect of musical expertise on clause segmentation during sentence processing. Clause segmentation is fundamental for speech processing because clauses are basic perceptual units of continuous speech (Hirsh-Pasek et al., 1987), and detecting clause boundaries is the first step in extracting meaning from continuous speech (Frazier et al., 2006). In studies of language development, clause segmentation is considered a cornerstone of the development of speech perception (Seidl & Cristià, 2008).

It is well known that segmenting continuous speech into clauses relies on multiple dimensions of information (e.g., acoustic, semantic, and syntactic information). Generally speaking, listeners benefit from a direct syntax-prosody mapping: a major syntactic boundary such as a clause boundary is usually marked by a salient prosodic boundary with longer pauses, final lengthening, and pitch reset (Li & Yang, 2009, for Chinese; Price et al., 1991, for English; Holzgrefe-Lang et al., 2016, for German). Pauses are often defined as intervals of silence (Duez, 1985), final lengthening as the extended duration of the phrase-final syllable (Price et al., 1991), and pitch reset as the return of pitch to a higher level after declination (Cooper & Sorensen, 1977). The mean magnitudes of these acoustic parameters for boundary marking seem to differ across languages. For instance, in Standard Chinese, the average duration of clause boundary-marking pauses is about 270 ms (X. Yang et al., 2014), whereas in German, the pauses accompanying a major syntactic boundary have a mean duration of 560 ms (Männel & Friederici, 2009), much longer than that reported for Chinese. Despite these differences in the magnitude of acoustic boundary marking, brain potential studies show that such cues are immediately taken up during online prosodic parsing (Li & Yang, 2009, for Chinese; Steinhauer et al., 1999, for German).

While pauses, final lengthening, and pitch reset have all been reported to signal clause boundaries, their importance for clause perception (i.e., cue weighting) differs. A pause alone can serve as a sufficient cue to a phrase or clause boundary (Scott, 1982, for English; X. Yang et al., 2014, for Standard Chinese). Pitch reset and final lengthening, by contrast, are weaker cues for clause boundary perception: a major syntactic boundary can be perceived when the two cues occur in combination, even without a pause (Li & Yang, 2009, for Chinese; Holzgrefe-Lang et al., 2016, for German; Steinhauer et al., 1999, for German), but not when only one of the two cues is present (X. Yang et al., 2014, for Standard Chinese; Holzgrefe-Lang et al., 2016, for German).

Acoustic cue weighting for boundary perception varies across languages (Tyler & Cutler, 2009; White et al., 2020). For instance, White et al. (2020) compared the use of word-final vowel lengthening by native English, Hungarian, and Italian speakers and found that word-final vowel lengthening was exploited for segmentation only by English speakers. Thus far, no cross-linguistic studies have directly compared cue weighting in clause segmentation, although it too may be language dependent. For instance, the acoustic marking of information status in an utterance is more complicated in tone languages (e.g., Mandarin Chinese), given that pitch also distinguishes lexical meanings (Chen & Gussenhoven, 2008; Ouyang & Kaiser, 2015).

As clause boundaries carry essential information about the structure of a sentence, clause segmentation is influenced not only by acoustic factors but also by nonacoustic cues such as syntactic structure and semantic/pragmatic coherence (Himmelmann et al., 2018). Some studies have argued that differences between musicians and nonmusicians are found only in low-level psychoacoustic tasks that rely heavily on pitch perception (Boebinger et al., 2015; Fuller et al., 2014). On this view, music training would not be expected to play an important role in clause segmentation, which relies mainly on temporal (pause) rather than pitch cues. Alternatively, an effect of musical expertise may still emerge, since musicians also show better higher-order cognitive functions such as attention (Besson et al., 2011) and working memory (Clayton et al., 2016), which may transfer to improve speech processing.

To our knowledge, Glushko et al. (2016) is the only study to have examined the effect of music training on sentence-level speech segmentation, using German materials. Their ERP results showed that clause boundary-related brain responses (also called the language-CPS) had a later onset, shorter duration, and smaller amplitude in musicians than in nonmusicians. However, as they did not collect explicit behavioral judgments of clause segmentation, it cannot be verified whether the musicians in that ERP experiment indeed performed better than the nonmusicians at clause segmentation.

In this study, we explicitly tested whether musical expertise affects clause segmentation and, if so, how musicians' enhanced perceptual acuity to acoustic boundary cues contributes to it. We were also interested in whether the effect of musical expertise on clause segmentation reported for German listeners by Glushko et al. (2016) extends to a typologically very different language, Mandarin Chinese. To this end, we used an explicit boundary detection task and constructed six conditions: an all-cue condition in which all three major acoustic cues to a clause boundary (pause, final lengthening, and pitch reset) were present, a pause-only condition, a final-lengthening-only condition, a pitch-reset-only condition, a pause-and-final-lengthening-in-combination condition, and a no-cue condition (Footnote 1). If the effect of musical expertise on speech processing extends to sentence-level clause segmentation, we expected the musicians to show more acute sensitivity to the acoustic cues at clause boundaries and to perform better than the nonmusicians at correct boundary detection. Otherwise, there should be no differences between the two groups.

Method

Participants

Two groups of college students were recruited. One consisted of 36 musicians (21 females, mean age = 21.2 years, SD = 2.24). All had at least 7 years of formal music training (mean = 12.06, SD = 2.84) and played a musical instrument at a professional level (Footnote 2). They reported an average of 1.58 hours of musical practice per day (SD = 0.60) and had started to play a musical instrument at an average age of 7.1 years (SD = 1.97). The other group consisted of 36 nonmusicians (20 females, mean age = 21.9 years, SD = 2.11) who had never received any formal music training (except for the basic training provided at school; Footnote 3) and had never played an instrument. All participants were native speakers of Mandarin Chinese. The two groups were matched in age, t(70) = 1.39, p = .17, and in fluid intelligence as measured by Raven's Advanced Progressive Matrices, t(70) = 1.28, p = .21.

Materials

The stimuli consisted of 48 Chinese sentences previously used in X. Yang et al. (2014). All sentences were semantically and syntactically well-formed. Each sentence consisted of two clauses with an explicit internal boundary between them, which was the critical boundary of this study. An example of the experimental stimuli is shown in Example (1):

(1)

[由于/这几天/下雨] clause 1, [研究/进度/受到了/极严重的/影响] clause 2.

[You2yu2/zhe4ji3tian1/xia4yu3] clause 1, [yan2jiu1/jin4du4/shou4dao4le/ji2yan2zhong4de/ying3xiang3] clause 2.

[As/these days/raining] clause 1, [research/progress/under/severely/affected] clause 2.

[As it has been raining these days] clause 1, [the research progress has been severely affected] clause 2.

The sentences were recorded by a professional male speaker of Standard Chinese, a speech trainer working at the Communication University of China. The acoustic parameters were then extracted using Praat and analyzed in SPSS. Acoustic analyses revealed that the critical boundaries were explicitly marked by pauses, final lengthening, and pitch reset (see Table 2 in Supplementary Material A). The recorded sentences served as the all-cue condition in our study. From these 48 sentences, we created another five conditions in which the acoustic cues at the critical boundaries were manipulated. One is the no-cue condition, in which all three cues were removed. Three further conditions feature only one of the acoustic cues (pause-only, final-lengthening-only, and pitch-reset-only), allowing us to take a closer look at participants' sensitivity to each cue. The fifth is a pause-and-final-lengthening-in-combination condition, which enables us to examine listeners' sensitivity to the combined effect of the two durational cues. The methods used to create these five conditions are detailed in Supplementary Material A.
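For illustration, the sketch below shows one way the pitch-reset cue could be quantified from a recording: mean F0 just after the critical boundary minus mean F0 just before it. It uses the praat-parselmouth Python package rather than the Praat program used for the actual measurements, and the file name, boundary time, and window size are hypothetical.

```python
# Minimal sketch: quantify pitch reset across a clause boundary.
# Assumptions: praat-parselmouth is installed; file name, boundary time,
# and window length are placeholders, not values from the study.
import numpy as np
import parselmouth

snd = parselmouth.Sound("sentence_01.wav")
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]   # F0 per analysis frame (Hz)
times = pitch.xs()                       # frame midpoints (s)
f0[f0 == 0] = np.nan                     # Praat codes unvoiced frames as 0 Hz

boundary_t, window = 1.62, 0.25          # hypothetical boundary time and window (s)
pre = f0[(times >= boundary_t - window) & (times < boundary_t)]
post = f0[(times > boundary_t) & (times <= boundary_t + window)]

pitch_reset = np.nanmean(post) - np.nanmean(pre)
print(f"Pitch reset across the boundary: {pitch_reset:.1f} Hz")
```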

In total, 288 sentences (i.e., 48 target sentences, each in six conditions) were included as experimental stimuli, which were further divided into six lists using a Latin square design. Thus, each target sentence was presented only once (i.e., in only one condition) within each list. Forty filler sentences with no sentence-internal clause boundaries were also added to each list to balance the "Yes" and "No" responses. In total, each list contained 48 target sentences and 40 fillers.
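As an illustration of the list construction, the sketch below rotates the six conditions across six lists so that each target sentence occurs exactly once per list and in a different condition in each list. The exact rotation scheme used in the study is not specified here, so this is only one standard way of implementing such a Latin square.

```python
# Minimal sketch of a Latin-square assignment of 48 items to 6 lists.
CONDITIONS = ["all-cue", "pause-only", "final-lengthening-only",
              "pitch-reset-only", "pause+final-lengthening", "no-cue"]
N_ITEMS, N_LISTS = 48, 6

lists = {list_id: [] for list_id in range(N_LISTS)}
for item in range(N_ITEMS):
    for list_id in range(N_LISTS):
        # Rotate conditions so each item gets a different condition per list.
        condition = CONDITIONS[(item + list_id) % len(CONDITIONS)]
        lists[list_id].append((item + 1, condition))

# Each list now holds all 48 target items (eight per condition);
# the 40 fillers (not shown) would be appended to every list.
print(lists[0][:6])
```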

Procedures

The stimuli were presented using E-Prime 2.0. Before the experiment, six practice sentences were presented to ensure that the participants were familiar with and understood the procedure (Footnote 4). Each trial began with a fixation cross (lasting 1,000 ms) in the middle of the screen, followed by a sentence presented over headphones (Footnote 5). At the end of each sentence, a question appeared on the screen, and the participants judged whether or not a boundary had appeared in the sentence they had just heard by pressing J or F on the keyboard. For instance, for Example (1), the question was "Do you perceive a boundary between 'yu3' and 'yan2'?" (Footnote 6). After the participants gave their response, the next trial began immediately. The whole experiment lasted about 15 minutes.

Data analysis

For each participant, the proportions of boundaries detected were calculated. Given the binary nature of the data, mixed-effects logistic regression (i.e., a generalized linear mixed-effects model with a binomial family) was conducted with two fixed effects: CONDITION (six levels: all-cue, pause-only, final-lengthening-only, pitch-reset-only, pause-and-final-lengthening-in-combination, and no-cue) and GROUP (musician vs. nonmusician). Participants and items were included as random effects on the intercept and slope. As our task was not a reaction-time task and participants were told to wait until the end of each sentence to respond, reaction times are not reported here.
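For readers who wish to see the shape of such an analysis, the sketch below fits a simplified, fixed-effects-only logistic regression in Python with statsmodels; the model reported above additionally included crossed random intercepts and slopes for participants and items (as typically fitted with mixed-model software). The column names and file name are hypothetical.

```python
# Minimal sketch of the CONDITION-by-GROUP analysis of boundary detections.
# Simplification: fixed effects only; the reported model also had crossed
# random effects for participants and items. Data layout is assumed:
# one row per trial with columns detected (0/1), condition, group.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("boundary_detection.csv")  # hypothetical trial-level file

model = smf.glm(
    "detected ~ C(condition) * C(group)",
    data=df,
    family=sm.families.Binomial(),
).fit()
print(model.summary())
```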

To explore the weighting of the cues in the two groups more explicitly, linear mixed-effects modeling of the cue weights was also conducted. Cue weights were calculated following the least-squares approach introduced in Kasturi et al. (2002). Each condition was coded with a combination of 0s and 1s (Supplementary Material C), modeling the presence versus absence of each cue. The averaged proportions of boundaries detected in each condition were used to determine the importance of each cue, and the weights were calculated in MATLAB following the four equations detailed in Kasturi et al. (2002). For the statistical analyses of the cue weights, linear mixed-effects modeling was conducted with CUE (three levels: pause, final lengthening, and pitch reset) and GROUP (musician vs. nonmusician) as fixed effects. Because item information was averaged when calculating the cue weight values, no random effect of item was included. The model with by-participant random intercepts and slopes failed to converge, so a simpler model with random intercepts only was used. For all analyses, Bonferroni adjustments were employed for multiple comparisons.
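To convey the general idea of the least-squares cue-weight computation (without reproducing the exact four equations of Kasturi et al., 2002, or the coding scheme in Supplementary Material C), the following sketch regresses one participant's hypothetical detection proportions on an assumed 0/1 cue-presence coding of the six conditions.

```python
# Minimal sketch of least-squares cue-weight estimation for one participant.
# Assumptions: the 0/1 coding below and the proportions in y are illustrative;
# the published weights follow Kasturi et al.'s (2002) equations, which may
# additionally normalize the weights (e.g., to sum to 1).
import numpy as np

# Columns: pause, final lengthening, pitch reset (assumed coding per condition)
X = np.array([
    [1, 1, 1],  # all-cue
    [1, 0, 0],  # pause-only
    [0, 1, 0],  # final-lengthening-only
    [0, 0, 1],  # pitch-reset-only
    [1, 1, 0],  # pause + final lengthening
    [0, 0, 0],  # no-cue
])
y = np.array([0.95, 0.85, 0.30, 0.25, 0.88, 0.10])  # hypothetical proportions

# Add an intercept column and solve in the least-squares sense.
X1 = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(X1, y, rcond=None)
weights = coefs[1:]
print(dict(zip(["pause", "final_lengthening", "pitch_reset"],
               np.round(weights, 3))))
```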

Results

For the proportions of boundaries correctly detected (Fig. 1), there was a significant main effect of CONDITION (F = 43.14, p < .001). As shown in Table 1, three conditions (the pause-only, the pause-and-final-lengthening-in-combination, and the all-cue conditions) had, on average, higher rates of correct boundary detection than the other three conditions (|z|s ≥ 7.41, ps < .001). Furthermore, the all-cue condition showed a higher correct detection rate than the pause-only condition (z = 3.01, p < .05), and the final-lengthening-only and pitch-reset-only conditions had higher correct detection rates than the no-cue condition (|z|s ≥ 4.36, ps < .01) (Footnote 7).

Fig. 1. Proportions of boundaries detected (error bars represent standard errors of the means). Note: * p < .05

Table 1 Pairwise comparisons among conditions for proportions of boundaries detected (corrected p values are reported)

More importantly, there was a significant interaction of CONDITION and GROUP (F = 2.45, p < .05). Planned comparisons showed that the musicians correctly detected more boundaries than the nonmusicians in the pause-only condition (b = 0.76, SE = 0.36, z = 2.14, p < .05) and the all-cue condition (b = 0.98, SE = 0.44, z = 2.24, p < .05), but reported fewer boundaries in the no-cue condition (b = −0.68, SE = 0.34, z = −2.02, p < .05). No other comparisons were significant (|z|s ≤ 1.59, ps ≥ .05).

The results for cue weighting revealed a main effect of CUE (F = 95.91, p < .001), driven by a weighting bias: pauses were weighted more heavily than final lengthening (b = 0.37, SE = 0.03, t = 12.36, p < .001) and pitch reset (b = 0.34, SE = 0.03, t = 11.21, p < .001), whereas the latter two cues did not differ (b = −0.04, SE = 0.03, t = −1.15, p > .05). More importantly, there was an interaction between CUE and GROUP (F = 5.86, p < .01). Simple effects analyses showed that, for the musicians, pause was weighted more heavily than final lengthening (b = 0.45, SE = 0.04, t = 10.47, p < .001) and pitch reset (b = 0.44, SE = 0.04, t = 10.21, p < .001), with the latter two cues not differing from each other (final lengthening vs. pitch reset, b = −0.01, SE = 0.04, t = −0.26, p > .05). For the nonmusicians, pause was also weighted more heavily than the other two cues, but with smaller differences (pause vs. final lengthening, b = 0.30, SE = 0.04, t = 7.01, p < .001; pause vs. pitch reset, b = 0.24, SE = 0.04, t = 5.64, p < .001); again, final lengthening did not differ from pitch reset (b = −0.06, SE = 0.04, t = −1.37, p > .05). These results, plotted in Fig. 2, suggest that the weighting bias was more pronounced for the musicians than for the nonmusicians.

Fig. 2. Weighted values of the acoustic cues for the musicians and nonmusicians (error bars represent standard errors of the means). Note: ** p < .01; *** p < .001

Discussion

The goal of the current study was to shed light on the impact of music training on clause segmentation. We were interested in how musicians and nonmusicians may differ in their use of the acoustic cues to clause boundaries when parsing speech. Results showed that, regardless of music background, the proportion of correct boundary detections was higher in three conditions (the all-cue, pause-and-final-lengthening-in-combination, and pause-only conditions) than when only pitch reset or final lengthening was present. The no-cue condition, as expected, elicited the lowest rate of boundary detection. Analyses of cue weights likewise showed that pauses were weighted much more heavily than final lengthening and pitch reset. Similar cue-weighting patterns were confirmed in the conceptual replication of this experiment (see Supplementary Material E for details). These findings have implications for the role of acoustic cues in clause segmentation: in Mandarin Chinese, the perception of clause boundaries depends heavily on the presence of pauses. Our results replicate the findings reported in X. Yang et al. (2014) for Standard Chinese and are in line with previous findings for other languages; for example, Holzgrefe-Lang et al. (2016) reported that in German, final lengthening or pitch change alone is not adequate to cue a boundary.

Interestingly, the combination of pause and final lengthening did not facilitate correct boundary detection in either group, supporting the view that a greater number of cues does not necessarily produce an additive facilitatory effect; rather, it is the specific cue constellation that is decisive for the detection of a boundary (Wellmann et al., 2012).

Several noteworthy findings related to the impact of music training were observed. First, in this experiment and in the conceptual replication reported in Supplementary Material E, we consistently observed a higher proportion of correctly identified boundaries by the musicians in the all-cue condition, in line with previous reports that music training is associated with better performance in speech processing (Besson et al., 2011, for a review; Marie et al., 2011). Complementing this finding, when the acoustic cues for the internal boundary were absent (the no-cue condition), the musicians reported fewer boundaries than the nonmusicians. This suggests that the musicians outperformed the nonmusicians both in identifying boundaries that are clearly marked by acoustic cues and in rejecting cases that lack the acoustic cues expected at a boundary.

As noted above, syntactic clause boundaries are typically marked by pauses, final lengthening, and pitch changes. In music, chunking and boundary marking likewise rely on pauses and on the length of pre- and postboundary notes (Zhang et al., 2016). In the no-cue condition, the critical (syntactic) boundaries were presented without acoustic cues, which runs against the prototypical mapping between syntax and prosody in speech. We did not explicitly tell our participants what exactly a "boundary" is, so our results in the no-cue condition could have arisen from multiple factors: the nonmusicians may have relied more on semantic or syntactic knowledge to detect the clause boundaries, whereas the musicians perhaps relied more on a combination of syntactic and prosodic information.

It is also likely that years of music training helped the musicians in our experiment develop higher expectations for the acoustic marking of speech boundaries than the nonmusicians, which, in turn, enhanced their ability to detect both the presence and the absence of boundary cues. A related possibility is that years of musical training sharpened the musicians' attention mechanisms in general, enabling them to focus better on the task at hand (Besson et al., 2011; Patel, 2012). While it would be interesting to tease apart these two possibilities, both mechanisms could have been at play in our results.

Note that in the conceptual replication of this experiment with a different set of stimuli (see Supplementary Material E for details), the rate of boundary detection in the no-cue condition was much higher in both groups, and only a trend toward the group effect was replicated, with a reduced magnitude. The differences between the two groups in the no-cue condition should therefore be interpreted with caution. We suspect that the musician and nonmusician groups, as well as the individuals within each group, may have reacted dynamically and differentially to the syntactic and semantic information in the signal in the absence of explicit acoustic boundary cues. The interaction of syntactic/semantic information and acoustic cues should be explored in further studies.

A second interesting observation concerns the musicians' use of the cues (which was replicated and is reported in Supplementary Material E). Of the three conditions featuring only one acoustic cue (the pause-only, final-lengthening-only, and pitch-reset-only conditions), the musicians differed from the nonmusicians only in the pause-only condition. The analyses of cue weights revealed that both groups weighted pause as the most important cue for clause boundary segmentation, but this weighting bias was more pronounced in the musicians.

Note that pause has been shown to be the most reliable cue to clause boundaries in Mandarin Chinese (X. Yang et al., 2014) and English (Scott, 1982). Our results may thus indicate that music training does not simply enhance general perceptual acuity to acoustic cues; rather, it enabled the musicians to become selectively more sensitive to the most heavily weighted cue(s) when the typical constellation of acoustic cues was not available to secure correct identification. In this way, the musicians outperformed the nonmusicians in judging the appropriateness of the acoustic marking of clause boundaries and in benefiting from incomplete but reliable cues, thereby achieving greater efficiency in boundary identification.

An alternative account of the musicians' reliance on pause in our study is a purely language-specific effect, related to cross-language differences in cue use in speech segmentation (Tyler & Cutler, 2009). Mandarin Chinese is a tone language in which pitch variations are used to distinguish lexical tones; pitch rises and falls at speech boundaries mark not only boundary-related changes but also lexical tone contours (which distinguish word meanings; Chen & Gussenhoven, 2008; Y. Yang & Wang, 2002). This may limit the extent to which pitch change can be used to mark clause boundaries, so Mandarin listeners may rely more on pauses in general. The fact that the musicians relied on pauses even more strongly and outperformed the nonmusicians in the pause-only condition could then be taken to indicate that they are more attuned to native, language-specific patterns of boundary cues for speech segmentation. In either case, the findings of this study add important knowledge to the literature on cue weighting and speech segmentation.

It is important to note that our results do not establish a causal relationship between musical training and enhanced clause segmentation, given that direct evidence of causality would require effects obtained in training paradigms (e.g., Chobert et al., 2012). As our results were based on an offline boundary detection task, it would be revealing to run an ERP experiment examining the online perception of these acoustic cues by musicians and nonmusicians in a more naturalistic speech comprehension setting. Finally, we would like to draw readers' attention to the fact that, for two-response discrimination tasks, it would be preferable to employ a sensitivity measure such as d', which can be estimated from data that include both hit and false-alarm rates per condition (Botella & Suero, 2020; Verde et al., 2006). In the present study, we could not calculate d' because all the conditions were constructed with clear syntactic/semantic boundaries and we did not have distractor items from which to calculate false-alarm rates per condition. Thus, methodologically, there is room for improvement in future studies.
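For completeness, the sketch below illustrates how d' can be computed from a hit rate and a false-alarm rate when distractor items are available; the rates and trial count are hypothetical, and the adjustment applied to extreme proportions is one common choice (a log-linear style correction), not the only one.

```python
# Minimal sketch of d' from per-condition hit and false-alarm rates.
# Assumptions: illustrative rates and trial count; log-linear style correction
# to avoid infinite z-scores when a rate is exactly 0 or 1.
from scipy.stats import norm

def d_prime(hit_rate, fa_rate, n_trials=48, correction=0.5):
    h = (hit_rate * n_trials + correction) / (n_trials + 2 * correction)
    f = (fa_rate * n_trials + correction) / (n_trials + 2 * correction)
    return norm.ppf(h) - norm.ppf(f)

print(round(d_prime(0.90, 0.15), 2))  # hypothetical hit and false-alarm rates
```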

To conclude, the present study extends prior research on the effect of musical expertise on speech segmentation, which has mainly used tasks tapping low-level auditory perception (Bidelman et al., 2011; Du & Zatorre, 2017). Our results reveal that even at the higher clause/sentence level, musicians outperformed nonmusicians in detecting internal clause boundaries and showed more acute sensitivity to the most reliable cues for clause segmentation. These findings complement the existing literature by providing substantial evidence for the impact of music training on higher-level speech processing and contribute new knowledge to our understanding of acoustic cue weighting and speech processing.