Description
Using the cmu-slt unit-selection voice, the TEXT
uh.
uh.
oh.
has boundary durations predicted as ACOUSTPARAMS:
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="400" tone="L-L%"/>
</phrase>
</s>
</p>
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="398" end="0.398275" f0="(0,165) (50,267) (100,235)" p="V"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="400" tone="L-L%"/>
</phrase>
</s>
</p>
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="338" end="0.338394" f0="(0,165) (50,311) (100,235)" p="@U"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="400" tone="L-L%"/>
</phrase>
</s>
</p>
</maryxml>
Note the constant duration="400" (ms) for each boundary element.
But when this is actually synthesized, the REALISED_ACOUSTPARAMS becomes:
<?xml version="1.0" encoding="UTF-8"?>
<maryxml xmlns="http://mary.dfki.de/2002/MaryXML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="0.5" xml:lang="en-US">
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
</phrase>
</s>
</p>
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' V" pos="UH">
uh
<syllable accent="!H*" ph="V" stress="1"><ph d="88" end="0.088750005" f0="(0,165) (50,267) (100,235)" p="V" units="V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="200" tone="L-L%" units="__L arctic_b0385 67582 0.2"/>
</phrase>
</s>
</p>
<p>
<s>
<phrase>
<t accent="!H*" g2p_method="lexicon" ph="' @U" pos="UH">
oh
<syllable accent="!H*" ph="@U" stress="1"><ph d="246" end="0.2468125" f0="(0,165) (50,311) (100,235)" p="@U" units="@U_L arctic_a0105 7295 0.0880625; @U_R arctic_b0352 65184 0.15875"/></syllable>
</t>
<t pos=".">
.
</t>
<boundary breakindex="5" duration="200" tone="L-L%" units="__L arctic_b0352 65185 0.2"/>
</phrase>
</s>
</p>
</maryxml>
Note how the specified boundary durations have been halved from 400 to 200 ms.
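The halving can also be checked mechanically instead of by eye. The following is a minimal sketch, assuming both outputs are available as strings; the helper names `boundary_durations` and `compare` are made up for illustration, and the two inline documents are trimmed to a single boundary each with the values shown above.

```python
import xml.etree.ElementTree as ET

# MaryXML namespace, as declared on the <maryxml> root element above.
NS = {"m": "http://mary.dfki.de/2002/MaryXML"}


def boundary_durations(maryxml):
    """Return the duration (ms) of every <boundary> element, in document order."""
    # Encode to bytes so a leading XML declaration with an encoding is accepted.
    root = ET.fromstring(maryxml.encode("utf-8"))
    return [
        int(b.get("duration"))
        for b in root.iterfind(".//m:boundary", NS)
        if b.get("duration") is not None
    ]


def compare(predicted_xml, realised_xml):
    """Pair predicted and realised boundary durations and compute their ratio."""
    pairs = zip(boundary_durations(predicted_xml), boundary_durations(realised_xml))
    return [(p, r, r / p) for p, r in pairs]


# Trimmed-down stand-ins for the ACOUSTPARAMS and REALISED_ACOUSTPARAMS output.
PREDICTED = (
    '<maryxml xmlns="http://mary.dfki.de/2002/MaryXML"><p><s><phrase>'
    '<boundary breakindex="5" duration="400" tone="L-L%"/>'
    "</phrase></s></p></maryxml>"
)
REALISED = (
    '<maryxml xmlns="http://mary.dfki.de/2002/MaryXML"><p><s><phrase>'
    '<boundary breakindex="5" duration="200" tone="L-L%" '
    'units="__L arctic_b0385 67582 0.2"/>'
    "</phrase></s></p></maryxml>"
)

for predicted, realised, ratio in compare(PREDICTED, REALISED):
    print(f"predicted={predicted} ms  realised={realised} ms  ratio={ratio}")
```

Feeding the full documents from this report into `compare` should give a ratio of 0.5 for all three boundaries.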
Furthermore, by inspecting the PRAAT_TEXTGRID or similar, we can plainly confirm that the boundaries are only 0.2 seconds long.
And the units tier tells us which units from the unit-selection database are selected to render the boundaries as pauses.
Interestingly, dumping and inspecting the voice data reveals that those units (indices 67582 and 65185) are actually 0.1284 and 0.1529 seconds long, respectively.
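To make the mismatch concrete, here is a rough sketch that parses the units attributes from the REALISED_ACOUSTPARAMS output and compares them with the dumped unit lengths quoted above. It assumes each entry in units has the form "name utterance index duration" with the duration in seconds, which is inferred from the output in this report rather than taken from any documentation; `parse_units`, `total_duration` and `DUMPED` are illustrative names.

```python
def parse_units(units_attr):
    """Split a units attribute into (name, utterance, unit index, duration in s).

    Assumed entry format: "<name> <utterance> <index> <duration>", entries
    separated by ";" (inferred from the REALISED_ACOUSTPARAMS output).
    """
    entries = []
    for item in units_attr.split(";"):
        name, utterance, index, duration = item.split()
        entries.append((name, utterance, int(index), float(duration)))
    return entries


def total_duration(units_attr):
    """Sum the per-unit durations (seconds) listed in a units attribute."""
    return sum(duration for _, _, _, duration in parse_units(units_attr))


# Unit lengths found by dumping the voice data, as reported in the text.
DUMPED = {67582: 0.1284, 65185: 0.1529}

BOUNDARY_UNITS = [
    "__L arctic_b0385 67582 0.2",
    "__L arctic_b0352 65185 0.2",
]

for units_attr in BOUNDARY_UNITS:
    for name, utterance, index, realised in parse_units(units_attr):
        print(f"unit {index}: realised {realised} s, dumped {DUMPED[index]} s")

# For a phone, the half-phone durations add up to its realised length
# (compare end="0.088750005" on the first "uh").
print(total_duration("V_L arctic_a0146 10273 0.045; V_R arctic_a0146 10274 0.04375"))
```

So the boundary units are reported as 0.2 s each, which matches neither the requested 400 ms nor the lengths of the units in the voice data.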
TL;DR: The duration attributes of boundary elements have their specified values reduced by 50% when synthesizing from the specified ACOUSTPARAMS to REALISED_ACOUSTPARAMS, and the lengths of the corresponding pauses are accordingly wrong.
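For anyone wanting to reproduce this, the sketch below builds request URLs for both output types. It assumes a MaryTTS server running locally on port 59125 and the `/process` HTTP endpoint with the `INPUT_TEXT`, `INPUT_TYPE`, `OUTPUT_TYPE`, `LOCALE` and `VOICE` parameters; the host, port and the helper name `process_url` are assumptions and may need adjusting for a given setup.

```python
from urllib.parse import urlencode

# The three-line TEXT input used in this report.
TEXT = "uh.\nuh.\noh."


def process_url(text, output_type, base="http://localhost:59125"):
    """Build a /process request URL (assumed local server and parameter names)."""
    params = {
        "INPUT_TEXT": text,
        "INPUT_TYPE": "TEXT",
        "OUTPUT_TYPE": output_type,
        "LOCALE": "en_US",
        "VOICE": "cmu-slt",
    }
    return f"{base}/process?{urlencode(params)}"


# One URL per output type; fetch them with any HTTP client and compare.
for output_type in ("ACOUSTPARAMS", "REALISED_ACOUSTPARAMS"):
    print(process_url(TEXT, output_type))
```

Fetching both URLs (for example with `urllib.request.urlopen`) and comparing the boundary elements in the two responses should show the 400 vs. 200 ms discrepancy described above.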