Fascinating article by Woelfel, Schlippe, and Stitz. If I understood it properly, different qualities of an audio file or spoken voice could be interpreted and used to shape how the spoken words are represented typographically. In other words, volume might affect the size of the type, the font used, the leading, whether the type is ragged, or what have you.
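To make the idea concrete for myself, here is a minimal sketch of one such mapping, from loudness to type size. This is not the authors' method, just my own illustration: the names `CaptionStyle` and `style_for_loudness`, the decibel thresholds, and the point sizes are all assumptions I made up for the example.

```python
from dataclasses import dataclass


@dataclass
class CaptionStyle:
    """Hypothetical container for the typographic choices of one caption."""
    font_size_pt: float
    bold: bool


def style_for_loudness(loudness_db, quiet_db=-40.0, loud_db=-5.0,
                       min_pt=18.0, max_pt=48.0):
    """Map a loudness value (in dB, assumed already measured) to a type size.

    All thresholds and sizes are illustrative assumptions, not values
    taken from the article.
    """
    # Position of the loudness within the quiet..loud range, clamped to 0..1.
    t = (loudness_db - quiet_db) / (loud_db - quiet_db)
    t = max(0.0, min(1.0, t))
    # Linear interpolation between the smallest and largest type size.
    size = min_pt + t * (max_pt - min_pt)
    # Arbitrary rule: very loud speech is also set in bold.
    return CaptionStyle(font_size_pt=round(size, 1), bold=t > 0.75)


if __name__ == "__main__":
    for db in (-50.0, -22.5, -5.0):
        print(db, style_for_loudness(db))
```

A real system would presumably look at much more than loudness (pitch, tempo, and so on) and drive more than size, but even this toy version shows how quickly the choice of thresholds becomes a design decision of its own.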
For captioning, I thought this might be interesting as a potential algorithmic approach to caption generation. It might be a good way to show anger, calm, seduction, etc. Then again, it might also require viewers to learn an entirely new vocabulary, or develop an additional literacy, to interpret what specific things like size, font, or color might mean.
Then there are a few potential problems: would each company have its own algorithm, leaving us with no consistent standard? Would the standard be free and given away? Would there be a charge for it? Would this only appear in videos or films that could afford it?
One approach that I think might work is, of course, inspired by the epic and experimental captioning done in Night Watch and the new Sherlock Holmes. Rather than trying to apply these kinds of effects to all text, perhaps they could be used for things such as representing NSI (non-speech information) on screen and single-word utterances, such as profanities and exclamations. That way they would be visually different from the normal presentation of utterances and sounds, and yet they could also embody certain components of the utterances in their type design.
Plenty to think about from this piece. Hope that we see more of this type of work--and maybe some of it will enter the realm of captions. Yes!