Because I have cancer I have time to reflect, which is nice.
This is a very personal summary of my thoughts on the field of computer generated speech production, a field that I have grown to love despite the fact that I stumbled into it pretty much by chance (thank you Alistair Edwards at York and Roger Moore at Sheffield). It’s not an academic article. It’s very opinionated and is not intended to be scholarly or balanced. So if you want that then don’t read this.
Nor am I going to reference the work of others; I got so bored doing that when I wasn’t ill, so at least grant me the privilege of being reference-free while I am ill. Nor do I proof read, so don’t be tyresome and point out my typoos. If this activity gives you satisfaction, well, go get a life. Try it – it’s so liberating to write something like a novel rather than something like a yellow pages for academic worthies. Instead I will freely and selfishly explore my personal motivation to engage with the field, and I will ask the big questions I ponder from time to time. I have posed these questions a number of times to specialists in the research community and to speech industry people. My habit is to corner them at conferences over an indifferent, soggy pastry and cold coffee while I am ‘performing’ as the token artist. Their reaction can be to assume that I am threatening their livelihood or reputation by suggesting that computer speech production which aspires to human verisimilitude is both a hopeless cause and ultimately deadly dull in comparison to the wealth of possibilities they have at their disposal to create new voices that express themselves in ways never heard before. This has frequently resulted in the cold shoulder at the equally indifferent, soggy buffet lunch, or a look of startled incomprehension and suspicion. Anyway, either I have been too arrogant and smug in my enclosed artistic sensibility, or I didn’t listen, or I could not understand their response, or I have forgotten what it was; but I am still not satisfied that I have received any wholly convincing answers yet. So perhaps my readers could help me out and suggest a few. Unlike academic professionals, theatre professionals take harsh, vulgar and unjustified criticism as their bread and butter, so don’t hold back – but make sure you can take what you dish out.
When a machine speaks it blurs the boundaries between the human and the machine. It transports the listener into a curious imaginative space akin to diving down Alice’s rabbit hole. That’s obvious, but easily forgotten once you are immersed in the field. It’s a very exciting space, full of creative possibilities; after all, the opportunity to fashion human speech from scratch is a god-like or golem-like opportunity, the stuff of poets, composers, actors, visionaries, the mad and megalomaniacs. For me as a theatre person it combines voice coach, dramatist, director and actor into one glorious whole that is unencumbered by budgets, producers and stroppy actors. For those more technically literate than me it frees the creative voice developer/designer from the constraints imposed by conventional human speech, allowing her to develop clever new voices that may never have been heard before, and that may be more effective at side-stepping some of the issues – the Uncanny Valley being one – that can confound attempts at human verisimilitude.
So my first question is: Why are all the research efforts being directed toward inevitably stumbling down the ‘Uncanny Valley’ by creating nearly-but-not-quite-human verisimilitude, rather than cleverly avoiding it by dropping the human verisimilitude goal? Or to put it another way, isn’t an interesting computer voice better than a dull nearly-but-not-quite-human computer voice?
There may be an argument that the Uncanny Valley is not an inevitable obstacle for human-like computer speech, but the point remains: why is no one trying this alternative approach? After all, it has got to be more fun to make something new than to copy something that already exists. To me it is reminiscent of the efforts in Russian realist theatre when, in order to realistically represent a forest on stage, they planted a forest on stage – the result was that the leaves fell off the trees and the actors slipped on them. In fact a tree can be quite successfully represented by the imagination of the audience if it is stimulated so to do.
My second question: Is the objective of the synthesis community to make computer speech as human-like as possible or as intelligible as possible? Clearly the two do not amount to the same thing. Think of Marlon Brando in The Godfather. I won’t list here the factors that contribute to human-like speech – I don’t think I know them – but intelligibility has been a variable manipulated by method actors ever since James Dean successfully impersonated a disgruntled teenager in ‘Rebel Without a Cause’. Once we head into the territory of rendering emotion synthetically, intelligibility has to be a variable subject to negative settings. If this factor is going to be subject to some sort of regulator then this is counter to the principle of the goal of human verisimilitude, and it opens the door for a whole bunch of other constraining factors that will also undermine any efforts in this direction. My point is this: many designers believe that some sort of happy medium can be found that in some way encapsulates a normalised version of human-like speech that is broadly credible and intelligible. I think this will be as freakish as images made up of averaged human faces. It will be just another scary ‘robot’ on the Uncanny Valley continuum.
A speaking computer is like a technically incompetent voice actor. I use the term actor deliberately. Too many computer speech specialists think of the voices of machines as pretend people; they are not pretend people, they are pretend actors pretending to be people. This is not just a quibble; it is an important distinction for the listener’s suspension of disbelief. A person has a body, a history, a mum and dad, a mortgage and so on; an actor has the potential for many manifestations of these properties; a pretend actor has these properties squared. Thus the palette of creative possibilities derived from combining these properties is that much greater.
In the old days incompetent voice acting would be fixed by elocution lessons (hence the esteemed Royal Central School of Speech and Drama). These would initially address just the technical aspects of speech production: breathing, range, tone of voice, diction, projection and so on. Once that had been fixed the lessons would progress to aspects of interpretation: matching the voice to the text, creating characters, accents and emotions. Modern speech synthesis developers suggest we are somewhere between these two stages – intelligibility is largely solved, but other features, such as emotional speech, are still far from resolved. There is an assumption that the journey from bad actor to good actor for a computer voice actor is fixable like that of a human voice actor as described above. This is wrong. Bad acting by a human actor is forgivable; bad acting by a machine denotes broken and is unforgivable. An initial bad choice by a human actor is assimilated into the experience without interrupting the flow of the performance toward a proper state of willing suspension of disbelief, whereas any bad choice by a computer actor irrevocably branches the flow of the performance away from willing suspension of disbelief toward no suspension of disbelief at all. Thus my point is that for the computer actor it is a ‘one strike and you are out’ situation. Given that a strike usually occurs within a few seconds of hearing a computer voice, anything that is done after that first strike is a waste of time. The performance is dead in the water and the only thing left to do is to give the audience their money back. Outside the theatre there may as well be a sign saying ‘no humans are involved in the performance so do not expect to believe in it or enjoy it – it’s a charade.’
So my third question is: does the community recognise that once a computer voice has shown itself to be just that – a computer voice, not a human voice – any attempt to pretend to be human from then on is futile? Or to put it another way, once the magician has shown how the trick works it is no longer magic.
There is an argument that talking computers are not attempting to fool people into believing they are human. That is a sensible objective; however, it does not take into account what the user actually thinks the voice is doing, as opposed to what the designer has set out to persuade the user to think. Basic knowledge of the history of HCI shows how often this bifurcation occurs. Unless the voice spends valuable time explaining what it is up to, along the lines of “don’t be fooled, I am only pretending to be human”, the first reaction of any user will be to assume the voice is human. When you first hear a dog bark you expect the source to be a dog, not a recording of a dog or a dog impersonator. Once the user has perceived the voice to be a machine – usually in a matter of seconds – the relationship between the two changes from person-to-person to person-to-performer-of-a-person. As a performer of a person with clearly no human source, the expectation is that the persona projected will not have genuine human attributes such as emotions; in fact human-like emotions are more likely to be a distraction, in so far as they merely provide opportunities for the designer to display virtuosic vocal capabilities that the listener knows to be fake.
So my fourth question is: What is the point of a computer voice pretending to experience an emotion if nobody will ever believe it? Isn’t this like a footballer diving for a penalty when everyone can see the ball is at the other end of the pitch and no other players are nearby? Isn’t this more about the research community showing off how clever they are rather than making viable speech applications and improvements to the user experience?
The counter argument to all this rests in the answer as to whether true human verisimilitude is the goal of the computer speech design community, or whether something close to human verisimilitude is simply the best way of optimising intelligibility and thus making viable applications that use computer generated speech. If the former is the case then I would say it is a goal which is unlikely to be achieved until a comprehensive model of a human in a comprehensive model of the universe has been built. By then it won’t be necessary, because we will be communicating telepathically or through the iPhone 61000. After all, speech is just an imperfect way of audibly communicating the human perception of the infinitely complex system we are part of; until there is a model of that system, any synthetic render will be imperfect squared. If the goal is the latter, more realistic one, then there may be many other ways of optimising intelligibility that don’t confine the research community to futile efforts to render human verisimilitude.
Here’s my final question. This is a new one. Any actor will tell you that critical to a good performance is listening afresh to what is being said to you by the other characters. One should try to hear things for the first time. An actor who trots out their lines without listening will sound very phoney – listen to inexperienced child actors to get the idea. The process of listening is complex. You need to listen to yourself, to the other characters and, in a live situation, to the audience. Information from these three sources will change how you say your lines, such that two performances of the same play are likely to be very different indeed. Is anyone considering the listening factor when designing computer speech? Even assuming the system is not intended to be interactive, it still seems viable to have some means by which the voice can listen to itself, modify its output and learn by previewing its next utterance against an archive of previous utterances. One of the most telling factors – and I am sure there is an associated linguistic theory (someone tell me, please) – is an awareness that something you are saying has already been said, either by you or by somebody else. This can be a conceptual reference or a literal repeat; either way it is critical to make the appropriate change in inflection to denote awareness of the iteration, otherwise intelligibility is severely undermined.
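For the technically minded, the self-listening idea can be sketched very simply: keep an archive of everything the voice has said, and before each new utterance check it for near-repeats so the renderer can change the inflection. This is a toy illustration under my own assumptions, not a real synthesis component – the `UtteranceArchive` name and the 0.8 similarity threshold are inventions for the sketch, and the string similarity comes from Python’s standard-library `difflib`.

```python
from difflib import SequenceMatcher

class UtteranceArchive:
    """Remembers what has already been said so a voice can mark repeats.

    A minimal sketch of 'listening to yourself': before speaking, the
    system compares the next utterance against everything said so far
    and, on a near-repeat, flags it so the renderer can alter inflection.
    The 0.8 similarity threshold is an arbitrary illustrative choice.
    """

    def __init__(self, threshold=0.8):
        self.threshold = threshold
        self.history = []

    def preview(self, utterance):
        """Return (utterance, is_repeat) and log the utterance."""
        is_repeat = any(
            SequenceMatcher(None, utterance.lower(), past.lower()).ratio()
            >= self.threshold
            for past in self.history
        )
        self.history.append(utterance)
        return utterance, is_repeat

archive = UtteranceArchive()
print(archive.preview("To be or not to be"))   # first time: not a repeat
print(archive.preview("To be, or not to be"))  # near-repeat: change inflection
```

A real system would of course compare meanings as well as strings – a conceptual reference, not just a literal repeat – but even this crude memory would let a voice signal ‘I know I have said this before’.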
Why do these things bother me? Since properly encountering the scientific community in the last ten years I have been deeply impressed. The scientists I have met are among the most creative, imaginative and insightful people I have ever encountered in a 35 year career amidst some of the so-called artistic and theatrical elite; in fact at first I was foolishly besotted, rejecting my artist colleagues as plonkers while I sought the exclusive company of my new gurus. I have calmed down now, but I remain eager to persuade the scientific community that the ‘rogues and vagabonds’ of theatre – incapable of seeing the world numerically, often incapable of fixing on any single theory and often quite uninterested in evidence, preferring intuition – do know something about pretending. Given that the art of artificial voices is about pretending with knobs on, we really ought to work together to come up with some cool fun stuff.
I have just spent about 40 hours modifying a performance of a poem by a DECtalk DTC01. It’s a fantastic tool with loads of controls. I manipulated just the speech rate, pause duration and stress, although much, much more is possible. I believe the result is now just about intelligible, and by that I mean little more than being able to understand the words in the context of the verse and the poem as a whole. Forty hours spent on one poem, manipulating only three variables, by an experienced director of speech – we should not underestimate how difficult these challenges are, nor how unlikely it is that a partial solution will get anywhere close to ‘really working’, or more specifically to persuading users that computer generated speech is any good at all at acting human. Time for a change in direction? How about computer speech with a musical accompaniment to denote the emotional content?
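For those who want to try something similar without a DECtalk to hand: the DTC01 took its own inline control codes, but a modern rough equivalent of the three variables I manipulated – rate, pause and stress – would be SSML’s `<prosody rate>`, `<break time>` and `<emphasis>` tags (those tag names come from the W3C Speech Synthesis Markup Language specification; the little helper below and all its parameter values are hypothetical and illustrative, not what I actually typed into the machine).

```python
def mark_up_line(text, rate="80%", pause_ms=400, stressed=()):
    """Wrap one line of verse in SSML-style prosody controls.

    Covers roughly the three variables manipulated on the DECtalk:
    speech rate (<prosody rate>), pause duration (<break time>) and
    stress (<emphasis> on chosen words). Values are illustrative only.
    """
    words = [
        f"<emphasis>{w}</emphasis>" if w.strip(".,;!?") in stressed else w
        for w in text.split()
    ]
    return (f'<prosody rate="{rate}">{" ".join(words)}</prosody>'
            f'<break time="{pause_ms}ms"/>')

print(mark_up_line("Do not go gentle into that good night",
                   rate="70%", pause_ms=600, stressed={"not", "night"}))
```

Even with a tidy markup language the hard part remains exactly where it was on the DECtalk: choosing, word by word and line by line, what the settings should be.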