Vocaloid: Give Me Your Vowels, Your Syllables, Your Huddled Consonants...
by Marjorie Dorfman

The whole of science is nothing more than a refinement of everyday thinking.
– Albert Einstein

How can new software duplicate all the voices since the beginning of recorded time? Vocaloid is the answer, and it promises to take the music world by storm. Read on for more information on a startling innovation.

Surely one of the most amazing new developments has to be Vocaloid, the incredible singing synthesis technology created by Zero-G Limited, a world leader in digital audio, and the Yamaha Corporation, the world’s largest manufacturer of musical instruments. Presented at the 2003 Musikmesse in Germany and the Audio Engineering Society convention in the Netherlands last March, Vocaloid mimics the singing voice with disembodied but surprising accuracy. According to Bill Werde of The New York Times, this new software promises to revolutionize the copyright repercussions of the phrase "losing one’s voice."

Yamaha’s Chief Technical Officer, Masatada Wachi, says, "the software features simple commands enabling users to add expressive effects and as Vocaloid runs on Windows-based PCs, we hope that amateur enthusiasts as well as professionals will enjoy creating music with great sounding vocals." According to Zero-G founder and Managing Director, Ed Stratton, "the precise control of melody, words and expression represents a huge leap in potential for everyone, from producers of classical music right through to those working at the cutting edge of dance music and more experimental genres. The results…really have to be heard to be believed."

Vocaloid allows songwriters to generate superb, authentic-sounding singing on their PCs by simply inputting the words and notes of their compositions. In other words, for $200 (the estimated cost of the first generation of software) I can rig my cyber universe so that Elvis will hold all my calls, Nat King Cole can wake me in the morning and Julio Iglesias can read to me at night (or at least on those nights when I’m not in the mood to listen to the sad ballads of my favorite female vocalist, Billie Holiday).

The amazing process begins with recordings of professional male and female vocalists singing specially constructed phrases of nonsense words with all possible transitions between syllables. The transitions are slightly different depending on the combination of speech sounds called phonemes (not "phone home" of E.T. fame). Those differences are a big part of how we understand words and why a vocal track sounds either natural or artificial. For example, the phoneme p sounds slightly different at the beginning of a word than it does at the end, and it affects the vowels next to it differently than the phoneme t.
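
For the technically curious, here is a minimal sketch of what "all possible transitions" means in practice. It assumes, purely for illustration, that a transition is a pair of adjacent phonemes and that silence (labeled "sil" here) counts at both ends of a word. The function names and the tiny phoneme inventory are hypothetical and are not taken from Vocaloid itself.

```python
def transitions(phonemes):
    """List the adjacent-phoneme transitions needed to sing one word.

    Silence ("sil") is added at both ends, because a phoneme sounds
    different at the start of a word than it does at the end.
    """
    padded = ["sil"] + list(phonemes) + ["sil"]
    return list(zip(padded, padded[1:]))


def all_transitions(inventory):
    """Every ordered pair a recording session would have to cover."""
    return [(a, b) for a in inventory for b in inventory]


# The word "pat" as three phonemes (a simplified, made-up spelling).
pat = transitions(["p", "a", "t"])
# pat == [("sil", "p"), ("p", "a"), ("a", "t"), ("t", "sil")]

# A toy inventory of four sounds already needs 16 recorded pairs.
coverage = all_transitions(["sil", "p", "a", "t"])
```

The number of pairs grows with the square of the inventory size, which may be one reason the singers are handed nonsense phrases built to pack in as many transitions as possible.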

Sound is synthesized from "vocal libraries" of recordings of actual singers, retaining the qualities of the original singing voices to reproduce real-sounding vocals. Up to now, synthesizers worked well in the simulation of musical instruments but could never duplicate the human singing voice, except in a monotone or robotic way. This is due to the wide range of timbres, articulations and transitions between sounds. Also, the combination of lyrics as well as melody results in a double layer of meaning not found in other instruments. The human ear, too, is so attuned to the voice that the subtlest tonal shifts, errors or anomalies are immediately apparent.

Recorded phrases are converted to the frequency domain and divided into separate phonetic transitions. Those elements are then stored in a phonetic database for use with the synthesis engine. Expressive elements such as vibrato are also extracted and stored in a separate database. To create a vocal track, music and lyrics are entered into the score editor. To synthesize vocal parts, the system retrieves data consisting of voice snippets, applies pitch conversion and splices and shapes them to form the words of a song. As this processing is done at the frequency-domain level, pitch can be easily changed according to the specific melody, and the voice snippets can be spliced in a way that reproduces smooth flowing words.
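
The two steps described above, pitch conversion and splicing, can be sketched in a few lines. This is a toy under stated assumptions and not Yamaha's actual algorithm: it uses a slow textbook Fourier transform, changes pitch by simply moving each frequency bin to a new position, and joins two snippets with a linear crossfade. The names `shift_pitch` and `crossfade` are invented for this example.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform: time domain to frequency domain."""
    n = len(samples)
    return [sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse transform: frequency domain back to real-valued samples."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def shift_pitch(samples, ratio):
    """Crude pitch conversion: move every frequency bin k to round(k * ratio).

    Bins that would land at or above half the snippet length are dropped,
    and the constant (bin 0) component is ignored.
    """
    n = len(samples)
    spectrum = dft(samples)
    shifted = [0j] * n
    for k in range(1, n // 2):
        target = round(k * ratio)
        if 0 < target < n // 2:
            shifted[target] += spectrum[k]
            # Mirror the bin so the result is a real-valued signal.
            shifted[n - target] += spectrum[k].conjugate()
    return idft(shifted)


def crossfade(first, second, overlap):
    """Splice two snippets, blending the last `overlap` samples of the
    first into the first `overlap` samples of the second."""
    if not 0 < overlap <= min(len(first), len(second)):
        raise ValueError("overlap must fit inside both snippets")
    start = len(first) - overlap
    blended = [first[start + i] * (1 - i / overlap) + second[i] * (i / overlap)
               for i in range(overlap)]
    return first[:start] + blended + second[overlap:]


# A 32-sample "snippet" holding 4 cycles of a sine wave.
N = 32
snippet = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]

higher = shift_pitch(snippet, 2.0)      # one octave up: 8 cycles in 32 samples
phrase = crossfade(snippet, higher, 8)  # 32 + 32 - 8 = 56 samples
```

A production system would work on short overlapping frames and take care to preserve the character of the voice. This sketch only shows why moving things around in the frequency domain makes retuning a snippet straightforward.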

If you are anything like me, you hear a word like snippet and think it refers to a Muppet with an attitude. Rest assured you are not alone and I will attempt to define these terms, even though they are Greek to me as well. A snippet is an object that acts like a little library with a specific code in it. Still not clear? Well, join the club. Does that mean I need a card? Am I allowed to borrow or must I always lend? Where do I go to return what I have borrowed? These and other penetrating questions leave me in the dark (where I suppose a low techie like me belongs).

Another term that I found most confusing was frequency domain. Research informed me there were two domains, one time and one frequency, but what that means (at least to me) is about as clear as mud. My first thought, due to the familiarity of the word domain, placed it as a spot where frequency lives, but I secretly suspect only contempt arising from where familiarity once stood. The closest I can come to understanding it is to refer to it as a place where frequencies frequent and frequently cause both trouble and good things.

Bill Werde of The New York Times, in his article "Could I Get That Song in Elvis Please?", says, "the market for synthesized voices extends well beyond recorded music. For example, cell phone ring tones, a rapidly expanding field, already use synthesized voices to personalize incoming calls. The DA Group, a Scottish company, uses patented technologies to animate several popular stars, including Ananova, the British newscaster who exists solely online…and Maddy, the bank teller avatar who is being tested on ATMs in several markets around the United States."

What this may mean for all of us in the near future is a new paranoia arisen from the ashes of a hi-tech phoenix. The next time you call your aunt, can you really be sure it’s her on the other end? What if some alien force decided to snatch voices instead of bodies, like in that old sci-fi thriller of the 1950s? Instead of giant pods, alien voices might rest in cradles resembling phone receivers (a new kind of call waiting) until they get the strength to take over all the real voices in the world. Telephone wires and brain waves might even mesh and create a new vocal monster, Frankenoid, all equipped with cloddy shoes and nuts, bolts and a key that turns for brains (E flat or A sharp?). In short, nothing would ever be as it seems unless what it seems is what it is! (Huh?)

I have to stop now because I am getting a headache. I think I’ll call my sister. But even if she answers, can I be sure that she’s really at home?

Copyright 2004