The First Voice of the Future

Another music post right on the heels of the last one, but this time it is Something Completely Different. This spring our school held a media technology seminar meant to prepare us for the thesis in terms of both presenting and writing, and I chose as my topic the Vocaloid software, the fan culture that has grown up around it, and its possible significance in the future, all from a rock-solid analytical engineering perspective, of course!

Could this be the first time an academic text has been published on a Finnish blog? For obvious reasons the style differs quite a lot from ordinary blog writing, and the text is in English and fairly long, but perhaps someone will still find it interesting reading!

1. Introduction

Synthesizers have been used in music for several decades to emulate instruments and provide entirely new sounds. While voice synthesizers have also existed for a long time, they have mainly been accessibility tools, and actual singing synthesizers were not in use until recently.

However, with Yamaha’s release of the Vocaloid voice synthesizer in 2003 and the evolution of the internet as a distribution channel for musicians who otherwise would have very little possibility to distribute their works, the vocal synthesizer has created an entirely new genre of indie music that is also penetrating into the mainstream in some parts of the world.

This paper explores voice synthesis technology, its applications, and what the possibility of a vocal synthesizer and an associated virtual idol can mean for a musician.

2. History of Voice Synthesizers

Historically, voice synthesizers have been used mainly for accessibility; text-to-speech programs have existed for over 30 years, and their primary function is assisting visually impaired people in computer use. The first home computer with text-to-speech capability was the Atari 1400XL, released in 1983 [1]. A year later, all Apple Macintosh home computers had MacInTalk software installed on them, which could read any text input aloud [2]. Apple has continued developing its text-to-speech program, and new versions of it can be found in OS X and even on the iPhone and iPod Touch. These newer versions sound surprisingly good and convincing, at least when the operating system language is set to Finnish. However, even though these text-to-speech applications have developed greatly over time, they are purely accessibility tools and hold no actual entertainment value.

3. Vocaloid Software From Yamaha

3.1. Introduction

Twenty years after the first text-to-speech computer applications, the electronics and musical instrument company Yamaha introduced a voice synthesizer called Vocaloid. In contrast to the accessibility focus of previous synthetic voice programs, the purpose of Vocaloid was to give musicians a tool for synthesizing the human singing voice.

Yamaha does not sell the Vocaloid software itself commercially to private parties. Instead, specialist companies such as Zero-G from Great Britain and Crypton Future Media from Japan have created their own voice libraries utilizing Vocaloid technology, and these products are available to everyone [3]. In 2007 a revision of the program, Vocaloid2, was released [4].

3.2. Working Mechanics

The Vocaloid engine itself consists of three parts: the Score Editor, the Singer Library and the Synthesis Engine. The Synthesis Engine, provided by Yamaha, is the core of the program: it selects the best-fitting samples from the Singer Library, which is provided by third-party companies, based on what the end user producing the music has entered in the Score Editor. [5]

The Score Editor features a piano-roll interface, where the person creating the music can input lyrics and adjust pitch, timing and various modifiers to create the desired, realistic voice, as seen in Figure 1.

Fig. 1: The Score Editor view of Vocaloid2 [6].

As the screenshot of the Score Editor shows, the horizontal axis represents the timing of the song, while the vertical axis represents the pitch. Lyrics can be typed either in English or in Japanese. When the language is English, the program converts ordinary words into phonetic expressions; when the language is Japanese, hiragana, katakana or Romanized spelling can be used. Various modifiers can be dragged onto the lyrics, or a previously composed MIDI melody can be imported into the program and lyrics added on top of it. Synthesizing the spoken word is also possible, but achieving a natural-sounding intonation is fairly difficult, as the program is designed for synthesizing a singing voice. The Score Editor is the only part of the Vocaloid software that is visible to the actual end user. [7]

The hardest part of creating a natural-sounding Vocaloid, that is, a synthetic voice utilizing the Vocaloid technology, is the creation of the Singer Library. Third-party companies such as Zero-G or Crypton Future Media create Singer Libraries by taking samples of a real singer’s voice and storing the samples in a database. All combinations of different phonemes must be recorded for each pitch, and depending on the language being sampled, a lot of work may be required. In English, there are approximately 2000-2500 phoneme combinations per pitch that must be sampled, and for efficient recording there is a special script that covers all the possible combinations with a minimal amount of repetition. [5] Other languages are phonetically different. Japanese, for example, requires only approximately 500 different samples per pitch, and the number for Finnish is probably rather close to that. In fact, due to the phonetic similarity of Japanese and Finnish, it is possible to use the current Japanese-language Vocaloids to synthesize singing in Finnish with a little creativity and imagination.

The Synthesis Engine then ties the user input from the Score Editor together with the sample set which is provided by the chosen Singer Library. Based on the lyrics, proper samples are picked from the library and transformed and scaled using mathematical methods to create the final voice output. [5]
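The source describes the engine's internals only as picking samples and transforming them "using mathematical methods", so the following is a toy sketch of the general idea and not Yamaha's actual algorithm. It assumes a hypothetical library that maps each phoneme to the MIDI pitches at which it was recorded, picks the nearest recording for every note, and computes the equal-temperament frequency ratio needed to shift that recording to the target pitch. All names (`Note`, `SINGER_LIBRARY`, `pick_sample`, `pitch_ratio`, `render_plan`) are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    """One piano-roll note: a lyric phoneme, a target pitch and a length."""
    phoneme: str
    midi_pitch: int
    duration_s: float


# Hypothetical singer library: phoneme -> MIDI pitches at which it was recorded.
SINGER_LIBRARY = {
    "mi": [60, 67, 72],
    "ku": [60, 67, 72],
}


def pick_sample(phoneme, target_pitch, library):
    """Return (recorded_pitch, shift_in_semitones) for the closest recording."""
    recorded = library[phoneme]
    nearest = min(recorded, key=lambda p: abs(p - target_pitch))
    return nearest, target_pitch - nearest


def pitch_ratio(semitones):
    """Frequency ratio for a shift of the given number of semitones."""
    return 2.0 ** (semitones / 12.0)


def render_plan(notes, library):
    """For each note, list which sample to use and how far to shift it."""
    plan = []
    for note in notes:
        nearest, shift = pick_sample(note.phoneme, note.midi_pitch, library)
        plan.append(
            (note.phoneme, nearest, shift, pitch_ratio(shift), note.duration_s)
        )
    return plan


# Example: "mi" sung at MIDI pitch 69 uses the recording at 67, shifted up 2 semitones.
print(pick_sample("mi", 69, SINGER_LIBRARY))  # (67, 2)
```

A real engine would also have to stretch the samples in time and smooth the transitions between them; the sketch only covers the lookup and the pitch-shift factor.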

Initially, however, the program attracted interest mainly among professional music producers and did not achieve any mainstream success until Crypton Future Media’s first release based on the Vocaloid2 engine.

4. Vocaloid Hatsune Miku

4.1. Introduction

Crypton Future Media released their first Vocaloid2 product on August 31, 2007. Their idea was to expand the concept of Vocaloid from a piece of computer software and a purely electronic voice into an actual virtual singer, bringing the product closer to the general public and not just to professionals. [8]

Manga artist KEI was hired by Crypton to create a visual image for their “Character vocal series” of Vocaloids. The first and most notable of the series was CV01 Hatsune Miku, “an android diva from the near-future world where songs are lost”. Miku’s Japanese name, 初音ミク, translates into English loosely as “the first voice of the future”. KEI’s design, shown in figure 2, was based on the signature blue-green color of Yamaha’s synthesizers and the concept of an android singer. [9]

Fig. 2: Character sketches of Hatsune Miku along with the finished design [8].

The approach of utilizing the manga-like character in figure 2 proved extremely popular, and Hatsune Miku was a huge sales success: the initial demand was so high that Crypton could not keep retailers supplied with the program. During the first two weeks of sales, nearly 3000 reservations were made, which for music software is a very large number. In September 2007, Amazon’s Japanese web store reported that the sales of Hatsune Miku totaled over 57.5 million yen (approximately €500,000), making it the best-selling software at the time. [10]

Achieving high sales figures did, however, require a lot of effort on Crypton’s part. As the Vocaloid voice banks are constructed by sampling real singers, the search for the voice of Miku proved somewhat challenging. Initially, Crypton Future Media approached professional vocalists, who refused to provide their voice samples, fearing that if the software replicated their voice, there would no longer be any demand for them. After this, Crypton changed their approach, and a lesser-known voice actress, Fujita Saki, was chosen to provide her voice for Miku. [11] Curiously, the concerns of the professional singers might have had some merit, as Hatsune Miku is very popular while Fujita Saki herself has only managed to release one CD single. Another explanatory factor could equally well be that a musical career is not her aim [12]. However, there are also a few voice banks of professional singers, such as Internet Co. Ltd.’s Lily, which samples Yuri Masuda, vocalist of the electronic band m.o.v.e. [13], and Gackpoid, which samples the rock singer Gackt [14], and (at least to my knowledge) the Vocaloid products have not decreased the original singers’ popularity at all.

Crypton continued their Character vocal series with two additional voicebanks – CV02 Kagamine Rin & Len (young boy/girl voice provided by Shimoda Asami) and CV03 Megurine Luka (a more mature, bilingual female voice sampled from Asakawa Yuu), and even though they are also fairly popular, they have not achieved Miku’s breakthrough level of success. [15, 16] So far, over 20 different singer libraries have been released by various western and Japanese companies. [17]

4.2. Cultural Impact

As previously mentioned, Crypton Future Media’s Character vocal series of Vocaloid software was a breakthrough success, and the sales numbers it achieved also meant that a lot of music was produced using the software. The key element in the Vocaloid phenomenon was the Japanese YouTube equivalent Nico Nico Douga (nicovideo.jp). The copyright and monetary concerns of Vocaloid products will be discussed later on, but in the initial phase Vocaloid songs were created by indie producers and released free of charge on Nicovideo, encouraging people who liked the music to create derivative works based on the songs. Remixes, and in particular artwork and videos featuring the Vocaloid characters, started surfacing rapidly, and now a search for Vocaloid or Hatsune Miku on either YouTube or Nicovideo gives hundreds of thousands of results.

However, some artists have managed to use the Vocaloid software as a springboard to actual mainstream success, the most significant of these being a J-pop group called Supercell. Supercell consists of a musician called Ryo and a group of artists who provide supporting imagery and videos for the songs. In late 2007 Ryo uploaded his song “Melt” to Niconico, where it became a huge hit and has so far generated nearly 7 million views [18]. After several more Niconico hits and an independent album release, Supercell signed a contract with Sony Music, and their 2009 self-titled debut album, with Hatsune Miku providing vocals for all songs, peaked at number four on the Japanese Oricon album sales chart, with 56,000 copies sold in the first week [19]. Currently, in the spring of 2011, Supercell is working on their follow-up album “Today Is a Beautiful Day” with an actual human singer. [20]

Obviously, not everyone has a chance at signing with Sony Music. To assist musicians who utilize Vocaloids commercially, Crypton Future Media also runs a record label called KarenT, through which songs are distributed globally, for example on iTunes and AmazonMP3. [21] A search in the Finnish iTunes store for Hatsune Miku returns nearly a thousand results, mainly from KarenT.

In addition to more conventional musical achievements, Hatsune Miku has also participated in GT racing [22] and space travel [23], so it is fair to say that Vocaloid has managed to breach the boundaries of being considered just a synthesizer program.

5. Live Performances

With Vocaloid songs having established their popularity during the first few years after the release of Crypton’s Character Vocal series, the technology was also brought to live stages utilizing various projection technologies. After a few test performances at larger events, a Vocaloid solo concert titled “Miku no Nichi Kanshasai 39’s Giving Day” was organized in early 2010 as a joint project between Crypton Future Media, Tokyo Metropolitan Television and SEGA [24], with another show in the works to be held in Tokyo on the 9th of March 2011.

Very little information about the technology utilized in the concerts has been released, which caused a lot of confusion when Western mainstream media started reporting on the Vocaloid concerts. The Los Angeles Times reported that Miku performed as a 3-D hologram, and various other news websites followed suit, using the term “3-D hologram”. However, based on my personal observations, the technology involved is a rather innovative way of utilizing a centuries-old traditional projection technique, even if the performances look 3-D.

A transparent screen is set up on the stage, with projectors behind it. Just as when a traditional front-mounted projector is used on an opaque canvas, the projector lights up only the parts of the canvas where light is needed, while the black parts of the image remain unlit, as a lamp obviously cannot generate blackness. When the same principle is applied to a transparent screen (with rear projection for convenience), the black, unlit pixels remain transparent, and the projected character appears to be floating, as can be seen in figure 3.
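The principle described above can be sketched in a few lines of Python. This is a deliberately simplified model of my own, not a description of the actual concert setup: it assumes that what the audience sees at each pixel is simply the light from the stage behind the screen plus the light added by the projector, clamped to the displayable range. The function name `composite` and the sample pixel values are illustrative assumptions.

```python
def composite(stage, projected):
    """Add projected light to the stage seen through a transparent screen.

    Both images are lists of rows of (r, g, b) tuples in the range 0..255.
    Light can only be added, never removed, so a black projected pixel
    (0, 0, 0) leaves the stage pixel behind it unchanged.
    """
    return [
        [
            tuple(min(255, s + p) for s, p in zip(stage_px, proj_px))
            for stage_px, proj_px in zip(stage_row, proj_row)
        ]
        for stage_row, proj_row in zip(stage, projected)
    ]


# A dim stage, one row of two pixels.
stage = [[(40, 40, 60), (40, 40, 60)]]
# Left pixel is black (nothing projected), right pixel is part of the character.
projected = [[(0, 0, 0), (0, 200, 180)]]

result = composite(stage, projected)
print(result)  # [[(40, 40, 60), (40, 240, 240)]]
```

The left pixel shows the stage untouched, which is why the character seems to float. The model also suggests why a dark stage helps: the dimmer the background, the more the projected pixels stand out.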

Fig. 3: Miku performing “World is Mine” live

As figure 3 shows, the effect is very convincing, but it is not true 3-D in the sense that the image is still displayed on a flat screen and can be viewed only from one angle, unlike a true hologram which would have volume and could be viewed from anywhere. This limits the venue and audience placement slightly. I’m interested in seeing if the same technology is still used in this year’s concert or if they’ve come up with something different.

As with all synthetic humanoids, an effect called the uncanny valley is also important to consider for both the voice and the appearance of the Vocaloid performances. The roboticist Mori Masahiro observed that when something is very close to human in appearance but not quite, it provokes a strong negative reaction in people; this is called the uncanny valley effect. For example, prosthetic body parts, representations of zombies, humanoid-looking robots and, more recently, computer-generated 3-D characters can trigger the uncanny valley effect and cause dislike in people. However, stylizing the characters reduces the negative reaction and can even eliminate it completely. This was demonstrated well in late 2004, when the 3-D films “The Incredibles” and “The Polar Express” were released at roughly the same time. The heavily stylized and cartoonish Incredibles became a hit like many other Pixar films, while The Polar Express, which aimed for a super-realistic 3-D style, seemed to create the uncanny valley effect, caused great discomfort in viewers and was a total flop. [27]

The reaction is always up to the viewer, but I find the 3-D avatars of Crypton’s characters so stylized that the reaction they generate is not negative. In fact, I noticed that watching the stylized 3-D character singing a song actually also reduces the synthetic feel of the voice, somehow softening the effect.

6. Virtual Idols

The concept of a virtual idol was already explored 15 years ago in science fiction, in works like William Gibson’s 1996 book “Idoru” and the 1995 anime film “Macross Plus”. However, both of these feature the virtual idol as a single artificial being, while Hatsune Miku and the other Vocaloid idols are crucially different in that they are available for anyone to use.

The software can be purchased, for example, on Amazon, with the price currently being ¥13,356, approximately €115 [28]. After the software has been purchased, any music made with it belongs to its creator; neither Yamaha, Crypton Future Media nor whoever contributed the voice samples requires further compensation or royalties per song.

The characters, however, are owned by the respective companies that have designed them. Crypton Future Media has come up with a special “Piapro” licensing system that allows the use of the characters, even in commercial works, as long as Crypton is made aware. They have also established the piapro.jp website to encourage the users of the website to post material such as music, pictures and video they create to allow others to create further derivative material based on those works. [29] This goes well with the original spirit of distributing the songs on Nicovideo, and with the previously mentioned KarenT label, it is possible for beginning musicians to progress from free independent Piapro-licensed releases to commercial releases through KarenT on iTunes and AmazonMP3, and then perhaps be discovered by larger record labels.

7. Conclusion

It is clear that Vocaloid technology has made an impact, if not on the entire music industry, at least on a large number of both composers and listeners. Another important observation is that the music industry is not really even necessary anymore, as the internet makes it possible for amateur musicians to find their audience even without a proper record label.

The addition of a recognizable image for the music obviously helps as well, and the concept of a virtual idol available for everyone to use also pushes Vocaloid past the boundaries of what is traditionally considered an instrument. Liking “music that uses guitars” covers a fairly broad field ranging from pop to extreme metal, while liking “music that uses Vocaloids” is somehow more defining, as Vocaloids, while practically being instruments, usually appeal to a similar kind of audience.

Overall, I think the best thing about the technology is that it allows musicians such as Supercell’s Ryo to use the image of Vocaloids to gain recognition in the music industry, and perhaps someday to work with legendary musicians who would otherwise have been impossible to contact and collaborate with.


