


Cornell University - Voice Cloning from 5 Seconds of source



"We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples"
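For anyone who wants to see how the three stages in that abstract chain together, here is a rough toy sketch in Python. It is not the paper's code: the function names, the tiny dimensions (`EMBED_DIM`, `N_MELS`, `HOP`), and the arithmetic inside each stage are all made-up stand-ins, since the real components are trained neural networks. It only shows the data flow: reference audio goes to a fixed-size embedding, text plus embedding goes to mel frames, and mel frames go to waveform samples.

```python
import math
import random

# All sizes below are hypothetical toy values, not figures from the paper.
EMBED_DIM = 8   # length of the fixed-dimensional speaker embedding
N_MELS = 4      # mel bins per spectrogram frame
HOP = 16        # waveform samples produced per mel frame


def speaker_encoder(reference_audio):
    """Stage 1 stand-in: variable-length audio -> fixed-size, unit-norm vector."""
    chunk = max(1, len(reference_audio) // EMBED_DIM)
    raw = [sum(reference_audio[i * chunk:(i + 1) * chunk]) / chunk
           for i in range(EMBED_DIM)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


def synthesizer(text, embedding):
    """Stage 2 stand-in: text -> mel frames, conditioned on the embedding."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0  # crude per-character "content" value
        frames.append([base + embedding[m % EMBED_DIM] for m in range(N_MELS)])
    return frames


def vocoder(mel):
    """Stage 3 stand-in: mel frames -> time-domain waveform samples."""
    wave = []
    for frame in mel:
        amp = sum(frame) / len(frame)
        wave.extend(amp * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return wave


def clone_voice(text, reference_audio):
    """Chain the three independent stages in the order the abstract lists them."""
    embedding = speaker_encoder(reference_audio)
    mel = synthesizer(text, embedding)
    return vocoder(mel)


rng = random.Random(0)
reference = [rng.uniform(-1, 1) for _ in range(500)]  # fake "reference speech"
waveform = clone_voice("hello", reference)
print(len(waveform), "samples generated")
```

The point of the split is that only the first stage ever touches the target speaker's audio, which is presumably why a few seconds of speech is enough: the synthesizer and vocoder just receive a vector.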

 

https://arxiv.org/abs/1806.04558

 

[video:youtube]

Yamaha CP88, Casio PX-560





That's really cool. Do you think he was drawn to this work because his own voice sounds exactly like a machine produced it?

I thought he was going to jump out from behind the curtain and shout "TADA! This entire voiceover was TTV from a 5 second sample of my voice."

 

Whenever you find yourself on the side of the majority, it is time to pause and reflect.

-Mark Twain

 


Another thing that occurred to me: how would it perform across different languages? One example had an English woman with an English accent converting English text. Okay, what if they had a person speaking Spanish into the sampler? Would the English text sound like their voice speaking Spanglish? Or Mandarin to French? It would be truly amazing if it could convincingly capture the inflection, intonation, and cadence of dissimilar languages.

Whenever you find yourself on the side of the majority, it is time to pause and reflect.

-Mark Twain

 


T-800 speaking to the T-1000 on the phone using young John Connor's voice:

 

"How's Wolfie?" T-1000 (in his mother's voice) "Wolfie's just fine dear, when are you coming home?"

 

T-800 hangs up and says to John: "Your mother is dead."

 

They not only can deep fake us, they can deep fake each other.

 

Other than that fun thought, I think this is pretty cool

Hammond SK1, Mojo 61, Kurzweil PC3, Korg Pa3x, Roland FA06, Band in a Box, Real Band, Studio One, too much stuff...

I think languages and accents, regional habits and colloquialisms, word choice, etc. would be a lot more backend work (sourcing a speaker to do the phrasing) before applying the other person's timbre. Of course, with public officials who are on the mic all the time, it's easy for a creative person to put together entirely believable performances, not unlike what an impressionist does.

 

In other uses, I can imagine AI capturing the timbre and phrasing of, say, Eric Clapton's playing and applying it to songs he never played.

 

 

Yamaha CP88, Casio PX-560


There are actually some VERY positive uses for this too. My wife is getting her PhD in linguistics, and computer modeling is absolutely huge in the field. Even outside of voice reconstruction, linguistics uses a lot of statistical modeling in areas like language documentation, to help save endangered languages, which are disappearing across the planet at an alarming rate (along with cultural identity). Linguists in the field are using modeling to more precisely document and eventually retrain indigenous cultures whose native speakers are aging out. We have a friend who just got accepted to the PhD program at Cornell, a fantastic school (even if they're still mired in dated Chomskyan theory at times). I wouldn't be surprised if she gets involved in this project.

Puck Funk! :)

 

Equipment: Laptop running lots of nerdy software, some keyboards, noise makers…yada yada yada…maybe a cat?


I can think of one excellent use of this. They can reproduce Majel Barrett-Roddenberry's voice and use it for all the computers on all new Star Trek projects!

 

Personally, I want her voice for Siri as well, at least on my own devices. But good luck with the rights to that.

"I'm so crazy, I don't know this is impossible! Hoo hoo!" - Daffy Duck

 

"The good news is that once you start piano you never have to worry about getting laid again. More time to practice!" - MOI


We have a friend who just got accepted to the PhD program at Cornell
Tell your friend to drop by the Language Resource Center and say hi!

 

I can think of one excellent use of this. They can reproduce Majel Barrett-Roddenberry's voice and use it for all the computers on all new Star Trek projects!

 

Personally, I want her voice for Siri as well, at least on my own devices. But good luck with the rights to that.

If a petition starts to circulate on the internet, count me in. Didn't Snoop Dogg license his voice for some pre-Siri GPS a while back? Wait, can we get Snoop Dogg to be the voice of the Star Trek computers now? Count me in for that petition, too.

 

T-800 speaking to the T-1000 on the phone using young John Connor's voice:

 

"How's Wolfie?" T-1000 (in his mother's voice) "Wolfie's just fine dear, when are you coming home?"

 

T-800 hangs up and says to John: "Your mother is dead."

 

They not only can deep fake us, they can deep fake each other.

 

Other than that fun thought, I think this is pretty cool

"WHAT IS THE DOG'S NAME."

 

Samuel B. Lupowitz

Musician. Songwriter. Food Enthusiast. Bad Pun Aficionado.


Tell your friend to drop by the Language Resource Center and say hi!

 

What do you do there? My uncle was a prof there until about three years ago. I don't think he minds Chomsky even a little.

I handle the media (digital and otherwise) and distance learning equipment here, and manage the small recording studio. My background in literary criticism does occasionally come in handy, I say while keeping out of the Chomsky debate for the purposes of this thread. :wink:

 

I will say, I can imagine this voice cloning technology coming in handy for when our weekly podcast guests ignore our e-mails and don't schedule a time to come into the studio!

 

Samuel B. Lupowitz

Musician. Songwriter. Food Enthusiast. Bad Pun Aficionado.


So, I can't believe anything I read, only half of what I see and now some clown can send me an offer of marriage in the voice of Jack Nicholson. Wow, that tech sure curdled fast! :laugh:

 "I want to be an intellectual, but I don't have the brainpower.
  The absent-mindedness, I've got that licked."
        ~ John Cleese

