
Episode 15: Speech Synthesis From Neural Signals

S1 E15 · CogNation
10 Plays · 5 years ago

Joe and Rolf discuss recent research finding that recordings from the brain can be used to reconstruct the speech that is being thought about. Getting into the prospects of mind-reading and other futuristic possibilities, they discuss some of the limitations of research in the area and what makes progress so difficult.

Source material: Speech Synthesis from Neural Decoding of Spoken Sentences by G. Anumanchipalli et al. (2019)
YouTube video of the model

Transcript

Introduction to Brain-Computer Interface Technology

00:00:06
Speaker
Welcome to the show. Today's episode, we're going to be talking about some interesting new technology or new applications of technology for recording from brains of people.
00:00:22
Speaker
and trying to use those recordings to help people speak without actually speaking. So recording from the brain and producing speech, that's the general topic for today's episode.

Media Hype and Public Perception

00:00:37
Speaker
And it comes from an article that has been receiving a decent amount of attention in the press. And so we're going to talk about the article and also try to get behind the scenes a little bit.
00:00:50
Speaker
to what's really going on here with this technology and with this research. Very cool stuff, but I think one of those places where it might be slightly overhyped. But before we get into that, I do want to make a plug for the pod and say, if you enjoy what you're listening to, please rate and review us on Apple Podcasts. If you want to give us feedback, we'd love to hear your feedback. If there's anything you want to hear about or
00:01:15
Speaker
Things you'd like to hear done differently, or what have you, or if you just have thoughts on the topics, especially if you have thoughts on the topics, we'd love to hear from you. You can go to our Facebook page at CogNation on Facebook.
00:01:28
Speaker
So the paper for today, so let me just describe it a little bit

Research Paper Overview and Goals

00:01:33
Speaker
more. So this is one that just came out in Nature, which is one of the highest impact journals there is. And like Joe said, it's been appearing in some press recently. The name of it is "Speech Synthesis from Neural Decoding of Spoken Sentences," and it's by Gopala Anumanchipalli,
00:01:55
Speaker
Josh Chartier, and Edward F. Chang. It's Edward Chang's lab in San Francisco. Anumanchipalli is a postdoc and Josh Chartier is a graduate student. Right, so what is this paper about? This paper takes a significant amount of unpacking, I think, because there are a lot of complex things going on in it. So the bottom line, again, is they're producing a form of speech from neural recordings.
00:02:24
Speaker
So the end output of this is a sound that comes out and the input into it is brain recordings. I think it might help actually to just read the abstract because even though it's got a lot of jargon in it, this might be a good way to kind of introduce the topic.
00:02:43
Speaker
Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement and then transformed these representations into speech acoustics.
00:03:09
Speaker
In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferable across participants.
00:03:32
Speaker
Furthermore, the decoder could synthesize speech when a participant silently mimed sentences. OK, again, that's part from the abstract in the paper. So let's try to unpack that and make some sense out of it and see what we're really getting at.

Methodology and Participant Details

00:03:47
Speaker
Yeah, I think these Nature articles are funny because they actually start with the results and discussion and then go into the methods. But I think it would help everyone, including ourselves, to start maybe by talking about the methods a little bit.
00:04:01
Speaker
Yeah, and I always like, when I'm looking through a paper, to sort of visualize what a person is actually doing in the experiment, and sort of step by step what's happening. Yeah, absolutely. And so I think first, who are the people who are being recorded, and then, on the side of listening as well. So I think those are the two sets of people here: the people whose brain responses are being recorded, and then those who are listening to the reconstructed speech.
00:04:31
Speaker
Okay, so let's start out with the people whose brains were being recorded. So who are we looking at here? These are five participants who all have electrode arrays, electrocorticography (ECoG) arrays, implanted in their brains, basically because they have epilepsy.
00:04:58
Speaker
and they have some pretty severe, profound forms of epilepsy, and they're going to be receiving some treatment, and there's some medical reason why they need to be having parts of their brain recorded from.
00:05:15
Speaker
Basically, in epilepsy patients, you want to get a clear sense of where the functional parts of the brain are still intact and where there's some area with damage, so that, if you're excising some tissue from the brain, you don't take anything that's fully functional. So you want to be very precise about how you localize any possible surgery to the brain, and maximize how much you're stopping seizures afterwards.
00:05:44
Speaker
That's right. You want to find the parts of the brain that are potentially causing these seizure events and avoid parts of the brain that are, for example, involved in producing speech, which is the part that we're particularly interested in for this paper.
00:06:02
Speaker
A side effect of doing this, these epilepsy studies, is that everybody wants to try and get an experiment in on them, because it's such high resolution and high precision brain recording. So they have a number of electrodes directly implanted into their brain. This will give you a much better signal than, say, fMRI or other sorts of brain scanning technology.
00:06:29
Speaker
Right. You can't get a better source of brain activity in awake behaving humans than this in the sense that these electrodes are implanted subdurally. So they're right on the surface of the brain, not on the skull, but right there on the brain. And they're pretty large arrays as well.
00:06:52
Speaker
Yep, and picking up real time information exactly as it's happening. And this kind of stuff gets used. I mean, you can insert these things in monkeys, but when you're trying to understand something like speech and language comprehension and production, it's not going to help to do it on animals that don't have the kind of language capabilities that we do. That's right. Exactly. And, you know, for these participants, these arrays that they're recording from
00:07:21
Speaker
are all on the left side of the brain. I believe that's just how they selected participants, based on the fact that they had electrode arrays implanted around some left hemisphere language areas. That's right. Yeah, exactly. So these are all people who have electrode arrays on the left hemisphere, around areas that are important for speech, especially speech production.
00:07:52
Speaker
But yeah, some of the areas that they record from, and this is part of unpacking this article too, there are a lot of abbreviations and jargon and it gets a little deep in here, but
00:08:05
Speaker
Some of the places that they recorded from are ventral sensorimotor cortex, superior temporal gyrus, and inferior frontal gyrus. And these are intended to be areas that are either speech processing or speech production areas in the brain, so just before the speech motor commands go out to the actual vocal tract, the tongue and lips and all those areas.
00:08:35
Speaker
Yeah, exactly. And the different participants have different sized arrays, in slightly different locations on the brain as well. So, for example, participant one has a very large array, versus participant five, who has an array maybe half that size.
00:08:54
Speaker
And I guess you just take what you can get. You work with what you've got. Yeah, exactly. So, you know, I think that has some consequences as we look at the performance of the models for the different participants, where participant one has very good results and participant five has not as good results, and then there are sort of intermediate results from the other three participants. Yeah. And I guess one
00:09:21
Speaker
feature of this too is, I mean, it is a very low number of subjects, you might think, for an experiment in general, but it's a pretty rare thing to get participants who have electrodes in these areas that are just ready for an experiment. I mean, one thing that you might be concerned about is individual differences, and whether or not this is representative of the population at large if they're patients who have some sort of neurological disorder already.

Challenges and Limitations of Current Technology

00:09:50
Speaker
So that's a concern.
00:09:52
Speaker
But I think the damage is not so extensive that it would be a serious problem. Right. And I think for me, the bigger issue there, in terms of transferability of the results, is the way that this was reported on in the media. There was an article in the New York Times, I think, and some very prominent
00:10:15
Speaker
news outlets wrote about the results, and they were really sort of saying, you know, you can record from a person's brain and produce a voice that speaks for them, essentially. And you could give a voice to people who have lost the ability to speak for whatever reason. Yeah. And that is a potential use of a technology like this in the future. However,
00:10:45
Speaker
None of these participants had lost the ability to speak. In fact, they all spoke as part of the development of the models and these experiments. And that was a key part of the process, is them talking. So these are not people who have any problems with their speech production.
00:11:08
Speaker
And so thinking about how this would translate to someone who does have some neurological deficit that is causing it to be difficult for them to speak, it's an open question as to whether this would actually work in those situations. The most exciting thing and the reason why we want to talk about it in the first place is because it does point to some potential towards this direction. Right. So those are the people.
00:11:38
Speaker
And what they're doing is they're speaking, they're reading some words and some sentences. They're taken from some standardized corpuses that are used in speech production research.
00:11:59
Speaker
So they're reading from these corpuses. Some of what they're doing is just free reading.
00:12:11
Speaker
One participant read from Sleeping Beauty, Frog Prince, Hare and the Tortoise, the Princess and the Pea, and Alice in Wonderland, and so forth. Some of the participants just read a whole bunch of sentences from a database, and one was reading from fairy tales like this. Right, so they're reading all of this stuff,
00:12:36
Speaker
and their brain activity is being recorded from these different electrodes while their voice is being recorded.

Neural Decoding and Speech Synthesis Process

00:12:44
Speaker
With the goal then of taking the outputs of the neural activity, the electrical output of the neural activity recorded at these brain surface electrodes and decomposing that into some signals that can then be recombined
00:13:03
Speaker
into sound that basically becomes something that another person can hear and interpret and understand. Yeah. And I will note here that, for data nerds I guess, the sampling rate is really high. They can sample at about 3,000 hertz, which is 3,000 times a second. In contrast, you know,
00:13:30
Speaker
typical neurons are firing around 10, 20, 30 hertz. So they're recording at a really high rate, with much more resolution than you'd get from any other means of recording, I think. Yeah, I mean, that was amazing. Think about how much data that must be. Just a gob load of data. Yeah. And I will note here, this was an exciting part for me.
00:13:55
Speaker
The low frequency component was also extracted with a fifth order Butterworth bandpass filter. Oh yeah, I love the Butterworths. Apparently the Butterworths are just the way to filter these days. I would have gone with the fourth order Butterworth filter, but they did it, they went all the way to five. Yeah, ours go to five. Yeah.
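
For anyone curious what that kind of preprocessing might look like in practice, here is a minimal Python sketch using SciPy. The only details taken from the episode are the roughly 3 kHz sampling rate and the fifth-order Butterworth bandpass; the band edges, electrode count, and the high-gamma band are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(x, low_hz, high_hz, fs, order=5):
    """Zero-phase fifth-order Butterworth bandpass (a stand-in for the filter mentioned above)."""
    sos = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=-1)

fs = 3000                                  # ~3 kHz ECoG sampling rate discussed in the episode
t = np.arange(0, 2.0, 1.0 / fs)
ecog = np.random.randn(256, t.size)        # fake data: 256 electrodes x samples (electrode count is made up)

low_freq = bandpass(ecog, 1, 30, fs)       # illustrative "low frequency component"
high_gamma = bandpass(ecog, 70, 150, fs)   # high-gamma band often used in ECoG work (assumed here)
```
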
00:14:16
Speaker
No, yeah, I was struck by that, just how much data they're getting at such a high resolution. It's amazing. Yeah, a massive amount of information, and then they have to
00:14:29
Speaker
run this through all kinds of filters and stuff like that. And one of the main points of this article is that they're using what they call articulatory kinematics, which is the movement of different parts of your vocal apparatus. I like how they say this. What they say is, normally
00:14:57
Speaker
they would measure this per person by sticking some sensors on different parts of your mouth. So you put some on your lips, some on your tongue, some on your teeth, I think, and then have you talk so you can see how those parts move.
00:15:14
Speaker
For this experiment, they just used a computerized model of this. They didn't actually record from individuals how their mouths were moving; they just used a general mathematical model that had been constructed somewhere else. Right, exactly. The key thing here is this intermediate step. You can imagine the way that I would have thought about doing it, which obviously would not have worked,
00:15:42
Speaker
But my naive approach to thinking about this would be to say, let's take this output from the brain and look at what people said and basically try to find a model that can be directly derived from just the neural output. Right, exactly. So somehow the words themselves are represented in the neural output, which is not at all what they're doing.
00:16:11
Speaker
like, you know, decoding words per se, what they're actually doing is they're decoding the motor movements of the mouth and the vocal tract. And they're using those decoded motor movements of the vocal tract
00:16:33
Speaker
to infer what sounds would be produced by those motor movements, which is cool. I must say that is pretty cool. It's pretty, pretty cool. I mean, it takes advantage of
00:16:46
Speaker
one of the main ways that, you know, these speech synthesis algorithms are working now. So speech synthesis algorithms now that produce speech, you know, there's all kinds of different ways to do it. I mean, some of the better sounding approaches actually just mostly take recorded speech and then filter it in different ways and combine it in different ways.
00:17:08
Speaker
But this is like purely synthetic speech. So this is not just chopping up words and putting them together. This is really creating speech from whole cloth.
00:17:22
Speaker
So it's like you've got a model of a person, with a tongue and air going through and all of the noise making apparatus, and as you manipulate that model, you can make words from it just like a person would make words from it. That's right, exactly, exactly. And so, you know,
00:17:45
Speaker
That is the general approach. And they basically then have two separate deep learning networks. The first one takes the neural output and creates these kinematics, so how the mouth would move around. Right. And then the second part takes those kinematics and basically produces
00:18:14
Speaker
something that is essentially features of the sound waveform. And those features are then reconstructed into actual sound. OK, so in a way, it's like making speech more intelligible by highlighting the kinds of features that normal human speaking would make. Yep.
00:18:42
Speaker
Yeah, exactly, exactly. So connecting the dots between: if there's this type of movement in the mouth, it would produce this kind of sound. And so we get more of the kind of regularities that you'd get from the way that humans make sound, versus the way that a sound processor or a computer would make sound. Right. Yeah, exactly.
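
To make the two-stage idea concrete, here is a minimal PyTorch sketch of the shape of such a pipeline: one recurrent network from neural features to articulator trajectories, and a second from those trajectories to acoustic features. The layer sizes, feature counts, and choice of bidirectional LSTMs are illustrative assumptions, not the paper's actual architecture or dimensions.

```python
import torch
import torch.nn as nn

class NeuralToKinematics(nn.Module):
    """Stage 1 (sketch): recurrent net mapping ECoG features to articulator trajectories."""
    def __init__(self, n_electrodes=256, n_articulators=33, hidden=128):  # placeholder sizes
        super().__init__()
        self.rnn = nn.LSTM(n_electrodes, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_articulators)

    def forward(self, ecog):                   # ecog: (batch, time, electrodes)
        h, _ = self.rnn(ecog)
        return self.out(h)                     # (batch, time, articulator features)

class KinematicsToAcoustics(nn.Module):
    """Stage 2 (sketch): recurrent net mapping articulator trajectories to acoustic
    features that a vocoder could turn into a waveform."""
    def __init__(self, n_articulators=33, n_acoustic=32, hidden=128):     # placeholder sizes
        super().__init__()
        self.rnn = nn.LSTM(n_articulators, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_acoustic)

    def forward(self, kinematics):
        h, _ = self.rnn(kinematics)
        return self.out(h)

# Chained the way the episode describes: brain -> mouth movements -> sound features.
stage1, stage2 = NeuralToKinematics(), KinematicsToAcoustics()
ecog = torch.randn(1, 200, 256)                # fake recording: 1 sentence, 200 time steps
acoustic_features = stage2(stage1(ecog))       # would then be synthesized into audio
```
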
00:19:05
Speaker
And so the net result of all this is that the algorithm is essentially learning from how similar or different the sound features it produces from its model are from what is actually in the recording. Well, here, let me play an example of the speech synthesis
00:19:32
Speaker
saying "bright sunshine shimmers on the ocean," and then an actual human saying "bright sunshine shimmers on the ocean." Bright sunshine shimmers on the ocean. Bright sunshine shimmers on the ocean. Yeah, so as you can hear from that, it definitely produces some speech-like sounds, and you can understand what's being said.
00:20:02
Speaker
But it's not super clear. And this, of course, is going to be one of the better examples; they're going to share with us the better examples. There were a lot of examples where people were not able to transcribe this correctly. So that's the other piece to this experiment.

Evaluation of Speech Synthesis Results

00:20:19
Speaker
They took the produced speech from this algorithm and they put it up on Mechanical Turk. So this is a service from Amazon that you can basically have people
00:20:32
Speaker
you know, fill out forms or participate in short experiments and things like that, and you pay them a little bit of money. It's a great way to sort of crowdsource experiments of this type. So they got a bunch of people to listen to this and basically
00:20:51
Speaker
transcribe what they heard, but they couldn't just transcribe any old words. They were given a choice of, you know, either 50 or 25 or 10 words to choose from. And they had to basically say of those, which were the words that they heard. And there was a case where there was like one word at a time. And there was a case where there was a sentence at a time. Yeah.
00:21:12
Speaker
And so it was what they called a closed vocabulary test. In other words, it's not like it could be anything you might hear; you have some sense of what the choices are. Even in that case, it wasn't close to perfect. I mean, at best, you know, what were the exact numbers there? It was different for different passages, I think.
00:21:40
Speaker
Some were done perfectly, but some were like 50% and some were 70%. So it was different depending on different passages, I guess. And I think this closed vocabulary bit is meaningful too, because there's a lot of ambiguity in speech. And if you're limiting what you know is going to be in that speech to 50 or
00:22:01
Speaker
you know, a hundred words, it's a lot easier to disambiguate them. Think of the Yanny versus Laurel thing, where you've got that ambiguous speech and it sounds like one or the other. I think for a lot of these passages, when they're a little bit difficult to discriminate, if you didn't have that closed set of words, it would be pretty tricky to figure out exactly what the sentences were. Right, exactly. So, you know, this also speaks to the quality of the decoding, right?
00:22:30
Speaker
the quality of the decoding is pretty good, but not great. I think that's sort of the takeaway. It's definitely better than random. That's the cool part. You can hear it when you're listening to it, especially when you've been told what the sound is. You can definitely hear, and it sounds like speech.
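
As a toy illustration of how a closed-vocabulary listening test like the one described above might be scored, here is a short Python sketch. The word pool, the sentences, and the simple word-by-word matching are all invented for illustration; the paper's actual evaluation used proper transcription metrics such as word error rate.

```python
# Toy scoring for a closed-vocabulary listening test: listeners pick each word they hear
# from a fixed pool, and we count how many words of the true sentence they got right.
def closed_vocab_accuracy(true_sentence, transcription):
    truth = true_sentence.lower().split()
    picked = transcription.lower().split()
    hits = sum(1 for t, p in zip(truth, picked) if t == p)
    return hits / len(truth)

# Hypothetical 8-word pool standing in for the 25- or 50-word pools described above.
pool = ["bright", "sunshine", "shimmers", "on", "the", "ocean", "ship", "sails"]

truth = "bright sunshine shimmers on the ocean"
heard = "bright sunshine shimmers on the ship"   # listener confuses one pool word for another
assert all(w in pool for w in (truth + " " + heard).split())
print(closed_vocab_accuracy(truth, heard))       # 0.833...
```
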
00:22:50
Speaker
Yes, exactly. And this has all been produced by an algorithm that was only listening to neurons. It's just like taking recordings from the brain and producing speech. Kind of cool. That's what it's doing. That is what it's doing. And one of the things that they pointed to, they were really emphasizing this point of using this articulatory intermediary. So the idea of really going after
00:23:19
Speaker
you know, the motor act of speaking as being like the way to do this. And one of the things they wanted to control for was that, well, okay, when these guys are actually speaking, they're also hearing what they're speaking. So you could imagine that some of the neural signal could be related to this feedback, right? This feedback exactly from the sounds that you're hearing. They're hearing what they're saying, and that could be part of the signal.
00:23:46
Speaker
Right. So then they had one of the participants, participant one, whose data seemed to be the best, mime the words. So rather than actually speak them out loud, they'd move their lips and mouth as though they were speaking, but not actually speak. In that case, the results were pretty similar, not quite as good, but similar. That's right. Exactly.
00:24:17
Speaker
Which I mean to me kind of again points up a little bit the problem with the argument that they're making that this is like very nearly an assistive technology for people who have lost the ability to speak, right?

Applicability for Speech Loss Patients

00:24:31
Speaker
Because if you've lost the ability to speak, how much motor control do you have to begin with?
00:24:38
Speaker
If you don't have that motor control, how similar is your neural representation of those movements? You can imagine yourself speaking, which is basically what you'd have to do here. You'd imagine yourself speaking, but those representations are going to be quite different. Well, that's a question. That's a really good question. It's something that maybe almost used to be a philosophical question. It's certainly a linguistic question.
00:25:07
Speaker
how closely is the way that we think mapped onto the way that we speak? And there's a paper that these same researchers had done fairly recently, or some, the same lab group anyway, and I'll read a quick description of this too. I think the perspective they're taking is that
00:25:28
Speaker
They're revealing that the brain's representation of language is closely related to the way that speech is made. The description was: the new research reveals that the brain's speech centers are organized more according to the physical needs of the vocal tract as it produces speech than by how the speech sounds. Linguists divide speech into abstract units of sound called phonemes,
00:25:55
Speaker
but in reality, your mouth forms the sound differently in these two words to prepare for the different vowels that follow. This physical distinction appears more important to the brain region responsible for producing speech than the theoretical sameness of the phoneme. In other words, this is an enacted cognition viewpoint that language is really all about the production of language and that our representations of language are about the production of language.
00:26:24
Speaker
So they would make this argument. I don't know that there's any evidence necessarily shown here, but I think that's the response they would have. Yeah, exactly. I mean, I think it's the distinction between internal dialogue, miming of speech, and actual speech production. Yes. I feel like those are three very different things.
00:26:48
Speaker
I think that they certainly are different things. I guess they may be just pushing the argument that there's a lot of similarity in the kinds of representations that they would have. Yeah, no, absolutely. No, there's no question. The same brain areas are involved, but I'm just thinking from the perspective of training a model and if you're training a model on.
00:27:09
Speaker
speech vocalizations, you know, that's based on the articulatory system. Yeah. And you're doing that from people who are speakers and trying to apply that to someone who's lost the ability to speak through paralysis or, you know, muscle weakening, etc. It's totally unclear to me that the representational structure of that would be similar enough to transfer the model.
00:27:37
Speaker
Well, certainly in some older models of working memory, there's the idea that something gets rehearsed in memory by essentially subvocalizing it, that when you're trying to remember, say, a phone number or something, it's the same process as saying that phone number out loud and repeating it out loud. It's just that you're suppressing the actual speech of it. So I mean, there's precedents of this as an idea of what
00:28:05
Speaker
this internal language of thought is that it's just like speaking except without the speaking part. Right. No, absolutely. But then again, clearly when you're thinking, you're associating at a different pace. Right. The point that I'm thinking about here is there's probably some feedback that you're getting from the muscles because the way that motor control works is that constant
00:28:33
Speaker
It's like a loop. Brain sends a signal to the muscle to move and it has the next action that it's queuing up in anticipation that that happens the way that it's expecting it to happen. But if it doesn't, there's feedback. That error signal is a key part of the whole representation. I'm sure it's part of this model. And if you're not actually moving,
00:28:57
Speaker
then you don't have that part of the model. That part of the model is missing. So, for example, what they didn't do is they didn't have a control condition where they had someone just think about the words. I mean, this is an experiment we can all do in our minds right now. How different is it to make the motion in your mouth of speaking the word and actually speaking the word? I guess when you're just thinking about the word, how realistic is your imagination of
00:29:25
Speaker
saying that word. I mean, you can think about it very abstractly. You can sort of have something on the fringe of awareness that you may be thinking about, or you could be visualizing yourself actually mouthing the word and saying it. So it could be a stronger trace or it could be a weaker trace, but I think clearly our thoughts are not, they don't have the same strength as an actually vocalized word. In the same way, I guess that imagining moving your
00:29:53
Speaker
arm is different from actually moving your arm. I think this speaks a little bit to one of the questions we were talking about a few weeks back, when we were talking about the idea of having this brain mesh where everyone's brains are connected, the Elon Musk wizard hat thing. Yeah, exactly. Because if you think about what you're transmitting, you could imagine filtering it in all different ways. You could filter it such that you are only passing
00:30:23
Speaker
the internal dialogue, or you could filter in such a way that you're only passing the internal representation of speaking. So that would be the analogy here: if you think about speaking to someone, you could pass that information, and then potentially that information could be decoded. Wait, so is this a way to censor yourself? So if your minds are connected, you can still
00:30:52
Speaker
kind of hold some things off.

Future Communication and Privacy Concerns

00:30:54
Speaker
Yeah, it's like an authentication code, right? It's like, all right, now I'm communicating with Rolf, you know, I'm gonna speak to Rolf with my mind. Well, I don't think you would want my raw, unfiltered thoughts, would you? I mean, no. That's where it's going to be hard to make sense out of it, because they're so subjective. Right, exactly. That's the thing. You don't know
00:31:18
Speaker
What if it doesn't have meaning for you as an outside person necessarily? Or it certainly doesn't have the same meaning for you as an outside person as the person who's experiencing it. That's the hypothesis, I guess. Whereas speech has that explicit code. There's a meaning for the speaker that maps in some really tight way to meaning for the listener.
00:31:45
Speaker
Yeah, yeah, and I'm going to go back and say again that I think eventually in mind-brain interfaces for interpreting language or connecting people through thought, I'm going to say again that I think the best way for communicating between different minds is actual language, the language that we have, and it's kind of optimally tuned already for communicating thoughts between us, and we wouldn't get anything more by directly transferring less objective information between people.
00:32:15
Speaker
Well, that's my bold prediction that language works pretty well. Well, the language is great. No debate there, but something that just popped into my head was what if you had a system, uh, I'm thinking about like a virtual reality type system where you could speak to one another, but then you had also another layer on top. So for example, let's say that you, you know, we each have headsets on.
00:32:40
Speaker
and the visual screen, and there's music and lighting that corresponds to the other person's emotional state. If someone gets a positive emotion from something you say, maybe the lights go up and the music tempo goes up and something like this, right? And you can imagine layering on this empathic level.
00:33:07
Speaker
Right, sort of like emojis are. Right, exactly, visual emojis. Visual emojis, and you could be representing it through music or light. I could see when Joe is starting to get annoyed with me. Yeah, so exactly, this is the kind of thing where, I mean, maybe that would be just annoying for
00:33:29
Speaker
for us, but maybe for someone who has problems interpreting others' emotions. I mean, we're on a podcast, so we're just listening to each other, so we can get some cues to what the other is feeling, but not as much as if we were seeing each other. Right, because we'd have all those micro-expressions and all the verbal cues and maybe hands waving and stuff like that. Right, exactly. And some people have problems interpreting those cues.
00:33:58
Speaker
And there's a whole spectrum of people in terms of how well they're able to interpret those cues. At some end of the spectrum, you have people who are autistic who have very difficult times understanding what the other person is feeling based on their micro-expressions. And you can imagine a system that would represent that in a different way.
00:34:22
Speaker
Yeah, that's one use case, I guess. Let's just say there's potentially other use cases, but I agree with that idea. It would probably take some time to adapt to it and get used to it.
00:34:34
Speaker
Right, you'd have to learn the language, right? You'd have to learn the representational schema. And it might be different from person to person that you're communicating with too. Right, right, right, exactly. And then I guess that sort of raises a question of like how useful that is in the sense of like, would that person who's receiving those cues, how well would they be able to use that information?
00:34:58
Speaker
I think that becomes a question a little bit of where the deficit is, right? Is it a deficit of interpretation, of the representation of the other's emotional state? Or is it conceptually appreciating what to do with that? Yeah, I don't know. I don't know either. That's a question of utility. But either way, there's something you could do there. There's a tool that could potentially be used.
00:35:25
Speaker
Yeah, I like that idea. All right. Let's patent that. Let's patent that. OK. Yeah. All right. We just did. I think we just did. Yeah, we just did. Actually, what we did is the opposite. We just disclosed that to the whole world, so we lost all ability to patent it. Ah, crap. Maybe let's take a break and then come back and we can talk about some of the implications.
00:35:58
Speaker
So we have to do some far future speculation on this stuff. What Robopocalypse stuff is in here? In this situation, it's a little bit specific to someone who's got an electrode array implanted on the surface of their brain. So if you don't want that, or you don't need that, the implications for you are not so direct right away. But you could imagine a future where
00:36:28
Speaker
For example, everyone has one of these electrode arrays implanted in their skulls. You know, maybe that's a thing, when the Elon Musk thing comes through. Right, exactly. I mean, that's kind of what he's thinking, right? That there are like implantables. Yeah. You just inject them in. Right. You inject them in, they form some sort of net over the brain, and you can read and write. That will be ready next week. And, uh,
00:36:55
Speaker
And they'll be looking for volunteers for the beta on that one. But, yeah, even without that, you can imagine other systems that maybe are a little less invasive, or however it works in the future. It potentially could be the case that you could be opted into this program, rather than opting in yourself.
00:37:16
Speaker
And then the system would be able to read your thoughts and, Minority Report style, you get arrested for a crime you haven't committed yet, or may not ever actually commit. Yeah. And I think even just having your thoughts read in general is pretty invasive. Right. I mean, we already have Twitter. That's already too much. Yeah. How much more do we need to hear the unfiltered
00:37:46
Speaker
random thoughts of every person? Good point, right? That's already plenty unfiltered.
00:37:51
Speaker
Yeah, it would be just the absolute next level and completely unfiltered. I was listening to this podcast about the Chinese system of recording, just video cameras everywhere, surveilling the whole population, trying to get everyone in the whole population under total surveillance. And that's already pretty terrifying. And that exists today, which is really quite an amazing thing if you think about it, how much bandwidth and
00:38:19
Speaker
just compute power they're using to make that happen. There's a bit of a fundamental clash there too, because if what you really want is privacy, and we're running into all these privacy issues with Facebook and Twitter and all that stuff, if what you really care about is an individual's freedom to do as they want, then this is as invasive as you get moving forward. Right, exactly. So if your belief is that you should have total control over the population,
00:38:49
Speaker
by absolute surveillance. Big brother. Yeah. What would stop you from wanting to get inside someone's head? There's nothing to stop that. So if you can decode someone's speech before they actually say something, and you can literally hear what they're thinking, that's the output of this system, right?
00:39:15
Speaker
You know, if it works, I want to think about that now; think about that for a second. Back to what we were talking about before, the articulatory versus the representational versions of this model. Right. So if this is a tool that someone's using because they want to regain the ability to speak, how much control are they going to have over whether it's on or off? That's right. They would have that.
00:39:42
Speaker
You want to be able to express what's on your mind, but not with no filter, right? You don't want the ability to regain speech to come with letting everything out. Yeah, you want to build in the ability to be quiet. It has to be on one side of the sort of internal filter that you have. Right, exactly. I mean, that's actually, in some ways, another big advantage to the articulatory process model, right?
00:40:10
Speaker
Because, I mean, interestingly, now that I think about it, they made a big point, and this is probably not why they made this point, but they made a big point in the paper about the fact that they can decode silence as well. So I think that's super interesting, right? That if you're going to encode and decode speech, it's super important to be able to encode and decode not speaking.

Speech Technology: Human Vocalization vs. Direct Neural Signals

00:40:33
Speaker
Otherwise you end up with this unfiltered stream of all this stuff that you're thinking about, but intentionally not saying.
00:40:38
Speaker
Right. You just don't have any desire to have it be spoken. One other thing that this paper made me think a little bit about: so we talk about biomimetic approaches to all kinds of things. Biomimetic is just imitation of biological mechanisms for doing things. Like, you know, biomimetic flight would be like looking at birds, seeing how they flap their wings, and
00:41:07
Speaker
figuring out how that works aerodynamically, and maybe you can find a better way to do it than engineers have done it before. One of the things that I thought about here is that all of our vocal tract, all of this apparatus that we have to make speech, is essentially kind of a wasted way to do things, because you can make the same sound through a much more compact electronic source, just a speaker and
00:41:36
Speaker
some kind of production unit. And you don't need to articulate things using all of the ways of shaping sounds that we have. You can do it all electronically. So what use is knowing all of this vocalization kinematics to making a robot? Well, I think the argument would be that the advantage is that the person who's listening
00:42:04
Speaker
to the robot is a person. And so they want to hear a speech that is quote unquote natural sounding. And then having this understanding of how this speech is produced from the articulatory system could be helpful in recreating these nuances. You could conceivably, I guess another related question is, could you learn to understand that
00:42:32
Speaker
that kind of speech that isn't produced in this sort of way. In other words, could people do just as well on the speech that was directly translated from the neural signal as they could on the one that went through this articulatory kinematic process? Could you learn to do just as well on that? And if so, I mean, maybe it's just a bias that we carry around with us that it's easier to understand things that have these regularities, but maybe we could learn it just as well the other way.
00:43:02
Speaker
Well, I think there's different speech models, basically, that you could produce that would have different results. The vocoder approach, which is basically a much more algorithmic approach, doesn't sound very natural. And it's harder to understand certain words. I mean, the way that you would really want to do it is to make it sound really good. And I think this is the way that these systems largely work now, is you have a lot of voices that have essentially been recorded.
00:43:30
Speaker
saying a lot of different things. And then you're just trying to get the right units of manipulation. So is it at the level of the word? Is it at the level of the phoneme? One word that's followed by another word that starts with a certain plosive. Depending on how that works, if you have the right combination in your database of already articulated speech sounds, then you just basically pull those units up and stitch them together.
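
As a toy illustration of that stitching idea, here is a short Python sketch of concatenative synthesis. The unit database, unit names, and the simple crossfade are invented for illustration; real unit-selection systems choose units by phonetic and prosodic context and do much more careful joining.

```python
import numpy as np

# Hypothetical database: unit name -> pre-recorded waveform (random noise stands in for audio).
unit_db = {
    "bright": np.random.randn(4800),
    "sun": np.random.randn(3200),
    "shine": np.random.randn(4000),
}

def synthesize(units, crossfade=160):
    """Pull pre-recorded units out of the database and stitch them together with a short crossfade."""
    out = unit_db[units[0]].copy()
    ramp = np.linspace(0, 1, crossfade)
    for name in units[1:]:
        nxt = unit_db[name].copy()
        # Smooth the join: fade out the end of the current audio while fading in the next unit.
        out[-crossfade:] = out[-crossfade:] * (1 - ramp) + nxt[:crossfade] * ramp
        out = np.concatenate([out, nxt[crossfade:]])
    return out

waveform = synthesize(["bright", "sun", "shine"])
```
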
00:44:00
Speaker
Well, that's the way you get really good sounding artificial speech now. It's just basically stitching together these pre-rendered units in the right pattern, not doing this articulatory business at all. Yeah. Yeah. Well, I mean, I guess that's the way to constrain things. I guess I just sort of wonder how necessary is it when we're thinking about, uh, you know, a million years in the future as we're replaced by robots, how necessary is it that we have
00:44:26
Speaker
vocal apparatus that basically is a hose that shapes escaping air in a certain sort of way, rather than just producing sound through a speaker. Oh, no, I mean, yeah, you could do a lot. If instead of using your mouth to speak, you used like a speaker and an amp, you'd be a much better singer. I mean, yeah, you could do all kinds of cool stuff with it, right? Yeah, you could.
00:44:56
Speaker
You could do all the pitch correction and... Completely. Everyone would have a full vocal range. Yep. And you could even have a full backing band, like right there in your... Yeah, in your head. Yes. That would be cool. Maybe that's the future. That's the future. Everyone's just walking around with a whole orchestra.

Conclusion: Achievements and Overhyped Expectations

00:45:19
Speaker
Music. I like that. I mean, I'm sure somebody must be doing research on how you make music with these recordings. Well, let's see, do we have any other important thoughts we need to get out about this paper? No, I mean, I think the only thing I would say, just to sort of wrap it up, is that, first of all, it's super cool. I mean, it's amazing,
00:45:46
Speaker
the different pieces they had to pull together to make this work. I mean, they had to find these patients who had these arrays, they had to have them speak, read these books and these corpuses, and then build
00:46:00
Speaker
some pretty sophisticated computational models here. Yeah, two super complex deep learning algorithms, as well as a really strong understanding of speech vocalization feature sets and these articulatory models.
00:46:21
Speaker
And then reconstruct them. So, I mean, despite any small complaints that we have about the future of this, this is amazing work and really cool stuff. Yeah, Gopala definitely gets a PhD for this one, if that person does not already have one. Well, I think he's a postdoc. I think he's a postdoc. Super cool stuff, super cool stuff, but also, I think, overhyped nevertheless. Right.
00:46:49
Speaker
I mean, in the popular press representation, I heard someone, one of these authors, saying, you know, now you can give a voice to someone who doesn't have a voice. Well, that's not really... we're not there yet. It's a good idea. But for all the reasons we discussed, we're not there yet.