Stephen Wilson 0:06 Welcome to Episode 27 of the Language Neuroscience Podcast. I'm Stephen Wilson, and I'm a neuroscientist at the University of Queensland in Brisbane, Australia. My guest today is Jean-Rémi King. Jean-Rémi is a CNRS researcher at École Normale Supérieure, currently detached to Meta AI, where he leads the Brain and AI team. He's been doing some really groundbreaking work using large language models and deep learning to investigate the neural basis of language. Today, he joins me from Marseille, France, and we're going to talk about three recent papers from his lab. Okay, let's get to it. Hi JR. How are you today? Jean-Rémi King 0:42 Hi, I'm good. Thank you very much for having me. Stephen Wilson 0:45 Well, thank you very much for joining me, and it's morning time in Paris, right? Jean-Rémi King 0:51 Right. I'm actually in Marseille. I live in Marseille although my work is in Paris. Stephen Wilson 0:56 Oh, really? Do you? Jean-Rémi King 0:57 Yeah. Stephen Wilson 0:57 Oh, okay. So that'll be very convenient for you to come to SNL later this year, then, won't it? Jean-Rémi King 1:02 Absolutely. Stephen Wilson 1:03 Okay. Jean-Rémi King 1:04 I won't need a hotel. Stephen Wilson 1:06 All right. So, are we going to enjoy visiting your town? Jean-Rémi King 1:12 Sure, yeah. I think it's a beautiful town. It's by the sea. The city is very nice, there are a lot of museums, a lot of scientific groups that are definitely related to language and the neurobiology of language. So I think it's a great city in which to do SNL. Stephen Wilson 1:30 Yeah, I'm really excited. I'm really looking forward to it. So, I usually start the podcast by asking people about their childhood and background. But with you, I want to ask something else first, which is, we have a mutual colleague, Anna Kasdan, and she's told me lots of stories about you, but my favorite one is that you installed Linux on her computer. (Laughter) She couldn't tolerate it. Is this true? Jean-Rémi King 1:55 I don't actually recall specifically, (Laughter) but yeah, that sounds completely possible. Yup. Stephen Wilson 2:01 Yeah, so you're a Linux guy, she said. But I know that you're doing the recording on a Mac right now. Jean-Rémi King 2:07 Yeah. So I'm currently working at Meta, where Linux was not an option. It was either Windows or Mac. So I had to go for Mac, which is based on a Unix system, so it's a bit easier to accommodate coming from a Linux background. Stephen Wilson 2:21 Okay, so you've sadly had to move away from it, huh? Jean-Rémi King 2:25 Absolutely. Stephen Wilson 2:26 Yeah. Okay, so then getting back to, how did you become the kind of scientist that you are? What were your childhood interests? Jean-Rémi King 2:36 Oh, wow! That's a big question. I don't entirely know. I think a lot of different factors are involved. I was originally interested in AI when I was quite young. And it started, I think, when I was playing with Legos. There was Lego Mindstorms at the time, which you could program, a very rudimentary sort of programming that you could do. But that made me interested in the topic and in programming, and I did a first internship in an AI lab in 2000, I think, if I recall correctly, and then I continued in this domain during my undergrad, where I did AI and cognitive science.
And around that time AI was not really, let's say, working, and so the advice from my professors at the time was to try to do something else, maybe something with a real future. And so…. Stephen Wilson 3:47 What years would that have been? Jean-Rémi King 3:48 It was around 2007, I think, at the end of my undergrad, and so I moved to computational neuroscience. I did a dual masters in Brain and Mind Sciences, between UCL and Paris, at École Normale Supérieure and UPMC. And then I continued in Paris, did a PhD in neuroimaging, trying to decode brain activity from healthy participants and from patients who suffer from disorders of consciousness. Stephen Wilson 3:48 Yeah. Jean-Rémi King 3:54 And after that I decided to do a gap year, and then I moved to language in New York, for a postdoc in David Poeppel's lab. And then I joined Meta. I got a position at École Normale as a CNRS researcher and now I'm detached to Meta AI, which is a fundamental AI research lab that Meta has. Stephen Wilson 4:54 Yeah, so most of our listeners will probably know Meta as the parent company of Facebook, and I don't know, I mean, I'll ask you about it, but I don't know whether Facebook or Meta has more rules on talking about things than a university would. So feel free to share what you can. But, you know, what is their long term interest? Do they have a long term interest in supporting basic research like this? Do they see it as being central to their future? Jean-Rémi King 5:23 I think Meta, like any big company, is very conscious of the potential of AI and the pressing necessity to be at the cutting edge of the research, because things are moving extremely quickly. And they gave, I think, a lot of opportunity to scientists like Geoff Hinton, in the case of Google, or Yann LeCun, in the case of Facebook and now Meta, to build the lab in a way which would work, and both of them in this particular case went for basically the principles of academia. So the general idea is to say that it's very difficult to know what's going to work, and it's even harder for, let's say, the hierarchy to know whether a researcher is doing something good or not. And so the best way to evaluate progress in research is basically to go through anonymous peer review and publication. And so I think they really managed to convince these big tech companies to follow these principles. And in this case, the long term future is very hard to know, and I think no one really knows how to position themselves. There are a lot of questions, it can be a case by case issue. But what's clear is that you need to have sort of the top researchers within your company to be able to compete and to develop the algorithms that will work tomorrow for their use case. But researchers want to work on general principles. So… Stephen Wilson 7:04 Yeah, because the fascinating stuff you're doing that we're going to talk about in a moment, relating these large language models to the brain, I mean, it's not immediately obvious how that gets built into a Facebook app, right? So they're willing to give you free rein on what you want to work on, and they'll see down the track what it evolves into? Is that kind of the philosophy? Jean-Rémi King 7:28 Yeah, absolutely.
I think the philosophy is to hire really good people and then to consider that they will make the choices that are the best ones for their field. There are some projects which are directly applicable, let's say, to products. In the case of computer vision, this is quite clear: if you want to filter hateful content and, let's say, pornography on Facebook, you need to have an algorithm that can recognize the content of images. And so those who work on fundamental research in the case of vision have a direct impact; even though they don't necessarily work with actual Facebook or Instagram content, the path is much clearer. And at the other end of the spectrum, there are researchers who are really sort of distant from any application, and the goal is really to try to understand the principles that allow a system to become able to learn much more efficiently. And I would be much more at that other end of the spectrum. Yes. Stephen Wilson 8:41 Yeah, that's really interesting. And do you interact much with other AI folks at Meta who are doing more applied things that have sort of nearer term applications? Jean-Rémi King 8:54 So, the lab is quite horizontal. We do have meetings together, we have regular occasions to discuss. We sometimes try to bridge projects. So for instance, with those who work on language models, we try to discuss what kind of architecture we think is more relevant to try to learn language at scale. And so this engages a conversation. And similarly in the case of vision, we have an ongoing project with a group that works on DINOv2, which is a self-supervised learning algorithm trained to recognize, let's say, structures from natural images without supervision, and we have an ongoing discussion on how we can use neuroscience to try to improve or evaluate these models, which can be very difficult to do. So we have interactions, but more, I think, at the level of ideas, and sometimes, yes, some coding projects that we share together. But generally speaking it's more an intellectual level of collaboration than a very, let's say, product-oriented collaboration. Stephen Wilson 10:17 Yeah, but I think that's a very important level of interaction too. And there's a history of this, right? Like Bell Labs in the US, you know, I think information theory came out of there, I think Shannon was working there, you know, and they definitely produced a lot of stuff that ended up being pretty core cognitive science. And just like what you're saying, it wasn't being done in support of something they were going to ship tomorrow. Jean-Rémi King 10:41 Absolutely. Yann LeCun actually worked at Bell Labs and I think his philosophy of how to organize research in the private sector is heavily influenced by this. Bell Labs, I think, had four or five Nobel Prizes before they closed down. So they really had a major impact, and the core organization was really to let researchers do whatever they wanted, and whatever they thought would be impactful. So I think, yeah, this is sort of what Yann LeCun managed to instill within this company. I think it was kind of the same for Google, and other companies made different choices; Amazon and Microsoft work slightly differently. Stephen Wilson 11:30 And how did you land this job? Did they come for you?
Or was there a posting and you were like, I'm gonna apply to that? Jean-Rémi King 11:37 Yeah, they reached out to me, actually, back in 2018. I was surprised. I mean, like you, if I understand your question correctly, I was wondering whether I had a direct utility or relevance for their goals. And really, I think the big argument that convinced me to join them was the fact that they were really working with an open source approach. So they were publishing papers, they were releasing code, releasing models. And I thought that this was a healthy path to create common good and to continue to do good research. And then once I joined the lab, I was very impressed by the level of the researchers there. It's really a top AI lab, the conversations are always extremely useful and with a lot of intuition. Their trials don't always work, let's say, and you learn a lot from those failed attempts. So yeah, this environment was very fulfilling, in a sense. Stephen Wilson 12:53 Yeah, that's great. It's so interesting to talk to you about this, because, you know, most of my guests are people in academia and that's the work environment that I'm familiar with. So it's kind of neat to hear about what it's like for you. So yeah, let's talk about some of these papers that we planned to talk about. These are a few recent papers that you've published with some of your students, including Charlotte Caucheteux. Is my pronunciation acceptable within the bounds of my accent? And yeah, we will talk first about a paper called 'Toward a realistic model of speech processing in the brain with self supervised learning', published in NeurIPS 2022 by Millet, probably Millet (different pronunciation), I'm guessing, should be French. Jean-Rémi King 13:49 Yeah, Juliette Millet. Stephen Wilson 13:50 And Charlotte Caucheteux and yourself as senior author. And we want to start with this one, because this is one of the first papers from your group in which you kind of establish these correspondences between large language models and neural activity, right? Jean-Rémi King 14:10 Yes. So, there were papers before this one that showed some similarities between deep nets and the brain. So perhaps I'm going to backtrack just a little bit. Stephen Wilson 14:23 Sure. Jean-Rémi King 14:24 So maybe just for the anecdote, when I was a student back in the day, and I think this idea continued to be true for a while, the notion of an artificial neural network was really considered to be metaphorical. It's like we say, okay, we speak about artificial neural networks, but this is just a loose analogy, this has nothing to do with what the brain does. These artificial neurons are just sort of computational units that were kind of inspired by neuroscience, but really, they didn't work in the same way. And I think this switched or pivoted radically in the field around 2014, when several labs, especially coming from vision, started to compare deep nets to brain activity. So yeah, the labs of Nikolaus Kriegeskorte, of Marcel van Gerven, of James DiCarlo, of Bertrand Thirion, pretty much all simultaneously compared the brain responses to images to the activations of AlexNet, which was one of the landmarks in computer vision models, and thereafter VGG-19, which is another computer vision model.
And what they showed is that, with some fancy linear algebra, either based on so-called RSA or on linear mapping, you can find similar types of activations in the brain and in the deep nets. So if you present an image to the algorithm, the algorithm sort of combines the pixels together and creates new activations in order to identify whether there is a cat or a dog in the image. And when you present the same image to participants, and you measure them with fMRI, or in the case of monkey electrophysiology you record the spiking activity, you can see that basically you can find biological neurons or voxels which respond similarly across different images to the artificial neurons in the deep nets. And there was a lot of friction at the time, but I think people started to understand that perhaps these deep nets, which were algorithms, may transform visual inputs, to some extent, in the same way as the brain. And so we should not think of those as just a metaphorical model, but perhaps we can actually start to think of them as useful models for the neuroscience of vision. And over the years, many other fields tried a similar idea in other domains: in the domain of spatial navigation for hippocampal place cells and grid cells, in the case of motor control, in the case of auditory processing, and in the case of language and speech. And so we fit, I think, into this general tendency of a systematic comparison between deep learning algorithms and the brain, to try to see whether indeed these algorithms generate representations, activate themselves, similarly to the brain, in response to the same sentences, in response to the same sounds. And so that was sort of the starting point. But one of the motivations behind the work was to insist on some potential differences. And in the case of language, one of the key differences that very quickly becomes obvious is that, first, language models work in the text domain. So the input is already sort of a word; it's not quite a word, it's a token, so you can think of this as a morpheme, really, a subword unit. And so that's the first difference. And the second difference is that they get trained with just a gigantic amount of data; if you train them with small amounts of data, they just perform extremely poorly. And so in this particular work that we've done with Juliette Millet and Charlotte Caucheteux, we were trying to test whether we could go towards a more biologically plausible architecture that's trained with the raw audio waveform, and with a sensible amount of data. And for this, we focused on an algorithm that was developed at Meta, actually, by Alexei Baevski, who used to be a colleague of mine, and his group. It's called Wav2Vec 2.0; it's an algorithm whose input is a waveform. And it tries to do two things: it tries to predict missing bits of sound, a bit similar to a language model where you try to predict the next bit given the context. But it also tries, and that's what makes the whole thing much more complicated, it also tries to learn what should be predicted in the first place. So you have this sort of dual goal in the algorithm: you need to learn to predict, and you need to learn what should be predicted. And it's this dual goal which is very hard to optimize. And so yeah, we thought, okay, maybe that's a plausible candidate, because now we can train an algorithm with a raw speech waveform without supervision.
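(For readers who want a concrete picture of the "fancy linear algebra" mentioned above, here is a minimal sketch of one of the two approaches, representational similarity analysis; the array shapes and random data are illustrative assumptions, not the pipeline of any of the papers discussed. The other approach, linear mapping, is sketched later where the voxel-wise regression is described.)

```python
# Minimal RSA sketch: compare a model layer to a brain region by
# correlating their representational dissimilarity matrices (RDMs).
# Shapes and data are illustrative stand-ins.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_stimuli = 50
model_activations = rng.standard_normal((n_stimuli, 512))   # units of one layer
brain_responses = rng.standard_normal((n_stimuli, 200))     # voxels in an ROI

# One RDM per system: pairwise dissimilarity between stimulus patterns.
model_rdm = pdist(model_activations, metric="correlation")
brain_rdm = pdist(brain_responses, metric="correlation")

# Similarity of the two geometries, rank-based to stay scale-free.
rho, _ = spearmanr(model_rdm, brain_rdm)
print(f"model-brain RSA (Spearman rho): {rho:.3f}")
```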
And we tested this, in this case, by training the algorithm with 600 hours of speech data, which is very roughly about a year's exposure to speech for a human being. Stephen Wilson 20:00 Well, it depends on how talkative the parents are, right? Jean-Rémi King 20:02 Exactly. It depends. It depends on other things, teenage … Stephen Wilson 20:06 Ok, yeah. Jean-Rémi King 20:07 Yeah, absolutely. Stephen Wilson 20:09 So yeah, I'm seeing there are a couple of different innovative aspects: you're doing it on the raw audio signal rather than on the tokenized transcribed language, and you're doing it with a smaller amount of training data rather than training the large language model on the entire internet and all the history of all human thought. Those I understand, and then the one that I don't fully understand from reading your paper is the self supervised aspect, which you just alluded to there. I don't understand what it means for it to be learning what it needs to learn, so I was wondering if you'd be able to explain that? Jean-Rémi King 20:47 Yes. Let me try to unpack this. So, perhaps the best thing to do is to start with what it is not, right? Language models like GPT are unsupervised, right? You don't need to have a human labeler that says, this is what you should do for this sentence, this is what you should do for this sentence. The way this works is to try to predict missing bits of the data. In the case of GPT, the missing bit is always the last word given the context. So it's basically trying to do autocomplete, next word prediction. And it's without supervision, because you can just crawl Wikipedia or the entire internet to try to predict what is the next word given the 2000 preceding tokens or 2000 preceding words. So that's unsupervised, but what should be predicted here is determined by the experimenter. We ask the algorithm to predict at the word level, or in this case the subword level, but it's a fixed level of representation. What we don't ask the algorithm to do is to predict, for instance, the next idea, right? Or the next narrative structure. We give it a very concrete and well defined goal, which is: what is the next word? And the reason why we do this is because it's very well defined, we actually know the ground truth. So if we go and check, we can say the next word is actually X or Y, and not Z, as you predicted. So for instance, if you have 'once upon a' at the start of a sentence, you ask the algorithm to make a prediction: what is going to be the next word? Is it going to be table, is it going to be dog, is it going to be time? And the algorithm has to guess that it's more likely to be time than dog, because 'once upon a dog' is unlikely given the corpus with which it has been trained. So it's well defined. But as soon as you go to other modalities, and that's the case for vision, it's the case for audio, you realize that this approach is not practical. So in the case of vision, if you try to predict the next pixel given all preceding pixels, at the beginning you do quite well. So if you have, let's say, the first half of your image, and the beginning is a zebra, what the algorithm will try to learn is to predict: okay, this is a black stripe, so I'm going to try to continue the black stripe, and then there should probably be a white stripe, so I'm gonna go white. But then it becomes non-determined.
And so what it's going to try to do is basically to predict something which is half of the time black, half of the time white, and so it's going to predict, basically, gray, which is the wrong prediction. And the reason for this is because what it should be doing is to try to predict a high level feature, not defined in a deterministic fashion at the pixel level, but determined at a higher level, which is, in this case, the notion of texture or stripe. And so in the case of vision, people, I think, have understood for a while that forcing the algorithm to make predictions at the level of the inputs is not practical. Same for audio: if you try to predict the next amplitude of the waveform, which is sampled at 44 kilohertz, that's going to be very, very painful, because it takes a lot of compute just to predict every single sub-millisecond. And so what is being done these days, or one possible path which I think is promising, is this idea of self supervision. In the case of self supervision, you also learn the level of representation which has a chance to be predicted accurately. So to take the example of the zebra and the stripes, basically you ask the algorithm to find a representation such that you can predict accurately what's going to be in the next 10, 100, 1000 pixels. Right. And so in the case of speech here, that's precisely what happens. It's a deep net into which you input the audio waveform, the audio waveform is transformed, and then at some level it generates a categorical representation, a quantized representation. And then the network continues with a transformer, and the goal of the transformer is to predict this middle representation that it learned in the first place. And there are trivial solutions to this problem, for instance predicting constant values or just predicting zeros all the time, and so you have tricks to try to avoid this collapse. And those tricks are basically contrastive learning tricks, where you try to make a prediction such that, if you have several elements in your batch, you would find the right prediction amongst the different elements, which prevents predicting the same thing all the time. Perhaps this is going too much into the details, but the basic idea is that the standard, let's say autoregressive and VAE, models are evaluated at the end of the day at a fixed level of representation, which is determined by the experimenter, whereas the self supervised learning algorithms have to not only learn to predict, but also learn what level of representations are likely to be predicted. So it's a dual problem, which is harder to learn. Yeah. Stephen Wilson 26:38 Yeah. That was a great explanation, I think I understand it a lot better. So how big are these chunks that end up getting predicted? Are they at the level of phonemes? Or morphemes? Or can you not think about it that way? Jean-Rémi King 26:56 So they are defined with a time constant. They're not defined functionally. So I don't think we can directly associate them with phonemes or morphemes or words, but what we can say is that they are on the order of 100, 200 milliseconds. So they are slightly below the phonetic units.
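(A minimal PyTorch sketch of the dual objective just described: encode the waveform, derive targets from the latents themselves, mask some frames, and train a transformer to pick out the right target among distractors. This is a heavily simplified stand-in, not the actual Wav2Vec 2.0 code; the layer sizes, the 30% masking rate, and the linear "quantizer" are assumptions for illustration.)

```python
# Toy sketch of a Wav2Vec-2.0-style dual objective (simplified):
# (1) encode raw audio into latent frames, (2) derive targets from those
# latents, (3) mask some frames and let a transformer predict them, and
# (4) score predictions contrastively against the other frames.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
encoder = nn.Sequential(                      # waveform -> latent frames
    nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.GELU(),
    nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.GELU(),
)
quantizer = nn.Linear(dim, dim)               # stand-in for the real codebook
context = nn.TransformerEncoder(              # predicts the masked frames
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=2,
)

wave = torch.randn(2, 1, 16000)               # two 1-second clips at 16 kHz
latents = encoder(wave).transpose(1, 2)       # (batch, frames, dim)
targets = quantizer(latents)                  # what *should* be predicted

mask = torch.rand(latents.shape[:2]) < 0.3    # hide ~30% of the frames
masked = latents.masked_fill(mask.unsqueeze(-1), 0.0)
predicted = context(masked)

# Contrastive loss: each masked prediction must pick out its own target
# among all frames of the same clip (the distractors), which prevents
# the trivial "predict the same thing everywhere" solution.
sim = F.cosine_similarity(predicted.unsqueeze(2), targets.unsqueeze(1), dim=-1)
labels = torch.arange(sim.shape[1]).expand(sim.shape[0], -1)
loss = F.cross_entropy(sim[mask] / 0.1, labels[mask])
print(loss.item())
```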
And there have been a lot of experiments on what timescale is best for this learning, as evaluated with downstream tasks: if after this you do, let's say, a speech-to-text task, do you get better results by training with longer or shorter units? And the authors have converged on this relatively small temporal scale. And I think this has to do with the fact that the algorithm learns only one level of representation. So it's not predicting the raw waveform, it's predicting at a higher level, but it's still one level of representation that it is trying to predict, which is around the phonetic level. But we touched on this issue a bit in a different article with Charlotte Caucheteux, actually, in a paper that was recently published in Nature Human Behaviour, which is really tapping into this idea. Stephen Wilson 28:14 Yeah, we can talk about that. Yeah, yeah. Okay, I mean, maybe these things are about the size of syllables if they're that length, maybe they're a little shorter than syllables. Okay, so then in this paper, we're talking about the NeurIPS 2022 paper, you then show in which brain areas these predicted representations track with the BOLD responses, after a suitable convolution with the HRF. Can you tell us what you see there? Jean-Rémi King 28:49 Sure. Yeah. So what we try to do is to quantify this functional similarity between the deep nets and each voxel with a linear mapping. So basically, what we do is we learn a regression from the activations of the deep nets to the voxel activations, to try to see whether we can accurately predict whether the voxel is going to be high or low, given the speech sound, given the activations of the deep nets. And that's a pretty standardized approach these days, which was, I mean, the GLM in fMRI is already based on this idea, but it was formulated for this goal in the paper by Jack Gallant and Naselaris in, I think, 2011. So, the method is basically linear algebra. I'm not gonna go too much into the details, but it gives us one number, which is, for each voxel, how similar is it to the activations of the deep nets. And the first thing that we do is this similarity analysis for each layer of the deep net. So the deep net is organized hierarchically: we have the first layer, which just takes the raw waveform, and this representation is passed on to a second layer, which again transforms the representation, which is passed to the third layer, and so on and so forth. And so for each layer, we have a set of activations that we can compare to each voxel. And what we observe is that different voxels in the brain are more or less similar to different layers in the deep net. And the striking observation, when you look at the overall result, is how structured this similarity is. So if you look at A1 responses, you basically get activations which are most similar to the very early layers of the transformer in the deep net.
And the further away you go from A1, the more the activation that's being recorded with fMRI becomes similar to deeper and deeper layers in the deep net, such that if you go to the temporal pole, or to the temporo-parietal junction, or the prefrontal areas, you end up with voxels which are most similar to the deepest layers in Wav2Vec 2.0. And what is really striking is how monotonic this relationship is: the further you are from A1, in a sort of direct-path distance, the more your representations appear to be similar to deeper and deeper layers in the algorithm. Stephen Wilson 31:36 Yeah, just to try and help audio listeners visualize it, I mean, this is figure three in the paper, and to me it resembles kind of concentric circles coming out of A1, right? So in A1, you've got prediction being most successful from, like you said, the earliest layers, the most superficial, most similar to the input. And then, as you said, the further you go out, it's almost like these concentric rings: as you go into the temporal lobe and into the inferior parietal lobe, like the angular gyrus, it kind of maps to deeper and deeper layers. But it's not completely concentric, because it doesn't just randomly go into the insula, and it doesn't just randomly go into the sensory motor strip, right? It very much goes out into the temporal and inferior parietal regions, and also frontal. Jean-Rémi King 32:24 Absolutely. Stephen Wilson 32:24 Which is kind of noncontiguous, the frontal. So it's very much like you said, it's a beautiful figure, by the way, it's like the heart of the paper, but it's obviously capturing something pretty basic about the structure of the language system. Jean-Rémi King 32:41 Absolutely. I mean, when I saw this figure, when we were playing with the data, I was instantly shocked. I was like, wow! You don't usually get this in fMRI. My experience with fMRI before is you get this contrast between, I don't know, Jabberwocky and meaningful text, and you end up with a blob, or let's say a set of blobs, which are different depending on the contrast, and it's very difficult to make sense of these things. Whereas here, the map is remarkably smooth and continuous and simple to describe, in a sense. And I think the reason for this is because we are working with a large number of participants that were made available from different groups. So in this case, I think there were more than 400 participants listening to natural stories, and it's really the big numbers that I think allow retrieving this very simple structuring of language processing in the brain. But it's not just a concentric circle, either. Because if you look at the prefrontal cortex, you have this very interesting sort of gradient within the prefrontal cortex, where you have a stripe that starts from the motor and premotor areas and goes towards IFG, and within the inferior frontal gyrus you also have some gradients, which I think could make sense in light of anatomy, because we know that different parts of IFG project, through the white matter tracts, to different parts of the temporal lobe. And if you pay close attention, you will see that these actually match with our expectations. So it's a very striking figure, I find. Stephen Wilson 34:27 Yeah.
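(A generic sketch of the linear mapping just described: ridge regression from each layer's activations to every voxel, scored on held-out data, with the best-scoring layer recorded per voxel. In the real analyses the activations are time-aligned to the fMRI and convolved with an HRF; all shapes and data here are simulated placeholders.)

```python
# Sketch of the layer-to-voxel linear mapping: for each layer, fit a ridge
# regression from its activations to every voxel, score it on held-out
# data, then record which layer best predicts each voxel.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_trs, n_voxels, n_layers, layer_dim = 600, 1000, 12, 128

# Stand-ins: one activation matrix per layer, assumed already downsampled
# to the fMRI sampling rate and convolved with an HRF, plus the BOLD data.
layer_activations = [rng.standard_normal((n_trs, layer_dim)) for _ in range(n_layers)]
bold = rng.standard_normal((n_trs, n_voxels))

scores = np.zeros((n_layers, n_voxels))
for li, X in enumerate(layer_activations):
    X_tr, X_te, y_tr, y_te = train_test_split(X, bold, test_size=0.2, shuffle=False)
    model = RidgeCV(alphas=np.logspace(-1, 4, 6)).fit(X_tr, y_tr)
    pred = model.predict(X_te)
    # Brain score: per-voxel correlation between predicted and held-out BOLD.
    pred_z = (pred - pred.mean(0)) / pred.std(0)
    true_z = (y_te - y_te.mean(0)) / y_te.std(0)
    scores[li] = (pred_z * true_z).mean(0)

preferred_layer = scores.argmax(0)   # one "best layer" index per voxel
print(preferred_layer[:10])
```

(Plotting `preferred_layer` on the cortical surface is what produces the kind of map discussed next.)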
Jean-Rémi King 34:27 Because, not only visually, but also because of what this means. And so when I listen to my previous postdoc advisor, David Poeppel, I sometimes hear him sort of criticize the whole approach as being too technical and too fancy and sort of forgetting the ideas, the criticism being: okay, but these are models with billions of parameters and you're just doing a huge regression and we don't understand anything in there. But here what I find striking is that the optimization function is very simple to describe. It is just one equation. You say, you have two things to do: you have to learn to predict, and you have to learn what should be predicted. And this is the goal, right? And this is sort of the essence of it all. And if you let the algorithm optimize this function, then it naturally comes up with a hierarchy of representations, which seems to provide a very strong structure, or at least a seemingly simple, or simple enough, organization of speech processing in the brain. And to me, that's quite striking. Stephen Wilson 35:39 Yeah. No, I mean, it's not overly complicated, well, I mean, I think it's probably very complicated to implement, but like you said, there is a simplicity to it as well, and when it gives you a result that makes sense, it's definitely reassuring. You know, I was a little bit, like, you mentioned in the paper too that you do interpret these gradients in the frontal areas as well. And I looked at it a bit, and I was like, you know, I think you might have to do a bit of convincing for me there, because I get it with the temporal, the temporo-parietal thing is just pristine. And, you know, it's not trivial, either, because there were big debates between, like, you know, Greg Hickok and Sophie Scott, for instance, as to whether the predominant direction of processing in the temporal lobe was headed anteriorly or posteriorly from A1, and your data basically shows, well, there's no winner there. They're actually both right, because it's going in both directions. So, you know, I do think this data is not just a pretty picture, it does address open questions. But I'm not totally convinced about the frontal gradients. I'm not sure if you have more data that might sort of prove that those are replicable and meaningful and related to the connectivity in some way that makes sense. Jean-Rémi King 37:00 No, I think it's just a hunch. So, this is just a first study, we did not look for these gradients in the prefrontal cortex, we just observed them. In retrospect, I think they make some sense from an anatomical point of view, and let's take one example, in the case of the motor cortex. So, as I said at the beginning of this conversation, I don't have a background in the neurobiology of language, and I'm not a strong defender of, let's say, the motor theory of language. To me, it was kind of a story, like we have many stories in science and cognitive science in particular.
And so first of all, seeing that you had strong activation in the motor cortex, it was like, okay, maybe there is something to the story, and then to see that the representations in the motor cortex appear to be lower level than the representations that we observe in the premotor areas or in SMA. Stephen Wilson 37:56 Yeah. Jean-Rémi King 37:56 I think that also goes in the right direction. Right? It could have been the other way. Stephen Wilson 38:02 Yeah. I see it now. Yeah, you've got a lower layer response in this sort of dorsal part of ventral premotor cortex, which matches up to this area that, like, my friend Eddie Chang, you're probably familiar with his work, you know, he has this paper from 2016, I think the first author is Cheung, where they show that that area up there, do you know the paper I'm talking about? It has like auditory properties. Jean-Rémi King 38:35 I think it's neural correlates of larynx. Is that correct? Stephen Wilson 38:35 Well, that's not what they say. But they kind of show that it's basically an auditory area. This paper is published in some good journal, I forget which one. (Laughter) But anyway, it's 2016. And they show that that area basically has auditory properties. Like, it doesn't really behave like a motor area, it behaves like an auditory area. And so yeah, now I see it. I didn't see it when I was looking at this before. But yeah, out of all your frontal areas, that's the one that's linked up to the earliest layers in your model. So you're capturing the fact that that's more of a sensory area, and then the more prefrontal regions are deeper. So yeah, okay, I buy it. I buy it now. Jean-Rémi King 39:20 I'm not trying to sell it. (Laughter) But, um, the first time I think I encountered these motor activations, I mean, again, it's pretty recent given that I am a newcomer to the field, was with MEG. So when we do the source reconstruction with MEG, we also see, very early on, activation in the motor areas. And at the beginning, I was a bit suspicious because I thought we had just a source reconstruction issue, but now we actually see it also in intracranial recordings, and here with fMRI. So I think all of these different pieces of evidence point toward similar findings. So, clearly, to me, it's just the beginning, right? Again, these are just activations, these are just correlations. We don't know how important these activations will turn out to be, and I think the lesion studies and all this remain completely relevant. But it's the simplicity of the overall organization revealed by this mapping which strikes me first, and I really think of this as: okay, now we can see a bit better how the language network is structured to process language, but everything remains to be done and to be investigated more thoroughly in light of recent studies and anatomy and individual variations. Stephen Wilson 40:41 Yeah, sure. But yeah, this is definitely good groundwork to build on. Yeah, definitely recommend everybody checks out that figure. So, can we move on to the next paper? Or do you want to say anything more about this one? Jean-Rémi King 40:58 It's your podcast. Stephen Wilson 41:02 Okay, I just do my best to structure things. But you know….
Jean-Rémi King 41:05 There's one thing that I can say here. Because, again, I was surprised, and I think in retrospect I shouldn't have been, given what was said in the literature, but I was just surprised. So when we do this comparison between, in this case, Wav2Vec 2.0 and the brain, we obtain a similarity score. And as I mentioned, we do it for each layer, and we find this structure. And then the rest of the paper goes much more in depth into what kind of learning strategy leads to more or less high similarity scores. So we trained Wav2Vec 2.0 on speech, on non-speech, on speech from a different language than the participants were exposed to, and so on. And the striking thing to me at the time was that if you take random weights, you already get very high similarity scores: you get at least 70% of the variance explained by the best model. And at the beginning, I thought we did something wrong, that there was a mistake in our pipeline and all this, but it was already described as such in different papers, including the paper by Josh McDermott from 2017. It just was not necessarily emphasized. And I think, in retrospect, perhaps we should not forget about this: even an architecture which is not trained actually has representations which linearly map onto the brain. Simply because of the convolutional, hierarchical structure of the network, you already get sort of a very good first step. And so the learning comes as something which will increase the similarity, but it's clearly not the only thing which makes the model similar to the brain. Stephen Wilson 43:00 Okay. Yeah, that was one of the things that I had written down to ask you, so I'm glad you went back to it: why does the untrained model succeed at all? But I still don't really understand why, based on what you just said, because why would the structure of the model be enough to make it match up to BOLD fMRI data? Jean-Rémi King 43:27 So I do not know why, and again, here I have only some intuitions. The way I think of this is that sound is structured in time. And so if you apply a mathematical operation which preserves the temporal components, you will generate a representation, a new representation, in the sense that it's information which was not linearly readable before but is now linearly readable. You will have something which is not completely random. And so, the way I see representation learning, or learning in general, is that you need to find combinations of features which are most usable to act on the world, or to predict what's going to happen. And this combination has to be structured, and space and time basically provide you with very strong inductive biases. So convolution in space or convolution in time preserves the temporal or spatial structure. And so when you apply these nonlinearities in between layers, you generate more and more complex representations, and if they are sort of biased towards preserving temporal or spatial structure, even the random ones may be a good start, as opposed to just completely scattering or shattering the information. So that's how I think of this. But the truth is that, again, I do not know why it works so well with the random networks.
Stephen Wilson 45:16 Do the random networks also replicate this kind of almost concentric structure that we were talking about? Jean-Rémi King 45:24 They have a bit of it, yeah, but it's less strong than what we have observed with Wav2Vec 2.0. Stephen Wilson 45:31 Could it be that the temporal receptive field increases as you go deeper in the layers, even in the random, untrained network? Would that be a potential explanation? Jean-Rémi King 45:43 So, Wav2Vec 2.0 is organized into two bricks, right? There is a first deep net which is convolutional, and here, the deeper you go in the network, the larger the temporal receptive field of each unit, simply because it's built hierarchically, with each unit having its own receptive field. So naturally, you get this gradient. But in a transformer, there is no such thing. So you need to learn to build larger and larger receptive fields, because the transformer basically sees it all: even the first layer is able to combine all of the tokens from anywhere in the context. But learning will bias, and in fact this is what we observe, learning will bias the first layers to focus on what's happening nearby from a positional embedding point of view, and so they will naturally build smaller receptive fields, whereas the deeper layers will tend to learn larger receptive fields. But in principle, if you take a random transformer, then you do not have this bias. Stephen Wilson 46:50 Okay. So it couldn't explain that. It could only explain sort of asymmetries in the convolutional layers, not in the transformer layers. Jean-Rémi King 46:58 Absolutely. Yeah. Stephen Wilson 46:59 Okay, so there's still some things to understand here. Okay. Let's talk about the next paper, yeah? Jean-Rémi King 47:10 Sure. Stephen Wilson 47:10 This one's called 'Brains and algorithms partially converge in natural language processing' by Charlotte Caucheteux and yourself in Communications Biology, 2022. And this one, I think the essential step forward is to show how this convergence between the models and the brain is really driven by the ability of the models to predict. So that prediction is what explains success. Is that a fair way to summarize its main point? Jean-Rémi King 47:53 Yeah, I think the main result is that, so the question that we ask is: what factors lead an algorithm to be more or less similar to the brain? And we already tapped into this question through the previous discussion. So, okay, we observe a functional mapping between language models and the brain, and we see that some models correlate better with the brain and some correlate less with the brain. And in the literature, what was not clear is what made an algorithm more or less similar to the brain, because they varied in pretty much everything, right? So if you compare GPT-2 and BERT and, I don't know, LSTMs and all the models that are available online, they have different architectures, they have been trained with different objectives, with different optimizers, with different databases, with different sizes of databases, and so you don't really know whether, let's say, GPT-2 is working better than any other algorithm because it's a better architecture, because it's been trained with more data, or because of some other factors.
And so, when we released this study, it was the same week as a study from Martin Schrimpf and Ev Fedorenko, where they did this kind of mapping with the existing models like BERT and GPT-2 and RoBERTa and all of this zoo of models available. And they came to a similar conclusion: it seems that one variable that predicts very well whether a model will be similar to the brain or not is its ability to predict the next word. So we were really happy to see that two independent labs sort of came to the same conclusion. Stephen Wilson 49:55 Okay, let me say that again, just to make it real clear, because I think that's so important. There are many different model architectures you can consider and many different parameters you can vary, but the biggest factor that predicts whether a model is going to do a good job of matching the brain is how well it can predict the next word, in the sense in which all of these models are set up. Because now we've kind of gone back to talking about text-based models, right? So we're not working with the audio or auditory signal anymore. We're back in sort of classic, by which I mean the last two years, language models where it's word prediction. Okay. Let's go on. Jean-Rémi King 50:34 Absolutely. Yeah. And so that's sort of the main result, and how do we know about this? So again, what we did is we analyzed fMRI data, but also magnetoencephalography data, which were recorded by Jan Mathijs Schoffelen at the Donders Institute. And in this study, participants had to read sentences in a heavily decontextualized fashion. So you only have a sentence, and then you have a five second delay, and then it's another sentence, which has nothing to do with the previous sentence. So it's quite different from the previous study, where people are listening to podcasts. Stephen Wilson 51:13 Yeah, this is an RSVP paradigm, right? Rapid serial visual presentation. So, yeah. Jean-Rémi King 51:20 Absolutely. So it's a reading task. And so the question that we had is, okay, what drives an algorithm, or a language model to be more precise, to be more or less similar to the brain? And so what Charlotte did is basically to retrain a lot of different architectures that are based on the GPT-2 architecture, based on the BERT architecture. She varied the depth of the transformer, how many activation units there are for each layer, how many attentional gates there are. So she really did sort of a systematic grid search there, and for each model, for each embedding, you can get one value, which is: okay, how similar is it to the brain after a given training? And then you can just feed this to an ANOVA. In this case, we used a nonparametric analysis, but in principle the idea is that you ask, amongst all of those factors, the depth of the architecture, the width of the architecture, the number of attentional gates, what we ask the algorithm to do, which variable contributes to a better brain score, that is, a higher similarity with regard to brain activations. And as soon as we include one variable, which is the performance of the model at predicting the next word, it basically soaks up all of the variance.
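(A schematic of the kind of analysis just described, with simulated numbers: one brain score per trained model, related to the properties that were varied, to see which factor accounts for that score. The actual study used a nonparametric approach; this simple per-factor R² comparison is an illustrative stand-in.)

```python
# Schematic of the factor analysis: given one brain score per trained
# model, ask which property of the model (depth, width, attention heads,
# next-word-prediction accuracy) explains that score. Simulated data.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_models = 200
models = pd.DataFrame({
    "depth": rng.integers(2, 13, n_models),
    "width": rng.choice([256, 512, 768, 1024], n_models),
    "n_heads": rng.choice([4, 8, 12], n_models),
    "next_word_acc": rng.uniform(0.2, 0.5, n_models),
})
# Simulate the empirical pattern: brain score tracks next-word accuracy.
models["brain_score"] = 0.5 * models["next_word_acc"] + 0.01 * rng.standard_normal(n_models)

def factor_r2(df, factor):
    """Variance in brain score explained by one factor on its own."""
    X = df[[factor]].to_numpy(dtype=float)
    return LinearRegression().fit(X, df["brain_score"]).score(X, df["brain_score"])

for factor in ["depth", "width", "n_heads", "next_word_acc"]:
    print(f"{factor:>15}: R^2 = {factor_r2(models, factor):.3f}")
```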
So the ability of the algorithm to predict the next word, no matter how big or deep the network, this ability suffices to predict whether the model will turn out to be more or less similar to the brain. And that was very intriguing in some sense. We did not necessarily anticipate that the other variables would have such a small contribution. They are all significant; again, we're working with a lot of participants there, I think it's 100 participants for fMRI and the same for MEG. So basically all of the variables have statistical significance, but they're really small as compared to this next word prediction effect. And that suggests one thing, which is that the behavior or the task of the model is really what matters, and the architecture and the attentional gates and number of layers and all this, these are merely means towards that task. And if you can do this task, then basically, even if you're a small network, that should suffice to represent things similarly to the brain. Stephen Wilson 54:00 So does that make you think that the brain is engaged in predictive coding? Or do you think that to get the next word right, you need to develop good representations of language? Jean-Rémi King 54:11 I don't think so, for either question. So, I think the first thing it says is that we should take this seriously and not spend perhaps too much time on the architecture, trying to pinpoint exactly the relevance of a particular layer in learning intelligent representations, but rather look towards the goal, and in this case the goal is indeed next word prediction, or word completion. And it is this goal that basically drives the rise of smart representations, representations that can be useful for something else. Whether the brain follows the same principle, I very much doubt it. And the main reason here is that, unlike language models, we are not exposed to the same amount of words. So we cannot just rely on trying to predict the next word, because in our lifetime we don't hear a sufficient number of words to complete this task. So it's very interesting: over the past 60 years, there's been a lot of debate, right, on what needs to be innate and what can be acquired in the context of language. And there were a lot of arguments, really sort of math-based arguments, saying no, it's just not possible to learn the structure of language, to learn syntax, with simple exposure; conclusion, you need to have an innate bias for, let's say, recursive structures, if we go towards generative grammar. And I think now it's pretty clear, and this has been argued, for instance, by Steven Piantadosi, that this argument is clearly wrong: language models now can process language, they can retrieve syntactic structure, and they are trained with a huge amount of data. And so statistics alone, let's say, suffices to learn the structures of language. But perhaps this clarifies the debate by saying, okay, maybe it is possible after all, but it requires a huge amount of data, data that we thought before was just not accessible. I think that's also why people got it wrong: it was not conceivable that we could feed an algorithm with so much text.
And so, now that this has been proven, the question that still remains is: okay, with a relatively small amount of word exposure, what computational architecture, or perhaps what objective, suffices to learn language efficiently? And my strong conviction is that next word prediction is not the right objective, because again, we don't hear a sufficient number of words per day. So just as a rough estimation, the few studies that I could find suggested around 13,000 words a day. It varies immensely across individuals, depending on whether you're a teenager or a child, depending on your social class, and everything. But the average was 13,000 words a day, which fits within 50 books a year, right? So 50 books a year. And then we can decide how many years of language you want for language acquisition, but basically it's going to be on the order of 500 books, if really we want to take a large margin. And GPT-4 now, we don't actually have the exact numbers, but they train on the order of hundreds of millions of books. Stephen Wilson 58:04 Yeah. Jean-Rémi King 58:05 So it's just orders and orders of magnitude higher. And so, clearly, we are missing something fundamental here. The objective that these algorithms are trained with, this next word prediction objective, I mean, clearly they work at scale, and this is very impressive. Like everyone, I'm amazed every six months by the new power of deep learning and language models in particular. But still, we have to recognize that there is something extremely inefficient here, that they require an amount of data which is just ridiculous as compared to what children do. And so I think that the historic question remains, and clearly we haven't solved this problem yet. And so I'm still very excited by this. So this was a long tangent towards your question. But the question was about this next word prediction objective: is this, at the end of the day, what we do? I don't think this is what we do, because language models require too much data for this rule to succeed. Stephen Wilson 59:09 Well, I mean, yeah, so you're answering from a learning perspective, and that's a very interesting tangent for sure. A lot of things I could follow up on there. I mean, just one quickly: it's not even sufficient to be able to learn from what the average child receives, right? You have to be able to learn from what the child in the poor environment receives, because people can learn even in very impoverished environments. So it's gotta be able to deal with, like, maybe 10% of what would be normal, and still the kids will acquire language with no problem. Jean-Rémi King 59:42 Yeah, absolutely. I mean, we can tie our hands behind our backs and make the challenge even more difficult, but even with, let's say, rich environments, the amount of data that we are exposed to is ridiculously small compared to language models, so we don't even need to go into the extreme cases. But I certainly agree that children have a natural bias for babbling, for learning languages. This is obviously something that we do not see in other species. And even after heavy training, we don't manage to train gorillas or chimps to learn, for instance, sign language, or at least, whatever they learn is extremely poor as compared to what children are able to acquire in a couple of years.
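(A back-of-the-envelope check of the exposure estimate quoted above; the roughly 90,000 words per book is an assumption for illustration, about the length of a typical novel.)

```python
# Back-of-the-envelope check of the exposure estimate quoted above.
words_per_day = 13_000
words_per_book = 90_000          # assumption: a typical novel

words_per_year = words_per_day * 365          # ~4.7 million words
books_per_year = words_per_year / words_per_book
print(f"~{books_per_year:.0f} books/year, ~{10 * books_per_year:.0f} books over 10 years")
# -> about 50 books a year, i.e. on the order of 500 books for a generous
#    window of language acquisition, versus corpora equivalent to many
#    millions of books for today's large language models.
```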
And so clearly there is an algorithm, there is an objective or an architecture, which allows us to learn language extremely efficiently. I think this is what Chomsky and many others had in mind in their theories. And so I think this line of thought remains extremely relevant, and we should not dismiss it just because we now have language models that work at scale. It depends, again, on the objective. If the objective is just to have an AI system that learns to process language at any cost, sure, we've arrived there, and this is done, and perhaps we don't need generative grammars and theory. But if the question is, what are the rules or principles that suffice to learn efficiently? There, I think, we are just at the beginning. We haven't found it yet, at least. Stephen Wilson 1:01:23 Yeah. Yeah, I agree. There's something in this paper that's interesting to me, and Schrimpf et al., who you mentioned, did the same thing, right? Which is that, when you're doing these model comparisons, you kind of need a nice, simple, objective way of talking about how well the models fit the brain. And they use this concept of a noise ceiling, which you do too. And the noise ceiling is basically how well you can predict one participant's brain from the other participants' brains. So it's like a kind of inter-subject correlation analysis. And the idea would be, well, we can never hope to predict from a language model what isn't shared among all humans, right? There's always going to be individual variability in people's BOLD activation, so it's unfair to make the model try and capture that, right? The model can only possibly capture what is shared among all humans. So the correct denominator, when you're evaluating performance, is how well you can predict one person from other people. Okay? Did I say that right? Jean-Rémi King 1:02:28 Yeah. Stephen Wilson 1:02:29 And in Schrimpf et al., they get very close to 100, they say very close to 100%. Like, these models are predicting almost everything that can be predicted from other humans. Do you get that in yours as well? Or no? I didn't really get that from your paper, whether yours was like that too. Jean-Rémi King 1:02:50 No. We don't get that. But there is a first major difference, which is that in Schrimpf et al., they focus on the 10% best voxels. So for each of the 10 individuals that they analyze, I think it's 10, maybe it's seven, I forget, they take the 10% best voxels and then they do the whole analysis on this. And that's quite different from what we do, in a sense, because we do the analysis on all the voxels. Stephen Wilson 1:03:23 I think that would be enough to explain the difference. Yeah. Jean-Rémi King 1:03:26 So that's the first difference. The second difference is that the noise ceiling that they use is based on an extrapolation. So, if you really get fully into the methods, they extrapolate: okay, if we add more participants, can we expect to have a noise ceiling which ramps up more or less quickly? And they derive the noise ceiling from this sort of projection. And we don't do that, we just take the whole cohort and we say, okay, this is the noise ceiling. We don't try to extrapolate whether, if we added a cohort of 1000 participants, we would get something better.
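(A minimal sketch of the inter-subject noise ceiling idea described above: predict each subject's responses from the average of the others and take the resulting per-voxel correlation as the ceiling, with no extrapolation and no voxel selection. Data here are simulated placeholders.)

```python
# Minimal leave-one-subject-out noise ceiling: how well can one subject's
# responses be predicted from the other subjects? Simulated data.
import numpy as np

rng = np.random.default_rng(0)
n_subjects, n_trs, n_voxels = 20, 300, 500
shared = rng.standard_normal((n_trs, n_voxels))             # shared signal
data = shared + 2.0 * rng.standard_normal((n_subjects, n_trs, n_voxels))

def corr_per_voxel(a, b):
    a = (a - a.mean(0)) / a.std(0)
    b = (b - b.mean(0)) / b.std(0)
    return (a * b).mean(0)

ceilings = []
for s in range(n_subjects):
    others = np.delete(data, s, axis=0).mean(0)              # group prediction
    ceilings.append(corr_per_voxel(data[s], others))
noise_ceiling = np.mean(ceilings, axis=0)                    # per-voxel ceiling
print(noise_ceiling[:5])
```

(Note that the "group prediction" itself is built from noisy subjects, which is one reason, as discussed next, that a model can sometimes exceed a ceiling constructed this way.)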
And so we've had a lot of pushback on noise ceilings over the years, I think it's a bit less the case now. But at first it was really hard, because we were always asked: okay, but you guys don't provide noise ceilings, participants never hear the same sentence twice, and so we don't know how much variance is explained, and therefore we reject the paper. And I think this is missing the point. So I think noise ceilings can be useful, right? Because they give us an estimate of how good we are. But in many cases, in the data that we've analyzed, we actually had models that were better than the noise ceiling that we built. And this is for an obvious reason, which is that when we train to predict the brain activity of a given subject from all other subjects, all of the other subjects are also noisy. And so you're learning to predict something from noisy data, and that can be challenging. And so there are a lot of arbitrary decisions in how you build this noise ceiling: as I said, the projection is one thing, the voxels that you select are another thing, whether you base this on repetition or not, within subjects or across subjects, all of this. And then you end up, I think, sort of building a whole analysis on something which is not so stable; it depends on how you choose your noise ceiling. And so what I tell students is to not care about the noise ceiling whatsoever until the very end; we'll do the noise ceiling at the end. But I think what is reproducible, what is much more robust, is to provide the actual effect sizes without the noise ceiling. So we say, okay, this is our score on the raw waveform, on the raw BOLD signal, on the raw imaging signal. And this is useful, I think, because if you're another lab, you will also get this raw signal, and so you can evaluate your model with this kind of data. And it will be easier to compare across studies than if we have sort of a zoo of different noise ceilings. Again, I don't want to dismiss noise ceilings altogether, I think they are useful, but there are too many choices at the moment for them to be a must. Stephen Wilson 1:06:18 Okay. Jean-Rémi King 1:06:18 And in the case of language, it's even more the case than in other modalities. So in the case of images, for instance, we know from monkey electrophysiology and fMRI that if you present an image multiple times, most of the activations are similar across repetitions. But we know that in language this is not the case: if you hear the same sentence twice, we know that, for instance, prefrontal areas are activated less the second time, even less the third time. And this is not just an adaptation effect; you can have a repetition with sort of a distractor of multiple minutes in between. For instance, the work of Ghislaine Dehaene-Lambertz from the early 2000s shows this: if you hear the sentence a second or third time within the session, the prefrontal cortex reacts a lot less. And I think the reason for this, at least my intuition, but obviously we need to do a lot of studies to confirm this, the intuition is that we build language structures on the fly the first time we hear them.
But as soon as we know what is meant by a given sentence, we sort of form these idioms online, or we can sort of extract the meaning without having to build the whole syntactic structure and resolve the ambiguities. And so many of the voxels, or many of the neurons, will not have to be recruited, basically, to achieve the same goal. So perhaps this is also the case in vision, but in the case of language it's particularly the case. And so for noise ceilings, the consequence of this is that you cannot present the same sentence multiple times and hope that the participants will process it in the same way. And therefore the very premise of the noise ceiling here is jeopardized. Stephen Wilson 1:08:01 Yeah, that makes sense. I mean, language is just more contextualized. It can't not be contextualized, relative to something like looking at a visual scene, where you can look at the visual scene and at least early visual areas will respond the same way. Yeah, okay. Do we have time to talk about one more paper? Sure. All right, let's talk about Caucheteux et al., 2023. This one's called ‘Evidence of a predictive coding hierarchy in the human brain listening to speech’, in Nature Human Behaviour. Just came out, congratulations. Jean-Rémi King 1:08:35 Thank you. Stephen Wilson 1:08:35 And this one kind of starts from the premise that large language models are not as good as humans at processing language, and then you sort of ask why that might be. And you have a possible explanation in mind, which is that whereas the LLMs are predicting just the next word, what humans might be doing is making longer-term predictions, like predicting more words ahead and perhaps predicting some kind of hierarchical structure. And I really liked the way you start: in figure one you have this nice sort of layout of the experiment, and the example sentence is ‘great, your paper’ and then the prediction is ‘is not rejected’, (Laughter) which is what we all hope for with our papers, right? That's the prediction that we want to make, although it's not always borne out. Jean-Rémi King 1:09:28 Absolutely. Yeah. We start from this, I mean, this resonates with a lot of the points that we discussed earlier. But so the example that we chose in this figure, which is clearly an inside joke and I'm not sure it's entirely appropriate, but whatever, is to focus on negation. And I very much like negation, because it's, to me, a very minimalist example of a very interesting composition. In the case of negation, if you say, for instance, ‘it's not rejected’, you know that you need to combine the words in a nonlinear fashion in order to retrieve the meaning of the phrase or the sentence. If you were just to do a linear composition, it would be sort of a bag of words. So at best, you would be able to say that it's something about rejection, but you cannot retrieve ‘not rejected’. And this is even more obvious when you have a slightly more complex sentence. So if you say ‘it is not small, and green’, and you have another sentence which is ‘it is small, and not green’, if you do a linear combination of these, you won't be able to understand the meaning, because you need to know that ‘not’ is applied to one adjective, or specific words, and not the other. And this representation has to be a nonlinear composition. So that's why we sort of focus on this example.
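A tiny illustration of the point about negation and linear composition: with a pure bag-of-words representation, the two example sentences are indistinguishable.

```python
# Both sentences contain exactly the same words, so any model that just sums or
# averages word vectors (a linear, bag-of-words composition) assigns them the same
# representation, even though 'not' scopes over a different adjective in each.
from collections import Counter

s1 = "it is not small and green"
s2 = "it is small and not green"
print(Counter(s1.split()) == Counter(s2.split()))   # True: identical bags of words
```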
And the idea here is to focus on this issue that I mentioned earlier, which is that when you train with this next word prediction objective, it works at scale, because you have a lot of data, but you're not pushing the algorithm to try to learn to predict the next idea. And so in the case of composition, we wanted to focus on this: if we say ‘great’, it means that the following part of the sentence should be something positive. ‘Rejected’ is negative, at least from our point of view, it's kind of negatively correlated. But if it's combined with ‘not’, it's fine, you have the right prediction in mind, you know it's going to be something positive, you have ‘not rejected’ in, let's say, your validation, and you verify that ‘not rejected’ is something positive, and therefore your prediction was correct. Whereas if you had tried to predict, let's say, 'great, your paper is accepted', you would have gotten that wrong, because the true word was ‘not’ and the predicted word was ‘accepted’. And so you would have told the algorithm, okay, you're completely wrong, try something else, whereas actually you had the right idea. So again, that's a long tangent to come back to the same idea, which is that we should try to have algorithms that do not just try to do autocomplete. If you can do autocomplete, why not? But they should also try to predict the next ideas. The next idea, perhaps, and then the next hierarchy of ideas, because you have structures that unfold over different timescales: what is going to be said within this constituent, within the sentence, within this paragraph, what is the narrative structure of the story. All of these things are to some extent determined and can be predicted. And so we should optimize these algorithms on this goal, as opposed to the sole goal of trying to predict the next word. Stephen Wilson 1:12:58 Yeah. Jean-Rémi King 1:12:58 The analogy that I have, and perhaps it's a wrong analogy, but when I think of this, I think of how we would teach a kid to ride a bike. And so here, what we have with language models is, we basically tell the child, or the language model, to just focus on what's exactly in front of the wheel. Okay, try to predict whether there's going to be a little stone, or whether you should turn left or right, right now, just avoid the obstacle, which is a very proximal, short-sighted objective. And of course you need to do this; if you don't do this, you will fall. But if we want the child or the agent to be intelligent, we also need to say: anticipate your turn, anticipate how, where you're going to direct your gaze. And ultimately also, how do you drive around a city? How do you plan your route if you want to go from point A to point B? And if you have enough driving experience, perhaps you can do this only by looking at what's exactly in front of your wheel, and you will learn every turn of the city. And I've no doubt that this is basically what language models do. But that's probably not the right and most efficient way to learn. And so that's sort of the idea here: we should have algorithms that are trained to predict multiple levels of representation, and not just hope that these levels of representation will emerge from the mere amount of data that we feed them with. Stephen Wilson 1:14:27 Yeah.
So there's so much behind that example, which in the paper you just put in the figure, and, you know, I don't think you talk about the negation and the unique challenge of it. That's neat. So, in the paper, you address this by introducing a forecast window, where the models have to predict different numbers of words into the future. And the hope is to see whether introducing these forecast windows into the model improves the correspondence between the models and the brains. So can you kind of explain how the forecast windows fit into the whole architecture? That was a little bit, I didn't really understand that when I was reading it. Jean-Rémi King 1:15:15 Absolutely. Perhaps I should first say that the negation example is something we are pursuing with Arianna Zuanazzi in David Poeppel’s lab, so we have a paper on arXiv specifically focusing on negation, but outside the domain of language models. For those interested in the brain basis of minimalist composition, like negation, that’s, I think, a cool paper to have a look at. In this paper with Charlotte Caucheteux and Alex Gramfort, we indeed change the objective. That's the goal: we want to change the objective of a standard language model so that it doesn't just predict the next word, but potentially forecasts longer-term representations. And for this, we use two different strategies, independently from one another. One which is based on linear algebra, and the other one which is based on optimization. So perhaps I can start with the optimization one, because it's, I think, simpler, but also a bit less conclusive, because it's sort of deep learning magic, as opposed to linear algebra, which sort of decomposes things in a clear fashion. Which is the exact reverse of what we did in the paper. So, in the optimization case, what we do is we take GPT-2 and we train it to predict the next word. And then we take another GPT-2 and we train it to predict the next word and the latent representations of the next words, and I think we take something like the seven or eight words after the current item. So if, for instance, you have ‘once upon a’, the first model is trained to predict ‘time’, and the next model is trained to predict ‘time’ and what's going to happen in seven words. But we know that what's going to happen in seven words is non-deterministic; it's very hard to know what word will be said seven words from now, just because there are so many possibilities, sort of the forking paths problem. So what we train the algorithm to do is to learn to predict the latent representations of the future words. Stephen Wilson 1:17:24 Not the actual words. Jean-Rémi King 1:17:25 Not the actual words, but the latent representation. And so, two objectives: one which is proximal, the language model's next word prediction, and one which is distant, which is trying to predict the latent representations of what's going to happen seven words from now. And what we show is that these dual objectives lead to activations which are more similar to the brain than the sole proximal objective, which is next word prediction. That's sort of the bottom line. And then we have this other approach, which is not based on GPT-2 fine-tuning or retraining. It's based on sort of a linear algebra decomposition. So what we do is a bit more complex technically, but conceptually it's the same.
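A minimal sketch of what such a dual objective could look like, assuming a frozen pretrained GPT-2 provides the target latents for the word d positions ahead; the layer index, the distance d, the loss weight, and the linear head are illustrative choices, not the exact training setup of the paper.

```python
# Dual objective: standard next-word prediction plus predicting the latent
# representation of the token ~7 positions ahead (taken from a frozen GPT-2).
# Illustrative sketch only; d, alpha, the layer index and the linear head are assumptions.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel

d, alpha = 7, 1.0
student = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True)
teacher = GPT2LMHeadModel.from_pretrained("gpt2", output_hidden_states=True).eval()
head = torch.nn.Linear(768, 768)       # maps the current hidden state to the future latent

def dual_loss(input_ids):
    out = student(input_ids, labels=input_ids)            # next-word (causal LM) loss
    hidden = out.hidden_states[8]                         # an intermediate layer, (B, T, 768)
    with torch.no_grad():
        target = teacher(input_ids).hidden_states[8]      # frozen latents of the same text
    forecast = F.mse_loss(head(hidden[:, :-d]), target[:, d:])   # predict latent at t+d from t
    return out.loss + alpha * forecast

# Example call on a random batch of token ids (batch of 2, 64 tokens each):
print(dual_loss(torch.randint(0, 50257, (2, 64))))
```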
So, we take the activations of GPT-2 in response to a given word and its preceding context. And we ask, okay, what is the similarity between GPT-2 and the brain? That gives us one score. And then we say, if we were to add to these activations of GPT-2 the future activations of GPT-2, would that increase the similarity with the brain? And the answer is yes, up to, well, it peaks around 9 or 10 words, if I recall correctly. And so we can do this systematically. We can say, if we add the future activations of the words, so let's say we sort of peek into what's going to happen in the future, we embed these words into GPT-2, we extract the activations, we use these additional future activations and stack them onto the current GPT-2 activations, and we ask, is it similar to the brain, yes or no? We obtain a higher similarity score, a higher brain score. And we can do this very systematically: we can vary the number of words we peek into in the future, we can vary how deep the representations of these words should be, and do this similarity assessment systematically. And the point, all of this is very technical, and I cannot imagine how hard it is to follow what I'm saying in a podcast without any diagrams. But the point is that we have methods to evaluate whether an algorithm which has long-term forecast predictions is more similar to the brain than an algorithm which has only short-term predictions, like GPT-2. So that's the method. First result is that it works better. Stephen Wilson 1:20:06 Hang on a second. I'm also curious to know whether people understand. I think so, actually, because maybe for me at least, I mean, I guess I've already read the paper, but I understand it more already, having heard you say it out loud. I think there's something about just describing things in natural conversational language that makes them easier to understand, at least I hope so. That's the premise of the podcast. So yeah, I think people will understand the gist of it. And there's always the paper if they want the details. Okay, so tell us what you found. Jean-Rémi King 1:20:41 Sure. I mean, I didn't mean to underestimate your audience. I know this is a pretty advanced audience. So the result is that if you enhance these GPT-2 LLMs, so a language model, with long-term forecast predictions, the activations end up being more similar to the brain. That's sort of the basic finding. And this is not the case everywhere in the brain. It's really the case in the standard language network. So it's the Superior Temporal sulcus and gyrus, prefrontal areas, especially IFG, a bit of the Angular Gyrus, but it's not the case in, let's say, the ventral visual stream or in the motor areas. I have a doubt, I haven't looked at the picture recently, I don't know whether we have a gain in any of the voxels in the motor cortex. But generally speaking, it's really the expected language network. It's typically the type of areas that you would end up with if you were to do a localizer on language, as opposed to some other task. And so those are the regions which are better explained by, more similar to, the algorithm which has a long-range forecast than the short-range forecast one. And from this, we can systematically decompose how the forecast is structured, because we can systematically vary whether the forecast should be short-range or long-range or middle-range.
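Here is a minimal sketch of the stacking comparison just described: a cross-validated ridge regression from GPT-2 activations to a voxel's BOLD signal, with and without the activations of the word d positions ahead concatenated on. The shapes, the ridge model, and the random placeholder data are assumptions; the paper's actual pipeline has more steps.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def brain_score(features, bold):
    """Mean cross-validated R^2 of a ridge regression from model features to BOLD."""
    model = RidgeCV(alphas=np.logspace(-1, 4, 10))
    return cross_val_score(model, features, bold, cv=5, scoring="r2").mean()

def forecast_score(activations, bold, d):
    """Concatenate each word's activation with the activation d words ahead, then score."""
    stacked = np.hstack([activations[:-d], activations[d:]])
    return brain_score(stacked, bold[:-d])

# Placeholder data: 500 words, 768-dim activations, one voxel's BOLD aligned to those words
acts = np.random.randn(500, 768)
bold = np.random.randn(500)
print("current only:", brain_score(acts, bold))
print("with 8-word forecast:", forecast_score(acts, bold, d=8))
```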
And so we try predicting the next word, or two words from now, three words, four words, and so on and so forth. And it peaks, I think, again, between eight and 10 words. It varies slightly, it's not an exact number, depending on the voxel you look at. And what's really interesting is that the range of the forecast depends on where you are in the hierarchy of language. For instance, if you're around the primary auditory areas, the forecast seems to peak at a shorter range than if you are in the prefrontal cortex. Again, this is not just about the representation, it's about the prediction: you would have a better model of the prefrontal cortex if you enhanced your language model with a long-range forecast, and you would have a better model of the auditory areas if you had a short-range forecast, and you have sort of these gradients in between those two extremes. Stephen Wilson 1:23:16 Yeah. Jean-Rémi King 1:23:17 That’s sort of one dimension of this forecast structure. And the other dimension is not how far ahead the forecast happens, but how deep it is. And so for each future word, we can try to predict the word level, which is sort of the lowest possible level, but also the representation that it has in the first layer, and the second layer, and the third layer, and so on and so forth. And that gives us sort of a level of abstraction, loosely defined as how deep the representation is in a transformer. And again, for each voxel in the brain, we can ask: is it better to have a forecast which is rather shallow or rather deep in the network? And again, we observed that prefrontal and parietal areas tend to be associated with deeper forecasts, and auditory areas tend to be associated with shallower forecasts. So that resonates a lot with this idea of predictive coding, where you would have not just one prediction, but a hierarchy of predictions. And these predictions are organized similarly to the hierarchy of inference, of representation, which is that lower-level areas represent the past and predict the future on a relatively short timescale, at a relatively shallow level, whereas the deepest levels of the language network would be learning and representing much longer context and would be anticipating much further away, well, 'much' is perhaps an exaggeration, it's further away than the lower-level regions, and would predict these more abstract levels of representation. Stephen Wilson 1:25:01 Yeah, and it looks like, specifically, prefrontal, not premotor, and the Angular Gyrus is the part of the parietal lobe which looks to be the most extreme on that measure. And then also, I'd say ventral temporal, kind of inferior temporal. It looks like that all also kind of makes sense in terms of being further downstream than those primary auditory areas. I probably should go and eat dinner with my family. But I do want to ask you one more thing, if I can? Jean-Rémi King 1:25:37 Sure. Stephen Wilson 1:25:38 You have this really neat analysis. It's very complex. (Laughter) I'm a little bit hesitant, but I'd love to hear you explain it, where you look at this semantic versus syntactic forecast. So I mean, this is a topic which I just think is interesting: the extent to which we can, you know, parcellate out the language network along those lines.
So can I kind of get you to tell us how you distinguish between those different kinds of predictions? Jean-Rémi King 1:26:10 That's actually, I think, my favorite paper from the PhD of Charlotte Caucheteux, who defended her PhD recently. So this analysis is derived from that paper: we had a paper at ICML, I think in 2021, where we developed this analysis to disentangle syntactic from semantic representations in the brain using language models. And here, we're just applying this analysis in the context of forecasts, whereas in that paper we applied it in the context of just representations. But it's completely analogous, analytically speaking, and the idea is not that difficult. The paper is quite mathy, but the idea, I think, is pretty simple. Usually, what we do is we compare a deep net to the brain in response to the same inputs. So the deep net hears ‘once upon a time’, the participant hears ‘once upon a time’, and we evaluate whether the activations are similar to the activations of the brain. And in this paper, we thought, okay, perhaps what we can do is not present the same inputs, but present an input with the same syntactic structure but a different semantic content. So for instance, ‘once upon a time’, I'm not able to parse this quickly. (Laughter) I can take another example: if you take the following sentence, ‘the giant bowl is on the table’, you can create a sentence which is ‘a red cow lies near the house’. It has the same constituency tree, it has the same dependency tree, but of course it doesn't have the same meaning. And so what we do in this paper is, we made a little algorithm which generates a ton of sentences, and we optimize this algorithm to, basically, at the end, generate sentences that have the same dependency tree as the original sentence. And we present them to the algorithm, and we extract the activations for each of those sentences, which are syntactically matched. The result of this process is that we have activations in the deep net in response to sentences with the same syntactic structure, and we can use those activations to try to predict brain activity. So with this, basically, what we have is a model that tells us what the expected activations are given the syntactic structure. And this model is not derived from linguistics, we don't have any ideas about merge and movement and all this. It has some constraints, because we do generate sentences which have the same dependency trees, so it's not completely random either, but it's kind of a linguistics-free model, in that sense. And we can try to see which areas are predicted effectively by these syntactic activations of the model, as opposed to a full language model. Okay, so that's one analysis. And then we can compare these effects to a random model, or a model which only has access to position: so you generate sentences which have the same number of words, but they don't have the same syntactic structures. And finally, compare this to a model which has the exact same sentence, and so has both syntax and semantics. And by doing this systematic comparison, we can try to see which areas are accounted for by syntactic representations, which areas are accounted for by semantic representations, and which areas are associated with both representations.
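A minimal sketch of the comparison logic just described, with placeholder data standing in for the generated, syntactically matched sentences: we compare the brain score obtained from (1) activations averaged over syntax-matched sentences, (2) a length/position-only control, and (3) the full activations of the actual sentences. The generator, shapes, and scoring details are assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import cross_val_score

def brain_score(X, y):
    model = RidgeCV(alphas=np.logspace(-1, 4, 10))
    return cross_val_score(model, X, y, cv=5, scoring="r2").mean()

n_words, n_dims, k = 400, 768, 10
full = np.random.randn(n_words, n_dims)            # activations for the real sentences

# "Syntax" features: average activations over k generated sentences sharing each
# original sentence's dependency tree but with different content (placeholder data).
syntax = np.random.randn(k, n_words, n_dims).mean(axis=0)

# Position-only control: sentences matched in length but not in syntactic structure.
position = np.random.randn(k, n_words, n_dims).mean(axis=0)

bold = np.random.randn(n_words)                    # one voxel, aligned to the words
for name, X in [("syntax", syntax), ("position", position), ("full", full)]:
    print(name, brain_score(X, bold))
# Voxels where the syntax features approach the full model are read as "syntactic";
# voxels where only the full model does well are read as carrying semantics as well.
```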
So you need both syntactic activations and semantic activations to best account for the activation in a given voxel. And so that's what we do here in this paper with forecasts, and what we observe, very briefly, is that the syntactic forecast seems to be relatively shallow and relatively centered around the Superior Temporal gyrus and Superior Temporal sulcus. And it's not heavily associated, I was a bit disappointed by this, but that's just the data, it's not necessarily heavily associated with, for instance, IFG or the Angular Gyrus. It tends to be relatively centered around the temporal lobe, whereas the semantic forecasts appear to be more distributed. So that perhaps is a clue, a first step towards trying to systematically decompose these activations into something that we can relate our theories to, as opposed to just saying, this is a similar activation between the deep nets and the brain, but we have no idea what this activation actually represents. Stephen Wilson Yeah, it's great. I mean, just to put it a different way, you're degrading the signal in different ways and seeing whether it matters or not, like taking out the semantics but keeping the syntax, or vice versa. You know, those of us that work on syntax and are not, you know, living in the past, will be very happy to see the STS-centric finding. (Laughter) Jean-Rémi King Well, it depends on the school of thought, I guess. Stephen Wilson Well, I mean, okay, like, you know, with patients, we just don't find syntactic deficits following from IFG lesions; we find them very consistently from posterior temporal lesions. So this definitely accords with my expectations, at least. And it's interesting, this syntax map is quite lateralized, maybe more so than the semantic map. We haven't talked much about lateralization today, I mean, that's gonna be a topic for another time. But this is one of the more lateralized ones, so I liked that finding too. So yeah, that's very cool. I think you really did explain it well. And I would just encourage all the listeners to go take a look at this paper, because it's so rich, and there are so many details to discover here. Jean-Rémi King Yes, I think there are probably a lot more things to try to interpret from those brain maps than we did. Stephen Wilson Yeah, that's for sure. Jean-Rémi King That's very clear. Stephen Wilson Yeah, there's a lot more here than we're talking about, we're really just scratching the surface of these papers. All three of the papers we've talked about are very complicated, and they've got whole figures and whole analyses that we haven't even touched on. But, you know, that's there for the readers. Okay, well, I don't want to take up any more of your time. It's been really great talking with you and learning more about your work. And I'm glad that you're in Marseille, that's gonna make it even more convenient for SNL. Jean-Rémi King Well, thank you very much for giving me the opportunity to talk about this. It's a pleasure to discuss these topics. And yeah, I hope the readers, or the listeners, I should say, will not be afraid of the technicalities. It's true that these papers tend to have a lot of technical stuff, math and regressions and all this. But I think you can also understand the papers without going into these elements too deeply.
Of course, you should if you want to criticize and see the potential pitfalls and assumptions that we make. But please don't be afraid of the technicality. I think one can really get the message without understanding the math. Stephen Wilson Yeah. These papers are readable to our field. I mean, nobody, including me, I certainly don't understand lots of the details, it's not my area, but I can definitely get the gist of them, you know, they are written in a way that you can read them. So yeah, really enjoyed it. Thank you very much. And I look forward to seeing you in your country in a few months. Jean-Rémi King Yep. Likewise. I’ll see you in Marseille. Stephen Wilson Take care. Bye. Okay, well, that's it for episode 27. Thank you very much Jean-Rémi for coming on the podcast. I've linked the papers we discussed in the show notes and on the podcast website at langneurosci.org/podcast. I'd like to thank Marcia Petyt for transcribing this episode, and the Journal of Neurobiology of Language for generously supporting some of the costs of transcription. See you next time.