Stephen Wilson 0:06 Welcome to Episode 16 of the Language Neuroscience Podcast. I'm Stephen Wilson. My guest today is Cory Shain. Cory is a postdoc in Brain and Cognitive Sciences at MIT, having recently completed his PhD in linguistics at Ohio State. We're going to talk about a very interesting preprint he posted a few weeks ago, entitled 'Robust effects of working memory demand during naturalistic language comprehension in language-selective cortex'. Cory's co-authors on this paper are Idan Blank, Ev Fedorenko, Ted Gibson, and William Schuler. I'm excited to talk about this paper with Cory, because I think it sheds light on a very long-standing debate about the nature of working memory for language. Is it linguistically specific, or is it domain general? This paper also bears on another long-standing discussion about the relative explanatory value of working memory as opposed to surprisal-based metrics of language processing demand. By the way, if you're listening to the podcast right after it comes out, you may have time to catch Cory presenting this work at the Society for the Neurobiology of Language conference. He's presenting in Slide Slam Session B, on Tuesday afternoon American time, which will be Tuesday evening in the UK and Europe, and Wednesday morning in Australia. Okay, let's get to it. Hi, Cory. Welcome to the Language Neuroscience Podcast. Cory Shain 1:20 Hi Stephen, thanks a lot for having me. It's such a treat. Stephen Wilson 1:23 Yeah, I'm really glad that you agreed to come on and talk about this paper that you've just come out with. I read the preprint and I thought it was really interesting, and thought that a lot of people would be interested in it. So yeah. Cory Shain 1:36 Absolutely. I'm excited to talk about it. Stephen Wilson 1:38 Yeah, so before we get to this paper, let's talk a little bit about what brought you into this line of work. I kind of looked at your nice website, and I see that you started out in linguistics at Ohio State. You got a master's degree a while back, like 2009. Cory Shain 1:55 Right. Stephen Wilson 1:56 And it looks like you basically did some descriptive linguistics back then. Cory Shain 2:01 Yeah, that's right. So I definitely came here from a language angle, as a language person. I majored in linguistics as an undergrad because I was really interested in language structure. And yeah, for my honors thesis, and I then did kind of like a combined bachelor's master's program, I looked at Paraguayan Guarani and tried to characterize the distribution of a particular grammatical phenomenon that occurs in that language, with my advisor at the time, Judith Tonhauser. Yeah, so it was very much kind of classical linguistics, and I finished that back in 2009. Then left academia, went to work for a while and did a number of things, but yeah. Stephen Wilson 2:55 Yeah, what kind of work did you do after that? If you don't mind me asking? Cory Shain 2:59 Sure. Um, yeah. So right after, I worked for an NGO, like a nonprofit organization that does, like, literacy development in languages that lack writing systems. So my wife, Rachel, who is also linguistically trained, did the same undergraduate program that I did, and I went to live in various places. Our end goal was to work in West Africa, which is French speaking, and so we spent some time in France, and then went to West Africa.
The goal was the Dogon languages, which is this very interesting kind of dialect continuum of languages spoken in the country of Mali: to kind of study the structure and make proposals about how this might be encoded in writing, and develop mother tongue literacy programs. Stephen Wilson 4:04 How cool! Cory Shain 4:05 Yeah. It didn't end up working out. But that was the, that was the plan. There were kind of political circumstances in Mali shortly after our arrival that kind of cut that short. Stephen Wilson 4:15 Right. Cory Shain 4:16 So we sort of wound up in various other places, Burkina Faso, Cameroon, and I did end up spending some time working on a Cameroonian language that still interests me and that I've been very slowly working on with some collaborators, a Cameroonian linguist by the name of Sammy Mbipite Tchele and a linguist at Ohio State, Furin Kerline, on some kind of like theoretical linguistic work based on findings from that experience. Stephen Wilson 4:47 Yeah, I believe the language is called Iyasa. Is that how you say it? Cory Shain 4:51 That's right. Yeah. It is spoken by about 3000 people in southern Cameroon. Stephen Wilson 4:55 Yeah, and you've got a lot of, you've got kind of some lexical information on your GitHub, you've got a database of a dictionary in progress? Cory Shain 5:05 Yeah, it's in progress, but stalled, I suppose. So yeah, the materials that I collected and annotated from my time working with Sammy, I've released in kind of like an unpublished form on the web. Stephen Wilson 5:19 Yeah. Well, isn't that like how things get published these days? (laughter) Cory Shain 5:23 I guess it's like, it's like a preprint. But, yeah, yeah, we ended up leaving for various reasons after working for them for about five years. Then I came back and worked as an academic advisor in Ohio State University administration, and in particular I was helping undergraduates choose their areas of study, and then in the process I chose mine. Stephen Wilson 5:50 Like, I need more linguistics. Cory Shain 5:53 Yeah. I just, I started, I mean, like, I guess I was always interested in language and computation, so yeah, how language is learned and processed. I didn't really focus on it, so it was like a side interest as an undergraduate, but I really missed my kind of brief graduate school experience while working in the admin world, and eventually just decided to apply to grad school again, with a computational emphasis. Stephen Wilson 6:26 Yeah, I mean, I enjoyed sort of inferring your trajectory from what I could see on the web, because I had a similar kind of path into this field. I also, as an undergrad, did descriptive linguistics and some early work. I worked on an Australian Aboriginal language called Wagiman, that's spoken by even fewer people than speak Iyasa. Cory Shain 6:47 Wow! Stephen Wilson 6:48 And also kind of just found my way gradually into more sort of cognitive neuro questions. Yeah, so in your PhD, obviously, you became, I'd say, like, a computational psycholinguist. Is that what you would call yourself? Cory Shain 7:04 Yeah, that's exactly right. Yeah, so I wasn't planning on working on the neuro end of things at all, that ended up developing later. So yeah, I started working with William Schuler, a computational linguist at OSU, who works on computational models of syntax and semantics.
So yeah, how we compose representations in our minds during incremental word by word sentence processing, to represent the meanings of utterances that we hear. And I also ended up, by the end of my first year, working also in Micha Elsner's lab, who works, or at least in collaboration with me was working, on kind of more the sound system side, so what linguists call P-side stuff, phonetics and phonology, and in particular computational models of how sound systems might be learned by children. So it was really computational and linguistic, and I was in the linguistics department, and then, but one of the projects I ended up working on with William wound up kind of getting us in touch with Ted Gibson from MIT, who's a major influence on theories of the role of working memory in language processing, because that was a topic that we were studying from a behavioral angle, reading times from an eye tracker. And then, indirectly through working with Ted, I ended up getting connected with Ev, and a large fMRI dataset that she was interested in studying with respect to some of these kinds of questions. Stephen Wilson 8:41 Yeah. Cory Shain 8:42 So that's the entry into neuroscience, kind of like, I guess, as another modality for me as a computational linguist, like another testbed for ideas about the computations that underlie language processing. Stephen Wilson 8:57 Right. Ah, yeah, okay. So I was kind of noticing that your papers with my friends, Ev and Idan, Ev Fedorenko and Idan Blank, came before you actually joined them now as a postdoc, right? So I was like, how did this happen? You know, it's not like you did this, you know, computational psycholinguistics PhD, and then moved into neuro. It's like this collaboration arose in the course of your PhD, and now you're kind of, like, working with them at MIT, right? Cory Shain 9:25 Yeah, that's right. I ended up starting to work with them, I think, in my second or third year of my PhD. So we had these kind of like parallel collaborations going on, and she was like my, one of my unofficial thesis advisors, and then yeah, upon graduation, I started working as a postdoc in her lab, which is where I am now. Stephen Wilson 9:45 Cool. Yeah, so let's talk about this particular paper that we're going to focus on today. It's called 'Robust effects of working memory demand during naturalistic language comprehension in language-selective cortex'. And, kind of the way that you lay it out, you put forward these three key questions that your data is going to bear on. Cory Shain 10:10 That's right. Stephen Wilson 10:10 Three key claims that I think are kind of the dominant view in the field, if there is a single field. So firstly, I'm going to sort of, I'll say what I think the claims are and then, what I'm really interested to hear from you is, like, what are the alternatives to these claims? Like, you know, what does the other side say? So, you know, claim one would be that, when we build mental representations of sentences, we do word by word structure building. Second claim would be that this is a computationally costly operation that involves memory. And thirdly, that the memory resources in question are domain general. Is that a fair characterization of your... Cory Shain 10:52 That's right. Yeah, so we've sketched that out. That's not necessarily our position, but we've sketched it out as, like, I guess, as you said, the majority position, if there is one, in the language sciences.
Stephen Wilson 11:03 And so what are the alternatives to those three? Cory Shain 11:07 Right. So, the first point again, that people do rich and detailed word by word syntactic analysis as they are listening to or reading language, has been a dominant assumption of studies of language in the mind for a long time, and has been widely supported by many different studies. But there have been, especially with the recent shift towards naturalistic stimuli, so like, traditionally, psycholinguists would analyze or test psycholinguistic theories by constructing sentences, usually fairly complex sentences, often involving relative clauses in English of different structures, kind of bombarding people with different variants of these and then seeing, like, what the effect sizes are for these structural manipulations. And so, starting in the, I guess, 2000s, some critique started emerging of this approach, which I think has yielded important, valuable insights but may also potentially overestimate the effects of syntax on the measures that we obtain, because it's so different from the normal conditions of language comprehension. So like, when you and I are talking now, there's a set of communicative goals that we share, that motivate our use of language and the degree of attention that we pay to the other person's utterances, that just isn't there if you're just reading a whole bunch of different sentences about John and Mary and different kinds of predicates related to them. Stephen Wilson 12:53 Oh, the reporter that the senator attacked. Cory Shain 12:55 Yeah, that too. (laughter) Stephen Wilson 12:59 We have a really good theory of how you parse sentences about reporters and senators. Cory Shain 13:02 Right. So people started theorizing about whether the things that we were getting in the lab are characteristic of language processing, like representing core, essential computations that we do every day during language processing, or whether they are in part induced by the experimental task, and unless you do naturalistic studies, it's difficult to rule that second possibility out. And so some folks started doing naturalistic studies, primarily of reading times at the beginning, where you have eye tracking or self-paced reading, which is a task where you kind of page through words on a screen by pressing a button. Then you, in both cases, record how long you spend looking at each word as an indicator of how hard that word is to process, and then you use theory-driven regressors of expected comprehension difficulty in order to see whether those effects are there in the time courses. And so there were kind of, like, I guess, surprisingly weak effects of memory in eye tracking, from a study by Demberg and Keller in 2008, that didn't have the same kind of generality that you might expect based on theory, but only emerged when they looked at a restricted class of words. And then some other studies that seemed to maybe even indicate the opposite pattern, like a dissociation from memory demand, where you actually read faster. And then there was, cited in the paper, some kind of parallel work by Stefan Frank, and Morten Christiansen and others, who were challenging the notion that these kinds of detailed syntactic analyses are typical of language comprehension in the first place.
That we might be kind of more approximately representing, I guess, just the information that we need for a given social setting from the language stream, rather than doing detailed word by word structure building. So Frank and Bod 2011 was an important contributor to this discussion, where they showed that a computational model that just takes into account the predictability of one word following the next, without any notion of structure, explains reading times in naturalistic settings just as well as one that does detailed syntactic analysis. Stephen Wilson 15:29 Yeah, that's awkward. Cory Shain 15:31 Yeah. So I think that this is a surprising finding, and in general, I would say that there's been this back and forth on this, so there have been other naturalistic studies that have reported effects of syntactic processing. But the results tend to be more mixed, and certainly less pronounced than they had been using constructed stimuli. But I think it raises important questions about what we're measuring in these different experimental settings. So that, that's point number one. Stephen Wilson 16:02 Yeah, no, that was a really good explanation, actually. Cory Shain 16:05 Okay, um. Yeah, like, what are we really doing? Are we kind of doing, like, rough, gist processing of the things that we hear in everyday speech? Or are we kind of mimicking like a computational parser? Sorry, there's a bit (of sound) in the background. Stephen Wilson 16:24 Yeah, I hear the kid. I am amused. (laughter) Okay, yeah. Do we, like, really parse in the kind of way predicted by linguistic theory, or do we just kind of, like, figure it out and go with the flow, seat of our pants? Cory Shain 16:39 Yeah, um. Stephen Wilson 16:41 And then the second is that, you know, this involves computationally costly memory operations. Cory Shain 16:50 Right. Stephen Wilson 16:50 That's the second claim, and then the alternative is... Cory Shain 16:55 Yeah, so this is an interesting kind of subtle point that arose with work by John Hale and Roger Levy on predictability effects in language processing. So prior to, I guess, 2001, the dominant view was that the fluctuations in comprehension difficulty were driven by working memory demand, and there are basically two kinds of demand that people had talked about. One was the demand involved in retrieving items from memory, as required by processing a word. So for example, if I process, you know, the verb of a sentence, in order to compose the semantics, like the predicate denoted by that verb, with the subject that the verb predicates, then I need to retrieve my representation of that subject from memory, and so that retrieval process may be more or less difficult. So that was one potential source of difficulty and variation that could shed light on the computations that underlie language comprehension, and the other one was storage. So like, I may be able to anticipate that a structure that I'm building, or recognize that a structure that I'm building, is not complete yet, that I need to keep it actively in memory until I get some critical component that will allow me to build it as a single unit in my representation, and kind of close that off. And until I do that, I need to keep that stored. And so these two, these are kind of like the dominant ideas. And then there was this alternative called surprisal theory,
kind of originating with John Hale, and then further developed by Roger Levy, which is that the dominant cost of language processing may actually be allocating activation among the possible interpretations of the unfolding sentence. So this is kind of like a radically different view of what the main work of comprehension is. So like, in surprisal theory, there's kind of an assumed parallel processing model where you have many different possible interpretations of the sentence available at any given time, with varying levels of activation allocated to them according to their probability as a correct interpretation of the sentence. So there's some kind of counterintuitive, like, assumptions. So for example, as I'm reading, like, you know, 'the cat ate the dog', like, there's some potential activation allocated to a totally unrelated interpretation about, you know, like, elephants or whatever, but I want to put more probability mass on stuff that has to do with cats and eating. And the job of the parser is to kind of narrow down this representation from, at the start of the sentence, anything that could be uttered, to the one that most likely corresponds to the person's intention. And so actually, the primary work and the driver of comprehension difficulty under this view is how much reallocation I need to do at a given word. Basically, how much information a word contributes towards my understanding of the entire sentence's meaning. So if I start a sentence with 'the', a lot of sentences start with 'the', that doesn't really narrow my interpretational space that much. But as soon as I hit 'cat', the following word, that narrows it a lot. So I get a lot more information from 'cat', like, I've really narrowed the topic of conversation from the space of anything I could talk about to this one particular, like, small mammal, and so that is going to have more information, and information is quantified as surprisal, the negative log probability of a word in context. And so under this view, representing 'cat' in memory, like building all these memory structures for a particular structure, has a negligible cost, essentially. Like, the representations are just available to me, and the cost is choosing the right one. Stephen Wilson 20:53 Yeah, I wonder if it's like a little bit like begging the question, because like, wouldn't you still have to kind of have a covert parsing operation in order to, I mean, I can see how surprisal would be easy to quantify empirically, right? But if you think about what would make a surprisal based system work, wouldn't the person still need to be parsing? Cory Shain 21:15 Yeah, surprisal is not opposed to parsing, and especially in the early variants of the theory, parsing was assumed. So like, I can't remember the precise parsing model assumed for, like, the computational results in Levy 2008, but the idea was that, yeah, you have this vast space of richly structured parse trees, and surprisal is derived as a kind of a secondary byproduct of marginalizing over them. So like, basically taking a weighted average of all of their predictions about what the next word is, in order to assign a probability to the word. So, word prediction was kind of like an indirect consequence of this step of interpreting. Stephen Wilson 21:59 Okay.
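[Editor's note: as a concrete illustration of the quantity Cory is describing, here is a minimal sketch of surprisal as the negative log probability of a word in its context. The probabilities below are made up for illustration and are not taken from the paper.]

```python
import math

def surprisal(p_word_given_context):
    """Surprisal in bits: the negative log probability of a word in its context."""
    return -math.log2(p_word_given_context)

# Made-up probabilities echoing the example above: 'the' is a very common way
# to start a sentence, so it carries little information; a specific content
# word like 'cat' narrows the space of interpretations much more.
print(round(surprisal(0.15), 2))   # P('the' | sentence start) = 0.15  -> ~2.74 bits
print(round(surprisal(0.002), 2))  # P('cat' | 'the')          = 0.002 -> ~8.97 bits
```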
So I guess the real rub is that it's not computationally costly, the memory part of it, is that... Cory Shain 22:07 Yeah, so building and representing these structures in memory is not thought to be costly, or like, relative to the cost of reallocating between interpretations. Stephen Wilson 22:16 Got it. Okay. That makes sense. Cool. So then, I think we can move on to the third sort of issue, which is, you know, maybe the dominant view, I don't know if it's the dominant view, but what you put forward as the dominant view, is that these memory resources are domain general. Cory Shain 22:34 Yeah, I agree. I'm not sure if it's the dominant view, because oftentimes, like, the domain specificity of the assumed resources is not made explicit. So it's difficult to say, in some cases, but it's definitely a view that has been advocated, that I think is implicit in many discussions of working memory, even those that aren't making direct reference to the brain. So the question is, working memory is needed in a lot of different domains, not just language, and in a way that is potentially similar to language. So for example, if I'm processing, like, a complex mathematical equation with hierarchical structure, I may need working memory to keep elements in memory and compose them, just like I might with a hierarchically structured sentence of English. So a reasonable assumption is that whatever resource allows us to do one also allows us to do the other. We have this kind of general store of working memory that we can draw on across domains. Stephen Wilson 23:35 Yeah, I think you're right. Like, I think some researchers are quite explicit that they do think it's the same store, and other times, you know, just the fact that, as you said in your paper, if you appeal to the principles of working memory that are derived from other domains, then you're kind of implicitly saying that you think it's a similar kind of construct that's being drawn on in language parsing. Okay, cool. So, you know, we're gonna then move on to the experiment that you do to get at these questions. So do you want to, it's a story listening dataset, you got some functional localizers? Can you tell me about the basic design of the experiment? Cory Shain 24:12 Yeah. So we used materials from the Natural Stories corpus, which is a corpus of 10 different short stories. When they're read, they're typically around five minutes long, and they were originally kind of proposed and used for self-paced reading experiments. So there's kind of like a large online group self-paced reading experiment using these materials, and they were also adapted for use in fMRI, and they were recorded. So they're presented auditorily rather than visually. And yeah, so participants, for the main task, just listened to some number of the stories in the fMRI scanner and we recorded the responses, and this was done, it's a fairly large dataset, it has 78 participants, collected over several years, I think five or six years. So I kind of came in at the tail end of data collection. Stephen Wilson 25:08 That's a great time to arrive. (laughter) Cory Shain 25:13 So that was, like, done, and many other people's labor kind of gave us this really rich and interesting dataset to work with. And because of this, at least we characterize it as naturalistic, in that the materials are contextualized in a way that isolated sentences aren't.
So there's many kind of dimensions of naturalism when you talk about language, and it's still not conversational, it's still passive listening, which is, you know, one particular modality of language use, but probably not the dominant one. But natural in the sense of, I guess, having a clear communicative goal and rich context, like we would expect in ordinary language use. Stephen Wilson 25:59 And the stories contain, like, complex syntactic structure and so on, like, on purpose or incidentally, or... Cory Shain 26:06 Yeah, on purpose. So they started as naturally occurring stories culled from various sources, and this was Richard Futrell, at UC Irvine, who headed this project up. So I wasn't part of it, so I'm not exactly sure how the stories were chosen. But then, yeah, they went through and kind of hand tweaked them. The goal was to overrepresent kind of, like, tricky and infrequent syntactic constructions and lexical items, relative to typical writing, right? I'm honestly not sure how that bore out, kind of statistically, relative to other, like, truly naturally occurring datasets in the end, but that was part of the design. Stephen Wilson 26:54 Cool. So they just went through and took out all the complementizers (laughter) and created garden paths? Cory Shain 27:01 I think, yeah, just like, you know, more object relatives, for example, than you would expect, and yeah, it's just kind of based on known corpus patterns from English. Yeah, trying to, like, overrepresent things that are likely going to be challenging to people, so that we can get kind of like a bigger spread of activation. Stephen Wilson 27:20 Okay, that makes a lot of sense. Okay, so, then, you know, like many, like all of the studies that come from the Fedorenko lab, you then have these functional localizers, and I did talk to Ev on the podcast at the start of this year, but you know, just to kind of make this conversation self-contained, can you kind of tell me, like, the motivation for using a language localizer and then a multiple demand localizer? Can you tell us how you localize those two networks, and what the motivation is for doing that? Cory Shain 27:53 Right. So the goal, the ultimate goal, is to ensure that we are comparing functionally comparable units across individuals in the study. So lots of work by Ev's lab and others has shown a considerable degree of anatomical variability in function, especially for high level functions like language processing. So the particular brain sites, even in an anatomically normalized brain space, that are most engaged in language processing within a particular kind of broad anatomical area, they differ slightly in my brain and in yours.
So if we treat coordinates as comparable, which we implicitly do when we average across them in order to get kind of like group averages, then we're looking not only at how much activity there is in response to a given kind of contrast in brains, but at how much alignment there is spatially between that activity. And if the degree of alignment is different for different kinds of tasks, or in different parts of the brain, then we may end up either masking effects that are real but just kind of have a different spatial distribution in different brains, as well as potentially conflating effects that are distinct, so different functions, that is, coordinates that perform one function in one brain and a different function in another brain, being averaged together and treated as the same thing. So the way that we avoid this is by localizing the most language responsive pieces of cortex in each individual brain within broad anatomical masks, and then averaging the activity in those functionally selected regions within the mask, allowing us to get a comparable measure across brains of how, say, anterior temporal lobe is responding to a particular manipulation. Stephen Wilson 30:19 Right. And um, so in this particular study, you have a written language comprehension localizer, so it's not listening, it's reading sentences versus reading lists of pseudowords? Cory Shain 30:36 Right. Stephen Wilson 30:36 And interestingly, you then use the negative of that contrast to define the multiple demand or MD network, and I guess we're probably going to call it MD network, so let's just remember that, multiple demand. Isn't that weird that you can, like, map out the MD network quite effectively using the negative of language versus pseudoword? Cory Shain 30:58 Um, I don't know if it's weird. Stephen Wilson 31:02 I just think it's a weird thing. I mean, I don't disagree that that contrast identifies the MD network, I just think it's kind of surprising. Cory Shain 31:09 Yeah. I mean, the MD network responds to tasks that are hard. So encoding, representing pseudowords is harder than... Stephen Wilson 31:22 What do the participants have to do with the pseudowords in the localizer? Do they have to do something demanding with them? Cory Shain 31:29 In a subset of cases, we did have a memory probe, and then in others, we didn't. It turns out not to make a difference. So similarly, it kind of makes sense that if you have to remember pseudowords, it's gonna be harder than remembering real words, because you have no long term memory for them. But even in the case of passive comprehension, it still yields the same pattern. Stephen Wilson 31:56 You must admit, that seems a little bit weird. No? I mean, I don't mean that as a critique. I just mean, it's strange that the negative of language, I mean, that, you know, you get that even when they're not asked to do something demanding with them? Cory Shain 32:16 Um, yeah, I guess I haven't thought about it in terms of, like, whether or not it's weird, I just assume that it has to do with encoding difficulty, but like, you just have no representation of this thing. Stephen Wilson 32:32 Yeah. I mean, well, who are the subjects here? I mean, are these MIT students that are going to be, like, inherently trying to do something creative with the pseudowords, or is it just, like, you know, regular Joe Bloggs? Cory Shain 32:42 Many of them are MIT students.
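[Editor's note: a schematic sketch, not from the paper, of the functional-localization logic Cory describes above: within a broad anatomical mask, pick each subject's most localizer-responsive voxels, then average the story-listening signal over just those voxels. The array shapes, the 10% threshold, and the function name are illustrative assumptions.]

```python
import numpy as np

def froi_timecourse(contrast_map, story_bold, mask, top_fraction=0.1):
    """Average one subject's story time course over their top localizer voxels.

    contrast_map: (n_voxels,) localizer contrast (e.g. sentences > pseudowords).
    story_bold:   (n_timepoints, n_voxels) BOLD signal during story listening.
    mask:         (n_voxels,) boolean anatomical parcel (e.g. anterior temporal).
    top_fraction: share of in-mask voxels to keep; 10% is a common choice in
                  this literature, used here purely for illustration.
    """
    in_mask = np.flatnonzero(mask)
    n_keep = max(1, int(round(top_fraction * in_mask.size)))
    # Rank in-mask voxels by the localizer contrast and keep the strongest ones.
    top_voxels = in_mask[np.argsort(contrast_map[in_mask])[-n_keep:]]
    # The functional ROI's time course is the mean signal over those voxels.
    return story_bold[:, top_voxels].mean(axis=1)

# Toy usage with random numbers standing in for one subject and one parcel:
rng = np.random.default_rng(0)
timecourse = froi_timecourse(rng.normal(size=500),
                             rng.normal(size=(300, 500)),
                             rng.random(500) < 0.2)
print(timecourse.shape)  # (300,)
```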
Stephen Wilson 32:43 Okay, well that could explain it, yeah. Okay, well then I don't know if you can replicate this outside of Cambridge, Massachusetts. (Laughter) Cory Shain 32:53 That's possible, so I will emphasize that our finding doesn't critically rely on use of this localizer. Stephen Wilson 32:58 Yeah, I know. I mean, you do an alternative analysis, where you have a spatial working memory based localizer, right? And you see essentially the same thing. So yeah, I just mentioned this as a curiosity. Cory Shain 33:10 Yeah, and the use of the flipped localizer is based on prior work that shows a really strong alignment between the voxels that are picked up by the flipped language localizer and the voxels that are picked up by a specifically MD oriented localizer. I will also say, because I'm not sure if we've fully, like, fleshed out why we're looking at MD or what it is: so this is the language network versus the MD network, and the use of functional localization is important for this third point that we brought up earlier about the domain specificity of language processing, and working memory for language. So the multiple demand network is this broad frontoparietal network that is implicated in all sorts of difficult tasks, across input modalities and task designs. So it seems very broadly domain general, and one really good way of localizing it is using working memory. So it seems to support working memory, there's quite a bit of evidence for this. So we selected it as a kind of comparison case, because it seems to us the most likely home for domain general working memory, and therefore working memory demands in language processing should register there if domain general resources are involved. Stephen Wilson 34:23 Great. Yeah, I think it's good that you clarified that, because that's a pretty critical part of your data. Cory Shain 34:28 Right. Stephen Wilson 34:31 Cool. So, okay, so you've got the language network mapped, you've got the MD network mapped, in individuals, you've got these stories that everybody's listened to, so they've got, like, time-varying, you know, differences in linguistic demand of different sorts, and so what you end up doing is fitting models to the signal in these two networks as a function of many different explanatory variables. Can you tell me about those explanatory variables that go into those models? I mean, especially the critical ones. Cory Shain 35:07 Yes. So yeah, there's many controls, actually, and many different potential explanatory variables. So I'll start with the controls. There are other things besides working memory demand that drive activity in language regions that have been established by prior work. And so we followed some earlier studies in our lab in kind of selecting among those, including things like the word's relative frequency, and the average sound power, as a kind of like a control to make sure that we're not getting low level auditory effects, which I think are quite weak. But then... Stephen Wilson 35:49 They'd be weak in language regions, but they wouldn't be weak in primary auditory cortex. (Laughter) Cory Shain 35:55 Yeah, sorry, weak in the regions that we're looking at. Stephen Wilson 35:58 Right. Cory Shain 35:59 No, they are quite strong in auditory cortex.
Right, but all I was saying was that the localizer has worked to really focus our analyses in on areas that are responsible for high level language processing, rather than auditory processing. Stephen Wilson 36:14 For sure. Cory Shain 36:16 Yep, so the kind of important controls, or like the ones that we were most interested in in the study, are related to surprisal theory that I talked about earlier. Because this, I think, is one of the big potential, like, players in the mixed results that we've seen, with working memory effects in some experiments and not in others. Because in a lot of cases the surprisal is not controlled for, and one of the big kind of arguments at the outset of surprisal theory was that there's a strong relationship between how demanding many words are thought to be in terms of memory and how predictable they are from context, in part because the dependencies that are thought to be constructed by memory also influence the predictability of words. So if you want to claim kind of like costly working memory operations as a core component of language comprehension in typical settings, it's important to use naturalistic stimuli, and important to control for surprisal as an alternative explanation. So we used a couple of different measures of surprisal that had been shown to be strongly predictive of language network activation in a prior study. That is, 5-gram surprisal, this is just like a string level measure of how likely a word is to occur given the four words that preceded it, estimated from very large corpora, as well as probabilistic context free grammar surprisal, so surprisal estimated by a system that is only able to generate word predictabilities by averaging over the hypothetical parse trees that could be used to analyze the sentence. Yeah, so those are the ones we supported based on the prior study, and then we additionally included, to kind of like really make the case airtight, a very strong deep neural network surprisal, estimated by a powerful recurrent neural network developed by van Schijndel and Linzen, 2018, with some particularly cognitively motivated components related to kind of like adapting to the local statistics of a given conversation. And, yep, so all three of these different variants of surprisal were included, and so in order for us to kind of, like, say that working memory effects are there, those effects needed to explain patterns over and above patterns explained by any of these controls. Stephen Wilson 38:54 Right, and you've got the most, sort of, up to date, best surprisal measures from the literature, in order to explain as much variance as possible with that approach, and then you're going to see what you can do above and beyond that. Cory Shain 39:08 That's like the general idea. I will say, because language modeling is a very quickly moving field, this is a constantly moving target. So we're using a model from 2018 that has already been, like, superseded, it's actually a pre-BERT model. So there's other ways that we could have cashed out surprisal. This one has been kind of cognitively evaluated in a way that many of the other kind of like large scale industry models haven't been. But yeah, the general idea is to give as much to surprisal as we can before kind of like attributing effects to working memory. Stephen Wilson 39:42 Cool.
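[Editor's note: a toy sketch, not the paper's implementation, of the 5-gram surprisal idea just described: the predictability of a word given the four words that preceded it, estimated from corpus counts. Real n-gram language models are trained on very large corpora and use smoothing and backoff; the function and corpus below are made up for illustration.]

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def fivegram_surprisal(corpus_tokens, context4, word):
    """Maximum-likelihood 5-gram surprisal: -log2 P(word | previous four words).

    Bare-bones illustration only; no smoothing, backoff, or large training corpus.
    """
    five = Counter(ngrams(corpus_tokens, 5))
    four = Counter(ngrams(corpus_tokens, 4))
    p = five[tuple(context4) + (word,)] / max(four[tuple(context4)], 1)
    return float('inf') if p == 0 else -math.log2(p)

# Toy usage with a tiny made-up "corpus":
corpus = "the cat sat on the mat and the cat sat on the rug".split()
print(fivegram_surprisal(corpus, ["cat", "sat", "on", "the"], "mat"))  # 1.0 bit (1 of 2 continuations)
```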
So then, you know, in each model, you have predictions that are made by various working memory models, and you investigate three different working memory models, the most successful of which, not to steal your thunder, is the dependency, what's it called, dependency locality... Cory Shain 40:02 Dependency locality theory. Stephen Wilson 40:03 Theory, from Ted Gibson, your co-author on this paper, one of your co-authors on this paper. And it was interesting to me reading your paper, because this was the most fleshed out part of the whole thing. Like, you know, I'd say half of the words, half of the length of this paper, is in the detailed descriptions of these three different parsing slash memory storage, maintenance, retrieval, integration models. So, yeah, can you tell us, in kind of like the way you might tell your grandmother, unless she happens to be like a pioneer of psycholinguistics! (Laughter) You know, what is the gist of these three different parsing models that you investigated? Cory Shain 40:42 Yeah, that's a great question. And yeah, you're right. There's a lot of, I mean, like I said, I come from a psycholinguistics background. That was very clear in reading this paper. (Laughter) This is a bit of a kind of a community internal conversation that's been going on for several decades. But yeah, so the basic idea I already kind of fleshed out, and that is that it's costly to retrieve stuff from memory and it's costly to store stuff in memory. But the particular predicted costs, and the mechanisms that underlie those costs, are different according to different theories of language processing, and of memory in language processing. So, dependency locality theory is one of the earliest kind of, I would say, broad coverage theories of memory use during language processing. And by broad coverage, I mean it gives you like a word by word measure, like it's not restricted to a particular critical region, but it actually is a computational theory of, like, what memory is doing at each word in the parsing process. And so it takes a look at the dependency structures among words. This is somewhat different from the phrase structures that I've talked about before in the context of, like, probabilistic context free grammars, for example, in that dependency locality isn't really so much concerned with building syntactic phrases as it is with relating words to each other according to the syntactic structures. So like the example I gave earlier about relating the subject of a verb to the verb itself, that's a syntactic dependency. That is a syntactic and semantic dependency. And so the idea is that you have to build it incrementally, because you haven't seen both ends, both endpoints of the arrow. So in dependency locality theory, when you hit the subject, you've got to store that, like, as a thing that will be needed later to build a dependency. So there's a storage cost associated with that incomplete dependency. And then when you hit the verb, you have an integration cost associated with going back to find the correct subject. And the storage cost is just like a linear count of, like, how many dependencies can be basically unambiguously predicted to be incomplete on the basis of what I've already seen in the sentence. And then the integration cost is a function of how many potential retrieval targets intervened between the retrieval cue, say the verb, and the subject of the verb.
So basically, as more linguistic material intervenes in dependencies, the harder the dependencies become to compute, because the harder it is to retrieve the verb or the antecedent, I guess, of the dependency. Okay. There's different kinds of ways that you can implement these costs; we considered several different implementations and settled on one that is slightly different from the originally proposed theory, but I think is motivated. The second one we considered is ACT-R, Adaptive Control of Thought-Rational, which is a very broad theory of kind of basically, like, cognitive control in the mind, developed by Anderson, but that was applied by Lewis and Vasishth in 2005 to language processing, using constructs from the ACT-R theory to kind of formalize a mechanism for language comprehension. And it has a lot of similarities with dependency locality theory, in that you have things in memory that need to be retrieved. The critical difference is, like, how difficult things are to retrieve. So, for example, some things may have been, kind of like, if you built a dependency to, say, say you observed a word a while ago, but then you recently built a dependency to that word, or had to retrieve that word, then that word will have been reactivated. So it would be more accessible than you would expect according to dependency locality theory, if you were only looking at kind of like linear distance between the words. So the expected integration costs differ, and there's also, like, a continuous time decay function assumed by ACT-R, where stuff in memory just decays with time, and that's going to govern how difficult it is to retrieve; more activated things are easier to retrieve. And then the ACT-R parsing framework doesn't assume any storage cost at all. There's just, like, this one pool of distributed associative memory, where everything is assigned kind of like a content based key in memory, and retrieval is just a matter of, like, keying into that pool and getting out the best fitting item in memory, or set of items. And if you have a lot of things that have been pushed into that pool, which has kind of like a finite dimensionality of its key, then there's gonna be more collisions, and therefore more difficulty with retrieval. But you don't have to actively maintain anything in memory, according to this theory, and so you wouldn't get storage costs. Stephen Wilson 41:35 Okay. Cory Shain 41:46 And then there's another variant, left corner parsing, which, like Lewis and Vasishth, is a more algorithmic level description of what kinds of compositions and operations are involved in processing sentences word by word. And it can be used to derive a lot of different potential memory effects, and we considered many of them in our study, but storage costs and integration costs can all be derived from left corner parsing. But the basic idea of left corner parsing is that you're both trying to recognize the syntactic features of the word that you're seeing, and integrating that kind of bottom up signal with top down expectations about the category that you think you're parsing now. And according to, like, the analysis that you make at each word, you sometimes have to commit multiple kind of distinct fragments of analysis to working memory, and kind of hold them as separate items to be joined later. Stephen Wilson 46:58 Uh huh.
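[Editor's note: a toy sketch of the two DLT quantities Cory describes, a per-word storage cost for dependencies that have been opened but not yet completed, and an integration cost driven by the discourse referents that intervene when a dependency is completed. This is an editorial simplification for illustration, not the exact formulation, or any of the 21 variants, tested in the paper.]

```python
def dlt_costs(words, dependencies, discourse_referents):
    """Toy word-by-word DLT-style costs.

    words: list of tokens.
    dependencies: (dependent_index, head_index) pairs; each dependency is
        treated as 'completed' at the later of its two positions.
    discourse_referents: indices of words that introduce new discourse
        referents (roughly, nouns and verbs), which drive integration cost.

    Returns per-word (storage, integration) lists. A simplification for
    illustration only; the paper evaluates several more careful variants.
    """
    storage = [0] * len(words)
    integration = [0] * len(words)
    for a, b in dependencies:
        start, end = sorted((a, b))
        # Storage: the incomplete dependency is held from its first word
        # up to (but not including) the word that completes it.
        for i in range(start, end):
            storage[i] += 1
        # Integration: cost at the completing word grows with the number
        # of intervening discourse referents.
        integration[end] += sum(1 for i in range(start + 1, end)
                                if i in discourse_referents)
    return storage, integration

# The object-relative example joked about earlier in the conversation:
words = "the reporter that the senator attacked admitted the error".split()
deps = [(1, 6),  # 'reporter' is the subject of 'admitted'
        (1, 5),  # 'reporter' is the object of 'attacked' (relative clause)
        (4, 5)]  # 'senator' is the subject of 'attacked'
referents = {1, 4, 5, 6, 8}  # the nouns and verbs
storage, integration = dlt_costs(words, deps, referents)
print(list(zip(words, storage, integration)))
# Integration cost peaks at 'attacked' and 'admitted', where dependencies complete.
```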
Cory Shain 46:59 That's where memory costs derive from in these theories. So we consider these three broad theories, and then within each one we considered, in some cases, multiple different variants of them; I think there were in total 21 different ways of cashing out theory-driven memory demand. And the reason for doing so many of them was that we weren't really interested in a particular formalism for working memory, but, like, working memory broadly, like, can we find any evidence of it? And so yes, we didn't want to, like, yeah... Stephen Wilson 47:29 Right. Yeah, I mean, your paper wasn't about, like, comparing these three different working memory models or variants. But, you know, in order to find an effect, if there is one, then you want to kind of explore all of the state of the art models and see if any of them are going to work. Cory Shain 47:45 That's basically it, yeah. Stephen Wilson 47:46 So, let's get to the most important part of the whole discussion. You fit the model to the data, to the language network and to the MD network, you get all the surprisal terms in the model, and then you're sort of consecutively putting in predictions derived from the 21, or however many different, I lost count when I was reading the paper, working memory models. So then, okay, so what do you find in these two big networks? Cory Shain 48:15 Yeah, there's kind of like two components to the analysis, because there's so many things. There's an exploratory phase, where we're just looking at, like, do the memory predictors improve model fit in the data that the models are trained on. And so that ends up selecting out the dependency locality predictors as being quite strong descriptors of, like, processing patterns, and really no other predictor showed strong evidence in the exploratory analyses. So we used the kind of the best DLT integration cost measure, as well as the storage cost measure, both of which showed some evidence in exploratory analyses. In the MD network, no variable showed a significant effect in the exploratory analysis. So we basically just kind of, like, set MD aside at that point; it just didn't seem to be responding to any of the measures that we were looking at. Stephen Wilson 49:16 Cool. So I'm going to try and say the same thing in my way. So you found that in the language network, you basically could explain more of the time varying signal while people listen to these narratives by including, above and beyond surprisal, predictions of cost according to the dependency locality theory, which derives from Ted Gibson's work. And that worked much better than the other working memory models that you investigated, and you didn't see anything in the MD network, so in the multiple demand network you didn't see any effect of including this working memory cost measure, which kind of suggests that that's not where it's happening. So, you know, one of your other co-authors, William Schuler, is also, like, involved in the development of one of the other classes of models you investigated, right? Was there any kind of, like, you know, bitter, you know, ill feelings, because one of your co-authors' models ends up fitting the data better than the other's? Cory Shain 50:23 That's a great question. I'm not gonna put words in his mouth.
I mean, I think he would have been happy to find, like, left-corner effects, and if we had, then we would have probably had more work to do to try to, like, tease apart the relationships between the different theories. But, yeah, William is an empiricist and he goes with what the data say. So I'm not exactly sure, he hasn't talked to me about, like, what kinds of model revisions this outcome might require. But one of the things that we allude to as future work in the paper is, like, trying to figure out what is special about the DLT. Again, we didn't directly compare models, so we can't really say the DLT is better, but like, it produces results here, and the other ones don't, and we'd like to better understand that, linguistically. Stephen Wilson 51:13 Cool. Cory Shain 51:13 Because if we did, that could potentially help with providing more algorithmic level models, because, like, a left-corner parser can just kind of specify a procedure, like, a computational procedure by which people understand sentences, in a way that the DLT kind of, like, leaves unspecified. Stephen Wilson 51:26 Right. Okay, so can we talk about how this finding bears on the three questions that you started out with? So, you know, does it provide support to the notion that we engage in word by word structure building, rather than just kind of being more heuristic in our interpretation of incoming words? Cory Shain 51:48 Yeah, that's a good question. So I will say also that we then had a confirmatory component, which I glossed over earlier, where we take the models that were kind of selected during the exploratory phase and, like, test their predictive accuracy on half of the data, which we held out from training, and the effects are significant. So it does generalize to the test data. So yeah, so we do kind of claim the existence of, sorry, at least integration costs, is what it generally sort of comes down to. So integration costs register in the language network, and not in the MD network, over and above surprisal. So I think this outcome, with respect to, like, detailed word by word syntactic analysis, I think our outcome does support that to an extent, at least indirectly, by supporting the existence of effects derived from theories of working memory whose predictions necessarily derive from syntactic structures. So I'm not aware of, like, theories of working memory in language comprehension that are kind of, like, syntax agnostic; like, you kind of have to assume a syntactic representation that's being built and the parsing algorithm that builds it, and only then can you get derived working memory predictors. So the finding of working memory effects supports at least a computational procedure in the mind that follows these syntactic dependencies in the way that the DLT says. Now, it's correlational, so there could be some deeper explanation, kind of like a non syntactic correlate of the DLT, that might explain these away. But right now, I think the non syntactic correlate that would be most likely to do that would be something derived from, like, syntax agnostic surprisal measures, which we included, and it doesn't explain it away. Stephen Wilson 53:43 Right. Cory Shain 53:44 So for the time being, this seems like fairly strong support for... Stephen Wilson 53:46 Yeah. Cory Shain 53:47 detailed word by word analysis. Stephen Wilson 53:49 And that kind of gets to the second, you know, claim too, right?
That this is due to memory operations, rather than just surprisal, because you do have the best surprisal measures that were at hand in the model, so it's explaining above and beyond that. Cory Shain 54:08 Yeah, exactly. So it seems to be the case that structure building may be costly. Stephen Wilson 54:12 And the effect size is quite significant. I mean, okay, so it's not just that it's statistically significant, but, like, you kind of talk about the amount of variance that's being explained, and you do that with reference to this concept of ceiling, of, like, the amount of variance that could potentially have been explained. Can you explain that approach? And then kind of tell us, like, how much of that so-called ceiling of explainable variance were you able to explain? Cory Shain 54:36 Yeah, I honestly don't remember the numbers, so I may not be able to... Stephen Wilson 54:42 Okay, I'll look it up while you talk. You tell me how the ceiling works and I will look. Cory Shain 54:46 The idea is that, like, not all of the brain signal is explainable on the basis of the stimuli alone. There's a lot of other stuff going on in the brain, even in the language network, and the source of that information is not yet clear. So a good way of kind of figuring out how much we could possibly explain of the stimulus driven signal is to look at inter-subject correlations. So, like, how much do the signals in the corresponding regions covary across individuals. So we do this just with a regression model, where we take the average of everybody else's time courses in each region as a regressor to predict the time course in any region of any individual subject. So that gives us a kind of like a ceiling correlation measure, that's kind of the best that we could achieve only attending to the language stimulus. And that's helpful because otherwise it's difficult to interpret these correlations. So, yeah. Stephen Wilson 55:57 Right, okay. Yeah, I mean, this is a really neat concept and, I mean, the first time I read about it, I was like, oh, that's, I mean, which was not in your paper, but I've seen it in other papers, too. Cory Shain 56:06 Yes. Stephen Wilson 56:08 Yeah, but um, you know, it's interesting, it's like, it had never really occurred to me, like, you're never going to be able to explain more than what you could explain by taking the average of other subjects listening to the same narratives. Because, in principle, you know, what you're trying to fit it with doesn't have anything unique to that individual, right? It's like, it's not like, you know, maybe individuals do differ in their working memory cost as a function of time, but you don't have any model of that, right? Your model is not individually specific. So, you know, you've kind of got this inherent ceiling on how much you could explain. Okay, so I'm looking at your table five, and it says, I guess R relative is maybe the ceiling, you've got, like, a one, and then for the training model, you've got 0.647 in the language network, and for the evaluation, it's 0.389. So does that mean that your R value is about 40% of what it could have possibly been in the best possible case? Cory Shain 57:05 Right, exactly. Yeah. So that's where we're not trained on... Stephen Wilson 57:05 out of sample data?
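[Editor's note: a minimal sketch, not the paper's exact procedure, of the inter-subject ceiling idea: relate each subject's regional time course to the average of everyone else's, and treat that as a rough estimate of how much stimulus-driven signal is explainable at all. This version uses a simple leave-one-out correlation; the names and data below are made up for illustration.]

```python
import numpy as np

def isc_ceiling(timecourses):
    """Leave-one-out inter-subject correlation for one region.

    timecourses: (n_subjects, n_timepoints) array of the same region's signal
    while everyone listens to the same stories. For each subject, correlate
    their time course with the mean of everyone else's; the average serves as
    a rough ceiling on how much stimulus-driven variance is explainable.
    """
    corrs = []
    for s in range(timecourses.shape[0]):
        others = np.delete(timecourses, s, axis=0).mean(axis=0)
        corrs.append(np.corrcoef(timecourses[s], others)[0, 1])
    return float(np.mean(corrs))

# Toy usage: a shared stimulus-driven component plus subject-specific noise.
rng = np.random.default_rng(1)
shared = rng.normal(size=600)
data = shared + rng.normal(size=(20, 600))
print(isc_ceiling(data))  # well below 1.0, reflecting unexplainable noise
```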
So the main point is, this is not just like a trivial tiny effect, like, you're explaining a pretty decent chunk of... Cory Shain 57:16 Yeah. Stephen Wilson 57:16 It's very impressive, but not in the MD network. I guess another question I had about the MD, like, you know, if it was general working memory, so, I guess let's first of all just answer your question three, right. So domain specific versus domain general working memory, obviously, your results indicate that this working memory is domain specific, right? Cory Shain 57:40 Yes. Yeah. Stephen Wilson 57:43 To the extent that the MD network is the best possible candidate for domain general effects. And I would ask one question about that, which is, the memory in question, if it was going to be domain general, would be auditory in nature, right? It would be like auditory working memory? Is MD necessarily going to be sensitive to that kind of working memory? I mean, do you know of past literature on that? Cory Shain 58:12 That's a good question. My understanding is that it is. Stephen Wilson 58:22 It seems logical, it seems logical. Yeah. I mean, it seems logical that it should be. But if it's... Cory Shain 58:30 Or at least, like verbal, like verbally mediated working memory tasks, Stephen Wilson 58:33 Yeah. Cory Shain 58:35 will register effects in MD. Stephen Wilson 58:39 Yeah, it's just that, like, you know, Ev has this study from 2013, where she shows these MD effects in like seven different tasks in different domains, but they're all visually presented, and it makes me wonder, like, you know, whether it could go differently in a different sensory modality? Cory Shain 59:02 Yeah, that's a great question. I'm gonna have to not answer because I'm not sure on this one, but, yeah, my understanding is that this has been replicated across sensory modalities, it's not critical for getting these effects in these regions. Stephen Wilson 59:17 Okay. Yeah. That would be the assumption for sure. Cory Shain 59:24 Yeah, we're definitely making that assumption. Stephen Wilson 59:26 Yeah. Cool. So you kind of answered your three questions. Were you surprised, I mean, I think that you maybe were, were you surprised or did it all come out the way you had expected? Cory Shain 59:41 I was surprised to find such strong memory effects over and above surprisal. I guess, like, I don't know, I'm a linguistically trained, like, researcher, and so, like, the idea that these constructs from linguistic theory are just not cognitively present is hard for me to swallow. I like, like, these kinds of more structurally mediated memory models of processing. But then, like, surprisal was just kicking their butts in all sorts of different empirical evaluations. So yeah, I was quite surprised at the strength of memory effects over and above really rigorous surprisal controls. As far as domain specificity... Stephen Wilson 1:00:22 Sorry, I'm just gonna, before you get to that, yeah, I think it's very impressive. And I think that, you know, it really speaks to the quality of, like, psycholinguistic work that you did in this paper, you know, like, the care and sophistication of the way you tried out 21 different variants of three different broad approaches.
I mean, I think if you just came at this naively with, you know, some interpretation of Ted's model, it probably wouldn't work off the shelf. You know, like, I suspect that all that work you did, that makes up, like I said, half the words of the paper... Cory Shain 1:00:55 Yeah. Stephen Wilson 1:00:56 Probably is responsible for its success. Cory Shain 1:00:58 Thanks. That's good to hear. Yeah, no, it was a good deal of work, and um, yeah, I mean, that was the ultimate goal, to kind of make this, I mean, it's impossible, because of how representationally dependent all of these theories are, it's impossible to come up with, like, a theory-neutral measure of working memory demand, and so the best thing that we came up with was just, like, trying as many different theory-driven measures as we could find. Stephen Wilson 1:01:20 Yeah, cool, and so you're gonna tell me why you were surprised about it being in the language network and not the MD network? Cory Shain 1:01:27 Yeah, well, I'm a member of Ev's lab, so I have priors on this. But there's just, like, yeah, like, this accumulation of evidence that most of the computation involved in language processing is located in language selective cortices. So, we had already done this surprisal paper in 2020, where we used the same dataset in a similar kind of regression design to analyze kind of like syntax-sensitive versus syntax non-sensitive measures of surprisal, and these also didn't register in MD, so MD didn't seem to be involved in the predictive processes at play in language comprehension. I think there was stronger theoretical reason to think that it might be involved in working memory. But I guess, yeah, given how difficult it's been to show linguistically driven effects in MD in prior studies, if there was going to be a memory effect, I guess I would have placed my bets on the language network. Stephen Wilson 1:02:30 Yeah. I think I would have too, having read the prior papers from you and your collaborators. Cool. Yeah. So I think we kind of talked about, like, the main gist of the paper. I mean, there's other subtleties in there that readers can delve into. Cory Shain 1:02:51 Yeah. Stephen Wilson 1:02:52 But I think, unless you're deep, deep, deep in computational psycholinguistics, I think we hit on the main points. Yeah? Cory Shain 1:02:59 I agree. Yeah. Yeah. No, this is a great conversation. I think we've covered everything, yeah, that I wanted to say about the paper. Yeah. Stephen Wilson 1:03:06 Well, thank you very much, and I hope we get to catch up in real life at some point. Cory Shain 1:03:13 Yeah. Or at least virtually at SNL, in a couple of days. Oh, yeah. Stephen Wilson 1:03:16 Well, we probably will. But that doesn't, I don't know, that doesn't really fully count, to me at least. Cory Shain 1:03:20 Just like this. (Laughter) Yeah. Cory Shain 1:03:22 Yeah. But I guess for your listeners, if you're interested, I'll be talking about this work there. It will be in Slide Slam Session B. I'll be there. Stephen Wilson 1:03:30 Oh, cool. So it's one of the Slide Slam sessions? Slide Slam Session B? Okay. I'll put that in the intro. And yeah, I guess I'm gonna try and get this out before SNL next week, but that depends on the behavior of my children. But I think it would be good because we talked about the conference. Cory Shain 1:03:49 Yeah, that makes sense.
Yeah, I love that little like qualifier. I need to put that on all the things that I promise today. Stephen Wilson 1:03:59 All right. Well, it's great talking to you. Great meeting you. Cory Shain 1:04:01 Yeah, it's really nice to meet you, too. Thanks so much for inviting me. It was a lot of fun. Stephen Wilson 1:04:04 It was my pleasure. Take care. Cory Shain 1:04:06 Thanks. You too. Stephen Wilson 1:04:07 Bye. Cory Shain 1:04:07 Bye. Stephen Wilson 1:04:08 Okay, well, thanks for listening to Episode 16. As always, I've linked in the paper we discussed as well as Cory's website on the podcast website at langneurosci.org/podcast. I hope to see you all at SNL. Bye for now.