Mapping the neuronal building blocks of human language with language models

Nature

TechnologyNature·Jun 17, 2026·53 min read

Mapping the neuronal building blocks of human language with language models

Main

Language is a fundamental cognitive process unique to humans that allows us to communicate complex and highly diverse meanings. It also enables us to construct new and potentially inexhaustible expressions through grammatical rules that govern how we arrange and combine words and that can generalize across sentences beyond simple repetition. Linguistic studies based on behavioural observations have accurately described how we use the properties of words, such as their grammatical function and syntactic relationships, to construct unique sentences1,2,3. They have also described how we use the hierarchical structure of phrases and sentences to communicate specific ideas and thoughts6. Recent neuroimaging7,8,9,10,11 and electrocorticographic recordings5,12,13,14,15, in turn, have identified broadly distributed regions across the frontotemporal cortex that reliably engage in speech production and sentence construction16,17,18,19 and that differentiate grammatically well-formed sentences from unstructured word lists or sounds, suggesting their sensitivity to syntactic structure20,21. They have also found neural responses that reflect the merger of words into phrases6 and their semantic properties22,23, revealing a broad macroscopic network of cortical areas that could support human language.

However, understanding the microscopic organization and cortical landscape by which linguistic information is encoded by neurons in humans or the cellular processes that underlie natural speech has remained a longstanding challenge. While recent investigations have revealed how the phonetic components of words are encoded by neurons24,25, they do not reveal the cellular processes by which we produce meaningful speech or through which we arrange and combine words into phrases and sentences. To convey meaning through language, for example, humans use abstract grammatical categories (such as adjectives and nouns) and dependencies (such as the relationship between nominal subjects and objects) that can generalize across sentences and that can describe complex relationships such as actions and outcomes. Yet, how such grammatical features are encoded by neurons or whether they generalize across sentences largely remains undefined. Little is also known about whether neurons can represent the higher-order syntactic structure of sentences or how they encode their constituent phrases.

Another prominent question in neurolinguistics is whether syntactic information is dissociable to some degree from that of semantic information, or how these core aspects of language are represented at a cellular scale26,27. Whereas previous imaging studies have suggested that linguistic information is probably represented broadly across the brain28,29, little is also known about how such information is distributed across cortical regions or whether speech processes are lateralized at a basic cellular level. Finally, little is known about how the local activities of individual neurons and their tuning properties relate to those of the populations’ broader field potential patterns, or what their local organization within the cortex may be.

Here we performed single-neuronal recordings across the human frontotemporal cortex and tracked their action potential (AP) and local field potential (LFP) activities over long-term durations as participants produced natural speech. Using speech tracking, parsing, modelling and decoding techniques, we describe the detailed organization and encoding properties of the neurons. We also describe their distribution, regionality and lateralization, together providing a detailed examination of linguistic information encoding during language production at a combined micro (cellular), meso (local population) and macro (regional) scale.

Recording neurons during natural speech

Recordings were obtained from neurons across the human frontotemporal cortex using semichronically implanted microelectrode arrays (96 channel configuration)30,31,32. These arrays were implanted as part of planned neurosurgical care for epilepsy monitoring30,33,34 and were located in the frontal35,36, anterior temporal20,37 and posterior temporal38,39,40 regions that have been shown to reliably engage in speech production and sentence construction41,42,43 and to display robust language-selective responses41,42,44 by validated language localizers28,41,45 (Fig. 1a). They were also placed in participants who were awake, after implantation, and were therefore capable of producing natural speech, together providing a rare opportunity to study the activities of individual neurons during natural language production (see the ‘Microelectrode recordings’ section of the Methods). In total, we recorded from 579 putative neurons in 8 participants (3 female, 5 male, aged 27 to 52 years) and 14 sessions and only well-isolated single units with stable waveform morphologies consistent with those of cortical neurons were used (Fig. 1b and Extended Data Fig. 1a).

During recordings, the participants produced various phrases and sentences during natural speech42,43,46,47,48 (see the ‘Language production and audio recordings’ section of the Methods). The sentences were also constructed de novo during natural language production (rather than simply read or repeated) and differed broadly in context and structure (Fig. 1c). For example, the participant may construct a declarative sentence such as “They came home for the weekend” or they may produce a sentence with a subject–verb inversion, such as “When is the payment due?” (Extended Data Table 1a; see the ‘Language production and audio recordings’ section of the Methods). Together, the participants produced 10,460 words across 1,895 sentences, for an average of 747 ± 168 (mean ± s.e.m.) words and 135 ± 28 sentences per neuron per session. All words produced by the participants were transcribed using a semiautomated platform and were aligned to each neuron’s AP activity (see the ‘Language production and audio recordings’ section of the Methods). The participants were English speakers and displayed normal language function on preoperative testing (see the ‘Participants’ section of the Methods).

Linguistic representations by neurons

Neurons within the population responded selectively to the linguistic properties of the words. Natural language processing models such as context-sensitive constituency parsers allow for some of the core linguistic properties of words to be reliably labelled and tracked within natural speech1,2,3,49. Thus, for example, whereas the word’s part of speech reflects its grammatical properties (such as an adjective), the sentential constituents reflect how the word or group of words functions as a grammatical unit (for example, within a noun phrase). Furthermore, whereas the word’s constituency depth describes its hierarchical syntactic relationship to other words in a sentence, its ordinal position describes its rank order (Fig. 1d,e, Extended Data Fig. 1a and Extended Data Table 1b). By tracking the neurons’ AP activities in relation to each planned word (pre-articulation period, −400 ms to −100 ms from word utterance; see the ‘Single-neuronal analysis’ section of the Methods) and by using a constituency parsing approach (see the ‘Sentence parsing and linguistic feature labelling’ section of the Methods), we find that 9.2% (n = 53) of the neurons responded selectively to the words’ parts of speech (one-sided Mann–Whitney U-test, α = 0.05, P values were adjusted with false discovery rate (FDR) correction for multiple feature comparisons; Figs. 1f and 2a–c)—for example, transiently increasing their firing rates when the upcoming word was a modal verb (Fig. 1f and Extended Data Fig. 1b). By contrast, other neurons responded selectively to the sentence’s constituents (16.2%, n = 94), whereas 15.9% (n = 92) responded selectively to the word’s constituency depth (distance from the root of the constituency tree) and 11.1% (n = 64) to closing level (phrase mergers). Together, these proportions of neurons were significantly higher than expected from chance (χ2 test, P < 0.05) and remained consistent when randomly subsampling words (bootstrapping test, n = 100, P > 0.05 for all features; Extended Data Fig. 1c). They were also consistent when tested across statistical techniques such as multiple regression analyses and sequential feature selection that accounted for collinearities between features (χ2 test, P > 0.05 with Bonferroni correction across comparisons; Extended Data Table 1c), together suggesting that these neurons reflected the linguistic properties of the words.

These linguistic representations by neurons were also robust. Using an L2 regularized balanced multinomial logistic regression classifier to predict the linguistic properties of words not used for model training (70:30 split into training and testing; see the ‘Decoding linguistic features from neural activity’ section of the Methods), we find that all of the linguistic features that were tested could be reliably decoded from the neurons’ activity patterns on a word-by-word level (one-sided permutation test, P < 0.05; Fig. 2d). Thus, for example, the activities of the neurons could accurately predict the constituents to which the words belonged when tested on randomly selected words or sentences not used for model training. The highest single-word decoding accuracies were observed for the words’ constituency depth (31.5% observed versus 14.3% chance; one-sided permutation test, P < 10−3, n = 1,000), followed by closing level (27.1% observed versus 11.1%, P < 10−3), constituents (25.1% observed versus 10.0% chance; P = 0.005) and parts of speech (18.4% observed versus 10.0% chance, P = 0.031), with slightly higher decoding performances being found when using observation-averaged activity patterns (Extended Data Fig. 2a,b). The linguistic properties of the words could therefore be reliably decoded from the activity patterns of the neurons.

Finally, given the above findings, we examined whether these linguistic representations by neurons were generalizable. By randomly sampling cells from the recorded population and by accounting for their sparse spiking using observation-averaged decoding, we find that decoding performances were similarly significant across all tested features (one-sided permutation test, P < 0.001) and that decoding performances consistently plateaued as the number of tested neurons increased (exponential fit; Extended Data Fig. 2c). We also find similar observation-averaged decoding performances when comparing neuronal activities across the different participants and recording sessions (Friedman test, P = 0.08) as well as similarly significant decoding when examined across all tested features (92% and 86% of all sessions and all participants, respectively; one-sided permutation test, α = 0.05; Extended Data Fig. 2d,e). Lastly, using a natural language processing model to further cluster the words based on variations in their sentence contents (see the ‘Generalizability of linguistic representations’ section of the Methods), we find largely consistent decoding performances for all linguistic features (multivariate linear model, P > 0.05; Extended Data Fig. 2f), suggesting that these linguistic representations by the neurons were not only robust but also generalizable.

Specificity of linguistic encoding

In contrast to constituency parsers, whereby a sentence is decomposed into constituent units, dependency parsers principally describe the words’ grammatical relationships, which grow hierarchically based on the relationship between the head and dependent words (Fig. 2b and Extended Data Table 1d). Thus, using a dependency parsing approach (see the ‘Sentence parsing and linguistic feature labelling’ section of the Methods) to further test the specificity of neuronal responses, we find that 10.2% (n = 59) of the neurons responded selectively to the grammatical relationships (dependencies) between the words—for example, changing their activities based on whether a planned word will be the direct object or nominal subject of a governing word. Furthermore, 12.1% (n = 70) of the neurons responded selectively to the word’s dependency depth (Fig. 2c and Extended Data Table 1c), reflecting the hierarchical relationship between the head and dependent word. Overall, both the dependency relationship and depth could be reliably decoded from the neurons’ response patterns (23.2% and 27.2% observed single-word decoding versus 10.0% and 11.1% for chance, respectively; one-sided permutation test, P < 0.05; Fig. 2d), suggesting that these features were represented reliably.

Given the above findings, it is possible that the neurons’ responses may be partly explained by overlapping word features—for example, a word can be both a noun and belong to a noun phrase. However, when we evaluated the specificity of the neurons across all possible feature pairs, we found that only 2.2% of them displayed an overlap. Among all the pairs, only parts of speech and dependencies displayed a significant overlap across tests of independence (2.6% of neurons; two-sided permutation test, P = 0.028; χ2 test, χ21 = 18.8, P = 4.1 × 10−4; Fisher’s exact test, odds ratio = 4.32, P = 2.1 × 10−3; Bonferroni corrected for multiple comparisons). We also found little correlation in the degree to which the neurons were modulated by parts of speech and dependencies when compared on a neuron-by-neuron basis (linear regression test, P = 0.18), further confirming that their responses were specific.

These neurons also appeared to display a high degree of specificity for features that were closely related. For example, even though the words’ constituency and dependency depths are often highly related, both of which increase with sentence progression, we found little overlap between cells that encoded constituency and dependency depths (one-sided permutation test, n = 1 × 105, P = 0.31 (constituency and dependency depth), P = 0.14 (constituent and dependence relationship); Fig. 2c,e). There was also little overlap between neurons that responded selectively to either the constituency depth or those that simply reflected the words’ ordinal positions (one-sided permutation test, n = 1 × 105, P = 0.33), which also grows with sentence progression, suggesting that these linguistic features were distinguishable at the cellular level.

Finally, we examined whether other lower-order aspects of speech, such as the sound-spectral properties of words, could have partly explained these neuronal responses. However, when tracking the words’ acoustic features, we find that, of the neurons that responded to variations in word pitch (12%, n = 72 out of 579 neurons), only a few (≤2.2%) overlapped with any of the neurons that displayed linguistic selectivity (two-sided permutation test, P > 0.1 across all features; Fig. 2a). We also found little difference in the proportions of neurons that displayed linguistic selectivity when considering other related features such as word frequency, length and surprisal (multiple regressions analysis, with added features; see the ‘Single-neuronal analysis’ section of the Methods; Extended Data Table 1c), suggesting that the neurons’ responses were indeed robust to variations in these lower-order speech properties.

Combinatorial linguistic encoding

Although the above findings identified neurons that reliably reflected the words’ linguistic properties, it remained unclear whether and to what degree they encoded their specific combinations. For example, even though the word ‘dog’ in the sentences ‘The dog ate the bone’ and ‘The man walked the dog’ is the same, the specific combination of linguistic features that describe it differs (although the word ‘dog’ is a noun in both sentences, it is a nominal subject in the former and an object in the latter). Similarly, although ‘dog’ and ‘bone’ are both nouns, they hold broadly different semantic meanings1,2,3,49.

Therefore, to further study the combinatorial encoding properties of the cell population, we constructed embedding models that could simultaneously capture these features and whereby each linguistic feature (for example, parts of speech, dependencies) was represented by a unitary value (Fig. 3a). Thus, for example, when producing the word ‘dog’ in ‘The dog ate the bone’, these embeddings would reflect not only the part of speech to which the word ‘dog’ belongs but also its constituency depth and serial order as a one-hot vectoral projection (see the ‘Syntactic and semantic embedding models’ section of the Methods). To further account for the semantic properties of the words, we also provided vectoral projections of the words’ meanings based on Word2Vec embeddings27. Lastly, to allow for valid comparison and to limit the possibility of overfitting, we dimensionally reduced the models’ embeddings using a principal component analysis to calculate their predictivity—a measure of how well the models’ embeddings can reliably predict the observed neuronal activity patterns (see the ‘Language model predictivity’ section of the Methods).

Here we find that the activities of cells within these populations could be reliably predicted by the specific combinations of linguistic features that defined the words (one-sided permutation test; P < 0.05 for all tested models), with the highest mean predictivity observed for models that included both syntactic and semantic information (mean predictivity = 0.015 ± 0.01 (mean ± s.e.m.); Fig. 3b). However, we also find that most individual neurons preferentially encoded either syntactic information or semantic information alone, with only 1.8% of the cells encoding both syntactic and semantic information for the features tested (binomial test with Bonferroni correction for multiple comparisons, P > 0.05 for all features). Thus, while many of the neurons were preferentially tuned to respond to specific linguistic features, they also reliably encoded information about the combinations of linguistic features that uniquely defined the words.

Finally, beyond encoding syntactic and semantic information about the words, these cell populations were also informative of the words’ specific sentence contexts. Here, using large language models24,25 (Vicuna; Extended Data Table 2; see the ‘Higher-order contextual embedding models’ section of the Methods) that could capture the sentences’ contexts (for example, ‘bark’ within ‘The dog has a loud bark’ versus ‘The tree has thick bark’), we find that the contextual model was more predictive of the neurons’ activities than models that simply included the words’ syntactic or semantic features (population predictivity, 0.05 ± 0.01 (mean ± s.e.m.); one-sided permutation test P < 0.05, n = 105; Fig. 3b). Notably, models that incorporated progressively longer context windows (from 2 to 10 words preceding the spoken word) were increasingly predictive of neuronal activity (plateauing at approximately 5 words; exponential fit; Fig. 3c), By contrast, predictivity was at chance when randomly shuffling the sentences in relation to neuronal activity (population predictivity at −1,000 ms = −0.01; one-sided permutation test, P < 10−5, n = 105; Fig. 3d) or when replacing the words spoken with randomly selected tokens (selecting an entirely new set of words; population predictivity = −0.03; one-sided permutation test, P < 10−5, n = 105; Fig. 3d), suggesting that the activities of these neurons reflected the words’ contexts. We also find that the predictivities for the neurons remained consistently high well before word onset (population and single-neuronal predictivity, one-sided permutation test, P < 1 × 10−5, n = 105; range −2,000 to 2,000 ms, tested in 100 ms increments; Fig. 3e), peaking approximately 1 s before utterance, and displayed a close temporal relationship between the neurons’ peak predictivities, their timings, model layers and linguistic selectivity (regression model, P = 0.02; Fig. 3f, Extended Data Figs. 3–6 and Supplementary Video 1). Together, these cell populations therefore appeared to provide richly detailed representations of the syntactic, semantic and contextual properties of the words before utterance.

Diversity and distribution of neurons

Collectively, these neurons were distributed broadly across the recorded cortical areas and hemispheres. Neuronal recordings were made here from frontal35,36, anterior temporal20,37 and posterior temporal38,39,40 regions shown to reliably engage in speech production28,41,42,43 and to display robust language-selective responses41,42,44 by validated localizers28,41,45 (Fig. 1a). By tracking their encoding properties, we find that the proportions of neurons that responded selectively to at least one or more features (289 out of 579) were similar in the frontal, anterior temporal and posterior temporal regions (χ2 test, χ22 = 0.87, P = 0.65). They were also similar when comparing the left to right hemisphere (χ2 test, χ21 = 0.51, P = 0.47; Fig. 4a), and there was little difference in the distribution of neurons based on the particular linguistic features tested (χ2 test, P > 0.05; aside from dependencies and constituents; Fig. 4a), suggesting that they were widely distributed.

However, the strength of modulation and the degree to which the neurons were informative of the linguistic features preferentially lateralized to the left hemisphere and differed across cortical regions. Specifically, when we examined the degree of modulation, which reflects the extent to which each neuron changes its activity in response to specific features (their z-scored change in activity; see the ‘Single-neuronal analysis’ section of the Methods), we found that neurons in the left hemisphere displayed significantly stronger modulation than neurons in the right hemisphere (n = 579; rank-sum test, z = 8.06, P = 7.7 × 10−16), including to lower-order word features such as pitch (rank-sum test, z = 4.53, P = 5.8 × 10−6; Fig. 4b (top) and Extended Data Fig. 7a,b). This difference between hemispheres was consistent across cortical areas (rank-sum test, P < 3 × 10−6), with the posterior temporal cortex exhibiting the most significant difference (rank-sum test, z = 8.32, P = 9.0 × 10−17). Furthermore, when compared across cortical areas in both hemispheres (Kruskal–Wallis test, H2 = 33.15, P = 6.3 × 10−8), the strongest modulation was found in the prefrontal cortex (rank-sum test, P < 0.05; for all features except for closing level; Fig. 4b (bottom) and Extended Data Fig. 7c). Neural predictivities from the contextual embeddings were also highest in the left compared to right hemisphere (rank-sum test, z = 3.3, P = 9.5 × 10−4) but highest in the anterior temporal cortex (Kruskal–Wallis test, H2 = 616, P = 1.6 × 10−134; Extended Data Fig. 7d,e). Thus, even though these neurons were distributed broadly across cortical sites, the strength of modulation and the degree to which they could effectively encode linguistic information appeared to be regionalized and preferentially lateralized to the left hemisphere.

Local cortical organization

Finally, we evaluated how the activities of these neurons related to their broader field potential patterns at the local cortical level50. Whereas APs reflect the activities of individual neurons (the basic computational units by which information is encoded), LFPs predominantly reflect the broader synchronous activity of the local neural population, including their synaptic and subthreshold membrane potentials. By tracking their corresponding activities on a site-specific level (with each electrode measuring less than 100 µm in diameter and approximately 5 µm at its tip), we find that many of the sites displayed linguistic selectivity based on their recorded LFPs. Specifically, 23% of the sites (n = 303 of the 1,344 electrode contacts across all arrays) displayed linguistic selectivity, showing consistent changes in LFPs to specific linguistic features (for example, parts of speech; Fig. 5). However, we also find that the proportions of LFP sites that displayed selectivity were significantly lower than the proportion of neurons that displayed selectivity (50%; n = 289 out of 579 across all linguistic features; χ2 contingency test, χ21 = 141.0, P = 1.6 × 10−32) and that most sites in which the LFPs were selective contained no selective neurons (233 out of 303 channels; χ2 contingency test, χ21 = 2.3, P = 0.13). Moreover, only 19% of the electrode sites contained both neurons and LFPs that encoded the same linguistic features (one-sided permutation test, P = 0.33, n = 1,000, Extended Data Fig. 8). Thus, while many of the recorded sites contained LFP signals that were selective, the linguistic features encoded were often distinct from those encoded by the individual neurons when examined at the μm level.

Compared with their local field activities, the neurons also displayed a significantly stronger modulation (defined as z-scored differences in activity; two-sided permutation test, P < 10−4) and decoding performances for many of the neurons were significantly higher despite their sparse spiking activities (one-sided permutation test, P < 10−4, n = 10,000 for all features except constituency depth and closing level; Extended Data Fig. 8). Notably, 8.5% (49 out of 579) of the neurons displayed an extreme degree of selectivity to specific linguistic features, as defined by a z score of 2.0 or more51, whereas almost none of the LFP sites displayed such selectivity (1.2%; χ2 contingency test, χ21 = 31.5, P = 2.0 × 10−8; Fig. 5). Thus, in contrast to the LFP signals, many of the neurons appeared to specialize in encoding specific linguistic features based on their tuning properties.

Discussion

Using wide-scale single-neuronal and local field potential recordings from across the human frontotemporal cortex in combination with natural language processing models, parsing and decoding techniques, we identify a finely detailed representation and partitioning of linguistic information during the production of natural speech. Many of the neurons changed their activities selectively based on the linguistic properties of the upcoming words even when they were closely correlated. They could also reliably predict their linguistic properties even when taken from sentences that differed in their context and topic—suggesting a process that could allow linguistic information not only to be encoded at a cellular level but to also be reliably read-out from the activity patterns of the neurons during natural speech.

To construct meaningful speech, linguistic analyses suggest that humans recursively assemble words into larger structures in a way that enables us to convey specific meanings4,5. Here we found cells that reflected not only basic lexical representations such as parts of speech, but that also tracked the higher-order syntactic structure of the sentences, from how the words grouped into phrases to their phrase mergers and order. These neurons responded selectively to these linguistic features while displaying little response to other features such as the sound-spectral properties of speech suggesting a process that could plausibly support the recursive (for example, hierarchical versus serial5,16,52,53 or modular54) nature of language. They also reveal some of the basic building blocks through which the syntactic structure of language can be robustly encoded and that could enable the generative construction of sentences during language production.

Collectively, we find that these cell populations provided a richly detailed picture of the words, phrases and sentences being produced. These cells uniquely described syntactic features such as the grammatical relationships between words and their parts of speech. They also distinguished the syntactic properties of the words from their semantic meanings, suggesting that these representations were dissociable at a cellular level. When modelled in aggregate, the activities of these populations could also reliably capture the specific combinations of features that described the words, even when decoded from distinct sentences. They could even capture contextual information about the specific sentences being construction, with contextual model predictivities increasing up to a second or more before utterance—revealing long contextual integration windows55 that are largely consistent with those predicted by prior behavioural56,57 and neurophysiological22,23 studies. Together, such a ‘bookkeeping’ process could therefore allow these cell ensembles not only to provide a combinatorial representation of the words being produced but to also enable their linguistic properties to be encoded at a highly granular level.

Another important finding from these recordings is that information encoded by these neurons was distributed broadly across the frontotemporal cortex. Previous imaging studies and more recent electrocorticographical recordings have demonstrated widespread engagement of areas that are more prominent on the left language-dominant hemisphere. These homotopic areas of activation span large parts of the lateral temporal and prefrontal cortices7,8,9,10,11 and have been shown to extend well beyond primary speech regions, such as the ventral premotor cortex41,58. They have also been found to reliably engage in language production and perception by validated language-localizer tasks, together revealing a broad amodal language-selective network of areas that are involved in human speech28,59. Here, consistent with these studies28,60, our observations provide tentative evidence for the distributed but left-lateralized and regionalized representation of linguistic information at the cellular scale. They are also consistent with recent animal studies that have suggested the redundant and distributed representation of sensorimotor information61, suggesting a process that could allow linguistic information to be both widely accessible and concentrated in specialized areas.

Finally, when examined locally, we find an interesting distinction between the encoding properties of individual neurons and their populations’ wider field potential patterns. Specifically, linguistic information represented by neurons was found to be largely distinct from that of their LFPs, even when recorded from the same focal cortical sites. Moreover, in contrast to their surrounding LFP patterns, the activities of many of these neurons appeared to be highly tuned to or specialized in responding to specific features, suggesting a hierarchy of information at a microscale level. By providing site-specific AP-to-LFP comparisons, these findings therefore highlight the heterogeneity of linguistic representations at a local population level and offer a prospective link by which to begin bridging our understanding of language processes across study modalities.

Together, these findings reveal some of the most basic neural elements that underlie human language and offer a prospective approach for studying its cortical organization at a combined micro, meso and macro scale. Given these observations, it is interesting to conjecture whether similar mechanisms or linguistic representations may be shared between speech production and perception, or whether neural activities may be optimally described by certain language processes5,16,52,53 and models54. Further investigations could also determine whether these findings generalize to other communicative modes, such as writing or scripted speech and further clarify how expressive elements like prosody and tone are represented within these contexts1,2,3,4,5,62,63. Additional studies will be needed to examine neuronal responses from specific eloquent areas, such as the left inferior frontal gyrus, and to record more widely across the brain in order to paint a more complete picture of language production. Here our findings provide an initial starting point from which to begin addressing these questions, and through which to study the basic cellular elements that underpin human language.

Methods

Single-neuronal recordings

Participants

All aspects of the study were approved and performed in strict accordance with guidelines set by the Massachusetts General Hospital Institutional Review Board (IRB) and Harvard Medical School for the ethical conduct of research. All of the participants included in the study were scheduled to undergo epilepsy monitoring30,33,34,70. The decision for surgery was made by a multidisciplinary team that commonly included neurologists, neurosurgeons, radiologists and neuropsychologists. Operative planning was made independently by the surgical team and without consideration of study participation. Once the decision to undergo surgery was made, the participant’s research candidacy was reviewed. The participants were enrolled only if the surgical plan was for cortical surface grid recordings for seizure monitoring. The patients were aged at least 18 years, and they had intact language function and were able to provide informed consent. A total of eight participants was included in the study (3 female and 5 male, mean age of 40 years; range, 27 to 52 years). All of the participants that were consented were free to withdraw from study participation at any time without impact on clinical care.

Microelectrode recordings

For each participant, multi-electrode microarrays (Utah arrays; Blackrock Microsystems) were implanted into an area planned for surgical resection. These micro-arrays (used for research) were placed within the cortex in tandem with the surface cortical grids (used for clinical monitoring) and were confirmed to be positioned in areas planned for resection based on intraoperative cortical mapping, as described previously30,71,72,73. The micro-arrays are configured in a 10 × 10 electrode channel mapping separated by 400 μm (96 recording electrodes covering an approximate area of 4 × 4 mm). For placement, the electrode arrays were implanted using a semi-automated impactor and were connected to a temporarily implanted female port. A total of eight arrays was implanted into eight participants. On the left hemisphere, two were in the anterior temporal cortex, one was in the posterior temporal cortex and one was in the prefrontal cortex. On the right hemisphere, one was in the anterior temporal cortex, two were in the posterior temporal cortex and one was in the prefrontal cortex. Together, these recordings lay in regions that have been shown to reliably engage in language production and sentence construction41,42,43,65,66 and to display robust language-selective responses41,42,44,65,74,75,76 and activation by validated language localizers on imaging studies28,41,45 (Fig. 1a). Each participant had a single array. The voltages were recorded and amplified via a 96-multichannel acquisition processor, and a band-pass filter was applied to the neuronal signals (150Hz to 8 kHz; 1-pole low-cut and 3-pole high-cut with 1,000× gain). The signals were then digitized at 30 kHz and stored offline. Audio recordings were tracked and aligned through a synchronization trigger that was integrated into both the neuronal and audio recording systems (Natus Medical). After clinical monitoring, all arrays were removed.

Single-unit isolation and LFP processing

To select and cluster APs of putative neurons, an automatic waveform detection was commonly performed using Kilosort 3.0 (https://github.com/MouseLand/Kilosort). The selected APs were then manually curated using Phy (https://phy.readthedocs.io/en/latest/, version 2.0b6). Candidate units were clustered based on their voltage waveform morphologies, spiking rates and their distribution profiles of inter-spike intervals, auto-correlation and cross-correlation profiles. We excluded units that did not demonstrate waveform stability over the course of recordings or that did not have inter-spike-interval profiles or waveform morphologies consistent with those of cortical neurons. Units from different sessions were recorded at separate time periods and across different linguistic contexts and were therefore sorted separately77,78,79. Finally, to confirm the quality of recordings, we used SpikeInterface to extract the measures of the single-unit sorting (v.0.102.3). Here the signal-to-noise ratio was defined as the ratio of the maximum amplitude of the mean spike waveform to the s.d. of the background noise on the same channel. Noise was computed through the median absolute deviation of the peak channel. Isolation distances were used to quantify the separation of putative units from their neighbours in principal component space. Here, after projecting spike waveforms into a lower-dimensional principal component space (across 5 principal components), the isolation distances were calculated using the Mahalanobis distance. Lastly, the inter-spike intervals for each unit were defined by the time intervals between consecutive spikes. Any putative units that had a high percentage of ISIs lower than 2 ms or firing rates lower than 0.1 Hz were excluded. Across all the single units, the mean signal-to-noise ratios were 2.5 ± 0.16 (mean ± s.d.), the mean isolation distances were 37.6 ± 2.7 and the mean peak inter-spike intervals (that is, the mean times that the inter-spike interval distributions peaked) were 8.7 ± 0.4 ms and were largely consistent with previous literature using analogous recording techniques32,80,81.

LFPs were obtained from the same electrode contacts as our single-neuronal recordings. Here, for each electrode of the Utah array (96 active channels in total per array), the raw voltages were processed through a series of notch filters to eliminate the AC powerline at 60 Hz and its higher-order harmonics. Then, a fourth-order Butterworth low-pass filter was applied with a cut-off frequency of 300 Hz to capture LFPs. Finally, the filtered signals were downsampled to 1 kHz for downstream analysis.

Language production and audio recordings

We tracked the activities of neurons as part of real-time conversations to study their responses during natural language production42,43,46,47,48,76,82. Here the participants responded freely to questions and prompts that were given ad hoc, therefore allowing them to produce phrases and sentences in a manner that reflected natural speech (Extended Data Table 1). The sentences were also constructed de novo (for example, rather than being simply read or repeated) and therefore enabled neuronal responses to be evaluated independently of explicit sensory cues. Thus, the participant may be prompted with a question such as “Do you want a phone right now?”, to which the participant may respond with the compound sentence “Yes, keep it there just in case, and then just bring that back over”. Therefore, in alignment with growing efforts in the field to study natural language21,42,43,46,47,48,76,82,83,84,85,86,87, the sentences produced by the participants varied naturalistically in structure and content, were constructed de novo and were representative of natural language production.

Together, the questions heard by the participants could be divided into 20 main topics and covered wide-ranging fields that included: dates and times, feelings and emotions, spatial locations and orientations, personal descriptions, comfort and reassurances, personal thoughts, conditions and situations, thoughts about specific events, general commentary, health, disagreement and personal conflicts, general activities, opinions, validation and agreement, uncertainty about events, passive questions, active questions, binary questions, visual attributes and social greetings. Quantifying the diversity of question topics across the participants, we also calculated the Shannon entropy of the topics for each participant as: $H(X)=-\sum _{x\in X}p(x)\log \,p(x)$, where x labels the topics (n = 20), and p(x) is the frequency of prompts belonging to each category. Given that each participant received an average of 14.1 ± 4.9 topics and with a theoretical maximum entropy of 3.0, using an even distribution across 20 topics, we find that the entropy ranged from 1.4 to 2.3, with an average of 2.0 ± 0.3 (mean ± s.d.), meaning that the topics were distributed broadly across the participants.

To align each word to the AP activities and LFPs, each word was transcribed and timestamped from the audio data in an automated manner using either Vosk API (16 kHz 16 bit mono; vosk-model-en-us-aspire-0.2) or WhisperX (https://github.com/m-bain/whisperX). After automatic transcription, each word and its timestamp were then manually validated and corrected by an independent scriber (Audacity; v.2.3). Words containing participant-identifying information were replaced by a set of common names that were independent of the participants.

Single-neuronal and local field responses to linguistic features

Sentence parsing and linguistic feature labelling

A sentence parsing approach was used to identify linguistic features on a word-by-word level as produced by the participants during natural speech. This approach, based on natural language processing models, allows for the basic linguistic properties of words (for example, their parts of speech) to be reliably labelled and tracked1,2,3,49,88. Here we used two complementary parsing techniques.

Constituency parsing for each sentence was obtained from the AllenNLP package49,88. Here the word’s part of speech (for example, adverb) was labelled following the Penn Tree Bank, and similar labels were grouped (Extended Data Table 1b; although note that different parsers have also been shown to display differences in labelling)89. The word’s constituent label was determined on the basis of which constituent the word belonged to, and was obtained by traversing the parse tree upward from the word to the next highest hierarchical level. The constituency depth of each word in the constituency tree was determined by the word’s hierarchical distance to the sentence root, and the word’s ordinal position was its order within the sentence. Finally, the closing level was determined by the number of finished phrases after the word was produced (Extended Data Fig. 1a). Words that were not located at the rightmost branch of the parse tree were assigned a closing level of 0, as they did not close any syntactic structures. Levels with a single child node were not counted as an extra level.

Dependency parsing for each sentence was similarly obtained from the AllenNLP package49,88 that followed the model of Deep Biaffine Attention for Neural Dependency Parsing90. Built on the verb of a sentence as the root, the dependency tree grows hierarchically with child nodes showing direct dependency on their parent node. Here the grammatical relationship between words (for example, object of a preposition) was determined based on their dependency. To allow comparison across features, the top 10 most frequent categories for each feature were selected for neuronal analysis. For depth, closing level and position, a maximum of 8 was capped on features with excessive layers/number of words with word position then shifted the range to 1–9 (Extended Data Table 1b).

Single-neuronal analysis

Selectivity of neuronal responses

To evaluate selectivity of the neurons to specific linguistic features (for example, parts of speech), we compared their firing rates for words of the target category (such as adverbs) against all words in other categories (for example, all non-adverbs; Extended Data Table 1b, d). This classification was performed for each linguistic category (such as nouns, verbs) and for each neuron. Statistical significance was determined using the one-sided Mann–Whitney U-test (α = 0.05), with P values adjusted using the FDR method for multiple comparisons across categories. To illustrate the specificity of neuronal responses within the population, we combined the selectivity of all neurons and performed a dimension reduction algorithm using UMAP transformation91. Here to characterize the tuning properties of the neurons during the canonical word planning period27, we used a window from −400 to −100 ms aligned to word onset. However, the precise window was found to have little effect on the proportions of selective neurons identified (−400 ms or −1,100 ms before word onset; χ2 test, P > 0.25 for all syntactic features).

Modulation of neuronal responses

While the selectivity of a neuron describes whether the neuron responds preferentially to a specific linguistic feature (for example, adverbs), its neuronal modulation quantifies the magnitude of this response. We calculated the baseline-normalized z scores for individual neuronal firing rates as:

$${Z}_{i}=\frac{{\mu }_{i}-{\mu }_{\mathrm{other}}}{{\sigma }_{\mathrm{other}}}$$

where the index i labels the firing rates for the linguistic feature to which the neuron is selective, ‘other’ labels the firing rates for all other features; μ labels the mean and σ labels the s.d. To assign a z score to each linguistic feature (for example, parts of speech), we identified the category within that feature that yielded the most significant response (the lowest P value) and selected its corresponding z score.

Controls for generalizability of neuronal responses

To ensure that the selectivity of the neurons was not biased by the particular sets of words, we performed an additional bootstrapping procedure in which we randomly subsampled the words until reaching 80%, 85%, 90% and 95% of the number of words in the original dataset. We then repeated this procedure 100 times to ensure that we broadly sampled across the available words (Extended Data Fig. 1c).

Controls for collinearity with multiple regression analysis

To confirm the selectivity of the neurons and to further confirm that their responses were not influenced by potential collinearities between the features, we performed a multiple regression analysis. Here we used Lasso regression (regularization α = 0.01) to predict neuronal firing rates from one-hot embeddings of linguistic features, thus accounting for the potential effects of multicollinearity between variables51,92. This yielded standardized coefficients (β) for each feature. Significance was then determined by randomly shuffling the neural firing rates 10,000 times to generate a null distribution of β values. Finally, P values were derived by comparing observed coefficients against this distribution, with a Bonferroni correction applied to account for multiple comparisons. Second, we evaluated for the potential effect of other related properties such as word frequency, length and surprisal (accounting for 7.9%, 3.6% and 11.9% of the neurons, respectively), but found little effect on the proportion of selective neurons (Extended Data Table 1c). Here, word-level surprisal was obtained from the Vicuna-7B model whereby input text for each sentence was tokenized, and token-level surprisal was computed as the negative log-probability given its preceding context (derived from the model’s output logits)93. Word-level surprisal was defined as the sum of surprisals across all subword tokens comprising that word given their sentence context, whereas word frequency was calculated more simply based on the occurrence of each unique word across all the recorded sentences and sessions. Word length was calculated as the number of letters in each word and was top coded at eight characters to reduce the influence of outliers94. Word surprisal, frequency and length were all discretized to allow for comparison with the other linguistic features. Finally, we separately identified neurons that maintained consistent selectivity using a sequential feature selection approach that removes correlated features or features that are uninformative51,95 and again found little effect on the proportion of selective neurons (Extended Data Table 1c reports on the percentages of neurons that were consistently selected in both approaches).

Controls for specificity and overlap between neurons

To further evaluate whether and to what degree the tuning properties of the neurons overlapped, we calculated the percentage of cells that demonstrated significant selectivity for two or more distinct features (for example, parts of speech and dependencies). Here, for every feature pair, we first constructed a neuron × feature binary matrix indicating selectivity for each linguistic feature (Fig. 2a,c). Meanwhile, we independently shuffled the selectivity assignments for each feature across neurons. This preserved the marginal selectivity rates of neurons for each individual feature, while breaking only the joint structure—the association between a neuron being selective for one feature and also being selective for another feature. This permuted data, in turn, is the null distribution needed to test whether co-selectivity is different from expected under independence. Next, we calculated the co-selectivity for each pair of features using the shuffled data. These steps were repeated for 105 iterations to generate a null distribution of co-selectivity values for each feature pair51,96. The observed overlap for each feature pair was then compared against this null distribution to determine whether it differed significantly from chance-level independence (P < 0.05, two-sided permutation test, Bonferroni corrected for the number of pairs tested). Additional confirmatory analysis of independence included a χ2 test and Fisher’s exact test that compare the observed to expected overlapping frequencies of neurons for all possible pairs (P < 0.05, Bonferroni corrected for the number of pairs tested). A χ2 test and a Fisher’s exact test were also performed to confirm the robustness of the results. To further examine the relationship between neural activity (the neurons’ firing rates) and feature pairs (parts of speech and dependencies), we first identified the subset of neurons that exhibited significant selectivity to both parts of speech and dependencies. For these neurons, we then calculated their z-scored firing rates by comparing the features to which they were selective versus other categories. Finally, we performed a linear regression analysis in which we predicted the z scores for one feature (dependencies) using the z scores of the other (parts of speech), using the neurons as observations. Here, the P value reflected whether the slope of the linear regression was significantly different from zero.

Controls for lower-order word properties

To evaluate neuronal responses in relation to low-order word properties, we quantified the frequency (pitch) of each uttered word (Parselmouth interface for Praat, v.0.4.3)97. The mean pitch was extracted across the duration of each word and aligned to the word production planning period.

Comparing neuronal activities across hemispheres and cortical areas

To allow for comparison across cortical areas (anterior temporal cortices, posterior temporal cortices and posterior frontal cortices) and hemispheres (left and right), we evaluated differences in the proportions of neurons that displayed selectivity using a χ2 test. We also evaluate the degree of modulation using Wilcoxon rank-sum test for hemispheres and a Kruskal–Wallis H-test for cortical areas. We next used a one-sided permutation test to further confirm differences in modulation across areas and hemispheres. Finally, two-way ANOVA was used to confirm these findings and to further test potential interaction effects.

Comparing single-neuronal to local field activities

In addition to the APs, we evaluated the LFP activities recorded from each site50,98,99,100. We calculated the selectivity and modulation of the LFPs that were recorded across all channels. To further allow for a comparison between single neuronal activities and LFPs50,101, we spatially aligned the activity of each neuron with the specific LFP channels from which they were recorded. We then sorted and visualized the single-neuronal z scores and applied this same sorting order to the LFP z scores (Fig. 5). A χ2 contingency test was used to evaluate the proportions of neurons and LFP sites that displayed selectivity and a one-sided permutation test was used to evaluate the overlap in linguistic selectivities (n = 1,000). A two-sided permutation test was used to compare their degrees of modulation (n = 10,000).

Neuronal decoding analysis

Decoding linguistic features from neural activity

Decoding model

A multinomial logistic regression classifier (Scikit-learn package v.0.21.3) was used to decode the linguistic features from neuronal activity with L2 regularization to minimize the loss function as

$$\mathop{\text{min}}\limits_{{\bf{w}},c}\frac{1}{2}{{\bf{w}}}^{T}{\bf{w}}+C\mathop{\sum }\limits_{i=1}^{n}\text{log}(\text{exp}(-{y}_{i}({{\bf{x}}}_{i}^{T}{\bf{w}}+c))+1)$$

where C is the inverse of the regularization strength; vector w are coefficients and c is the bias; xi represents the feature vector of neuronal firing rates, ${y}_{i}\in [-\mathrm{1,1}]$ labels the actual category for word index $i\in [1..n]$, which labels the number of observations (words). We used balanced class weights, with a regularization strength of 0.0001. The mean decoding accuracies were obtained from the independent observations on the test dataset.

Neural population decoding

As described previously27,102,103,104,105, to perform the population-level decoding analysis, we aggregated neuronal activity across all participants and sessions. Here we performed a stratified sampling to split the training and testing datasets51,106. Specifically, for a given feature (for example, parts of speech), we randomly selected 70% of trials belonging to a specific category (such as nouns) for the training set and an independent 30% of trials for the testing set. These were repeated 1,000 times with independently drawn random partitions on each iteration for the non-shuffled data. The neural-feature matrix ${X}_{\mathrm{combined}}=[{X}^{(1)}|{X}^{(2)}|\cdots {X}^{(S)}]{\epsilon }{{\rm{{\mathbb{R}}}}}^{{W\times N}_{\mathrm{total}}}$, where ${N}_{\mathrm{total}}={\sum }_{s=1}^{S}{N}_{s}$ is the total number of significant neurons pooled across S sessions, and W labels the number of words. ${X}^{(s)}$ is composed by ${x}_{w,n}^{(s)}$, which is the firing rate of neuron n in session s for word w. The mean decoding accuracy was obtained as the average decoding performance across the sampling iterations.

Thus, whereas the individual words may differ across sessions, their categories (for example, words that are nouns) could be directly decoded and compared. The neural activity vectors for these subsampled words were then horizontally concatenated across all selective neurons from all participants and sessions to create a unified training and testing matrix, in which the rows represent the aggregated feature observations and the columns represent the combined pool of relevant neurons from all sessions. This process was then repeated for every category (such as nouns) per linguistic feature (parts of speech), with the resulting matrices vertically concatenated to form the final datasets. Finally, to assess statistical significance and estimate chance-level performance, we conducted a permutation test by randomly shuffling the feature labels and repeating the entire modelling procedure to generate a null distribution (n = 1,000). The P value for the permutation test was then derived by calculating the frequency with which the shuffled accuracies exceeded the actual decoding accuracy (P < 0.05; Fig. 2d).

Single and bootstrapped observations

All primary decoding results in the main text, including those shown in Fig. 2, are provided for individual words. However, to further account for the sparse spiking of neurons per word and to ensure high fidelity of decoding27,107,108,109,110, we also generated new observations within the same category by randomly selecting and averaging the neural firing rates across ten observations, and then averaged the firing rates to create a new observation. This procedure therefore further accounts for the neurons’ sparse spiking per word while importantly preserving the linguistic features being decoded. To account for the difference of word numbers in each section, we used this bootstrapping with replacement to standardize the number of observations per category across session. Here the above process was repeated until we created the same number of observations (500 new observations per unique category for the training dataset and 200 for the testing set). This procedure, including the randomized data partitioning and bootstrapping, was repeated for 1,000 iterations to ensure robust estimation and to account for variability in the sampling process. Finally, we confirmed the consistency of observation-averaged decoding by repeating the bootstrapping procedure but varying the number of bootstrapped observations between 10, 20 and 30 (Extended Data Fig. 2a) as well as by selecting neurons that specifically displayed selectivity on model training (Extended Data Fig. 2b) and by selecting only neurons that displayed selectivity for individual participants (Extended Data Fig. 2e). Unless specified, all decoding analyses were performed using observation-averaged activity.

Generalizability of linguistic representations

To examine the generalizability of the linguistic representations across contexts, we used Vicuna model embeddings to cluster the sentences into topics based on their model embeddings, as described previously111,112. This step clusters the sentences into distinct topics based not only on the lexical meanings of the words but also on their sentence contents and contexts. We used the top-most layer of the model’s hidden embeddings for all words per sentence, and categorize into three or four equal clusters (k-means; Extended Data Fig. 2f).

To further confirm the generalizability of linguistic representations across the participants, we performed a bootstrapping procedure with 5 observations, then trained a classifier on a random selection of 50 observations for each category, and tested on randomly selected 20 for each category. This process was then repeated with 50 repetitions for each session. We then applied a Friedman test to examine whether dropping one participant significantly impacts the observation-averaged decoding accuracy. We also confirmed the relationship between the decoding performance and the number of neurons recorded27,113 by randomly sampling cells from across the participants and showing that the number of cells correlated with the decoding performance (Extended Data Fig. 2c). Finally, for further comparison, we repeated these analyses but now evaluated decoding performances across individual cortical areas and hemispheres.

Language models

Syntactic and semantic embedding models

Embedding models were used to capture the combination of linguistic features that described the specific words being produced. For example, even though the word ‘dog’ in the sentence ‘The dog ate the bone’ and ‘The man walked the dog’ is the same, the models distinguish its varying syntactic roles—such as nominal subject versus object—as well as the distinct semantic representations between different nouns like ‘dog’ and ‘bone’1,2,3,49,88. To first examine the combination of syntactic features that defined each word, we constructed embedding models that tracked all features obtained from the constituency and dependency parsers by representing each category (for example, nouns within parts of speech) with a one-hot vector of zeros and ones, and concatenating them for all syntactic features. This combination of features therefore provided a comprehensive representation of all tested features per word (Fig. 3a). Here, the embedding vector ${\bf{E}}(w)$ was written as

$${{\bf{E}}}_{\mathrm{syn}}(w)={{\bf{v}}}_{1}(w)\oplus {{\bf{v}}}_{2}(w)\oplus \cdots \oplus {{\bf{v}}}_{N}(w)$$

where w is the word for which the embedding is being constructed and vi(w) is one-hot vector representation of the specific value of the ith feature category for word w.

Second, we examined the lexical semantic properties of each word27 using a pretrained semantic embedding model (word2vec)114 to obtain a vectoral representation for each word. This model maps the semantic meaning of words onto an embedding space in which words that have different meanings also have different vectors. Embeddings were obtained from the word2vec model trained on the Google dataset (about 100 billion words, gensim, v.4.3.2). Out-of-vocabulary words (approximately 7% of all words) were excluded. We also verified that zero-vector padding of these out-of-vocabulary words had little effect on predictivity results (two-sided permutation test, n = 105; P = 0.37 and P = 0.60 for semantic only and combined embeddings; see Language model predictivity).

Finally, to examine the words’ semantic and syntactic properties, we combined the syntactic and semantic vectors to form a shared embedding model which was defined here as

$${{\bf{E}}}_{\mathrm{comb}}(w)={{\bf{E}}}_{\mathrm{syn}}(w)\oplus {{\bf{E}}}_{\mathrm{sem}}(w)$$

where Esem is the semantic embedding vector for word w.

Higher-order contextual embedding models

While the above embedding models enabled us to examine how combinations of features are encoded by neurons on a word-by-word level, they did not determine whether the neurons also reflected their specific contexts (for example, the word ‘bark’ within ‘The dog has a loud bark’ versus ‘The tree has thick bark’)6,115,116,117,118. Therefore, to further address this question, we also used contextual models that were capable of further capturing these contextual variations22,23. Here we used Vicuna-7B (v.1.1), a large language model that was fine-tuned from the LLaMA architecture to serve as a contextual framework for generating high-dimensional vector embeddings. Vicuna-7B is an open-source autoregressive language model based on a transformer architecture and consists of 32 transformer layers and 4,096 embedding dimensions and was further tuned on large-scale, user-shared conversations.

Here the contextual model embeddings were obtained by inputting words from each sentence produced by the participants, such that each sentence was associated with corresponding hidden embeddings for each layer23,42. Next, to extract word embeddings across varying contextual lengths for each sentence (Fig. 3c), the word strings were generated by sliding a window of size N across the preceding word sequence. These word strings were then processed through the Vicuna-7B model to extract hidden state representations for all layers:

$${{\bf{E}}}_{\mathrm{contextual}}({w}_{i},L)={{\bf{H}}}_{L}(w)$$

where HL(w) is the hidden state vector extracted from layer L of the model for word sequence $\{{w}_{1},{w}_{2},\cdots ,{w}_{N}\}$. In combination, this approach therefore allowed us to systematically manipulate the amount of preceding context provided to the model, ranging from two words (that is, N = 2) to extended ten-word sequences. The hidden state activations across all layers were then extracted at the last token and concatenated, thus providing contextual information for each word in relation to its specific sentence context and its varying context lengths. Further validation experiments include using a diverse suite of large language models (Extended Data Fig. 3 and Extended Data Table 2). All models and tokenizers were loaded through Huggingface in Python119 and run with a NVIDIA GeForce RTX 4090 GPU.

Predictivities of neurons across language models

Language model predictivity

We examined the degree to which embeddings from each of the language models (syntactic, semantic and higher-order contextual models) were predictive of the neurons’ activity patterns. First, to allow for comparison across models and to limit overfitting, we first performed principal component analysis (optimal dimensionality was determined as 30 principal components; Extended Data Fig. 4a) from each model layer (constrained within a range of [−10,000, 10,000] to mitigate the influence of outliers). These matrices were then temporally assigned using non-overlapping 100 ms bins using embeddings for the bins that occupied each word. Similarly, we calculated the neural firing rates for individual neurons by binning their APs into 100 ms intervals. The overall population neural activity was calculated from the combined firing rate patterns of all neurons and were smoothed by a three-bin moving average.

Next, we calculated predictivity120, which is the Pearson correlation coefficient between the language model’s predicted neuronal activities and the observed neuronal activity:

$$\text{Predictivity}\,=\,\frac{\mathrm{cov}(y,\hat{y})}{\sqrt{\mathrm{var}(y)\times \mathrm{var}(\hat{{y}})}}$$

where y labels the actual neuronal firing rates, and $\hat{y}$ labels the predicted neuronal firing rates based on the model’s embeddings. cov($\cdot $) labels the covariance matrix, and $\mathrm{var}(\cdot )$ labels the variance.

Finally, to predict neuronal activity from the models’ embeddings, we performed a Ridge regression (Python scikit-learn package; v.0.21.3) using the dimensionally reduced embeddings from each contextual model and layer to minimize the cost function:

$$\mathop{\text{min}}\limits_{{\bf{w}},c}\alpha \,{{\bf{w}}}^{T}{\bf{w}}+\mathop{\sum }\limits_{i=1}^{n}({y}_{i}-({{\boldsymbol{x}}}_{i}^{T}{\bf{w}}+c){)}^{2}$$

where yi is the neuronal activity for observation $i\in [1..n]$; vector w are coefficients and c is the bias; xi represents the feature vector of language model embeddings; α is the regularization strength that was set as 0.01. The predictivities were then calculated separately for each session. This approach therefore enabled us not only to obtain an aggregate account of the predictivities of all recorded neurons but also to evaluate their consistency across participants and sessions. The regression weights were trained using data from all but one sentence within each session, and were then used to predict neuronal responses for the withheld sentence. By repeating this process across all sentences, we obtain a full set of predicted activities for each layer of the model.

For single-neuronal predictivities, we calculated the predictivity from each individual neuron’s firing rates and then averaged their predictivities (Fig. 3e (right) and 3f), while the mean population predictivities were obtained by calculating the predictivity from the mean firing rates from all neuronal firing rates in a session and then averaging across sessions (weighted by the number of neurons; Fig. 3e (left)). Here predictivities in Fig. 3b were calculated from the averages during the pre-utterance period and were evaluated with fixed temporal offsets that aligned neuronal activity (obtained before utterance).

To establish a baseline chance performance for predictivity, we generated a null distribution by randomly permuting the average neuronal activities and recalculating the predictivity on this shuffled dataset. This randomization procedure was conducted for the same number of iterations as the primary modelling analysis. We then performed a one-sided permutation test to determine whether the predictivity of the actual data was significantly higher than that of the shuffled control. For comparison of predictivities across hemispheres and brain areas, we used a rank-sum test and a Kruskal–Wallis test, respectively.

Neural predictivity controls

To confirm that the predictivities reflected information specifically captured by the language models, we performed an additional control analysis using randomly generated non-meaningful stimuli as inputs to the contextual model. Here the tokenizer was replaced with a random drawing of a number (index of a token) to ensure that no linguistically meaningful information was input into the model. Furthermore, to confirm that the predictivity reflects the contextual information specific to each sentence and cannot be simply explained by the order of words, we also performed the swap sentence control. We first grouped the sentences produced by the participants based on their word length to ensure matched comparisons. Within these word-length-matched groups, we then randomly shuffled the sentences. These swapped-sentence embeddings were then used to calculate the predictivity of neuronal population activity. To allow for consistency with our contextual predictivity analyses above, both of these controls were performed at the population level, and both results were reported at −1,000 ms before utterance.

Neural predictivity dynamics and mapping

To characterize the temporal evolution of neural predictivity, we systematically shifted the temporal alignment between neuronal firing rates and contextual model embeddings by introducing a series of lead and lag offsets in 100 ms increments. This approach enabled us to identify the specific latency at which neural activity most strongly reflects the linguistic information encoded in the contextual model embeddings across a wide temporal window, spanning from 2,000 ms before word onset (pre-articulatory planning) to 2,000 ms after onset (post-articulatory processing). This was applied to both the population’s firing rates (Fig. 3e (left)) and the individual neuronal firing rates (Fig. 3e (right)).

Finally, to illustrate the relationship between the neurons’ peak predictivity and their linguistic feature selectivity, we calculated the neuron’s average predictivity for each model layer and time point. This approach therefore enabled us to map the layer and temporal lag at which neuronal predictivity peaked for each neuron based on its encoding properties (Fig. 3f and Extended Data Fig. 4). For all contextual predictivity results in Fig. 3c,d and Extended Data Figs. 3 and 4, we used the −1,000 ms timepoint relative to the embeddings, which was determined as the approximate predictivity peak observed in our time-resolved population analysis. However, the exact timing used for analyses was not critical as the contextual model significantly outperformed the other syntactic, semantic and combined models (one-sided permutation test, P < 0.004 at −1,000 ms across all comparisons; P < 0.042 at −400 ms to −100 ms interval across all comparisons; n = 105) and exceeded chance for both tested intervals (one-sided permutation test, P < 10−5, n = 105).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All primary data supporting the main findings of this study are available at Zenodo121 (https://doi.org/10.5281/zenodo.20128462).

Code availability

All code necessary for reproducing the main findings of this study are available at Zenodo121 (https://doi.org/10.5281/zenodo.20128462). All software used in this study are listed in the Reporting Summary.

References

Brennan, J. R. Language and the Brain: A Slim Guide to Neurolinguistics (Oxford Univ. Press, 2022).

Cooper, W. & Paccia-Cooper, J. Syntax and Speech (Harvard Univ. Press, 1980).

Horrocks, G. Generative Grammar (Longman Linguistics Library, 1987).

Coopmans, C. W., Kaushik, K. & Martin, A. E. Hierarchical structure in language and action: a formal comparison. Psychol. Rev. 130, 935–952 (2023).

Article

PubMed

Google Scholar

Nelson, M. J. et al. Neurophysiological dynamics of phrase-structure building during sentence processing. Proc. Natl Acad. Sci. USA 114, E3669–E3678 (2017).

Article

CAS

PubMed

PubMed Central

Google Scholar

Chomsky, N. Syntactic Structures (Walter de Gruyter, 2002).

Wacongne, C. et al. Evidence for a hierarchy of predictions and prediction errors in human cortex. Proc. Natl Acad. Sci. USA 108, 20754–20759 (2011).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Fedorenko, E., Duncan, J. & Kanwisher, N. Broad domain generality in focal regions of frontal and parietal cortex. Proc. Natl Acad. Sci. USA 110, 16616–16621 (2013).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Huth, A. G., de Heer, W. A., Griffiths, T. L., Theunissen, F. E. & Gallant, J. L. Natural speech reveals the semantic maps that tile human cerebral cortex. Nature 532, 453–458 (2016).

Article

ADS

PubMed

PubMed Central

Google Scholar

Pereira, F. et al. Toward a universal decoder of linguistic meaning from brain activation. Nat. Commun. 9, 963 (2018).

Article

ADS

PubMed

PubMed Central

Google Scholar

Tang, J., LeBel, A., Jain, S. & Huth, A. G. Semantic reconstruction of continuous language from non-invasive brain recordings. Nat. Neurosci. 26, 858–866 (2023).

Article

CAS

PubMed

PubMed Central

Google Scholar

Bouchard, K. E., Mesgarani, N., Johnson, K. & Chang, E. F. Functional organization of human sensorimotor cortex for speech articulation. Nature 495, 327–332 (2013).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Mesgarani, N., Cheung, C., Johnson, K. & Chang, E. F. Phonetic feature encoding in human superior temporal gyrus. Science 343, 1006–1010 (2014).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Hamilton, L. S., Oganian, Y., Hall, J. & Chang, E. F. Parallel and distributed encoding of speech across human auditory cortex. Cell 184, 4626–4639 (2021).

Article

CAS

PubMed

PubMed Central

Google Scholar

Goldstein, A. et al. Shared computational principles for language processing in humans and deep language models. Nat. Neurosci. 25, 369–380 (2022).

Article

CAS

PubMed

PubMed Central

Google Scholar

Pallier, C., Devauchelle, A. D. & Dehaene, S. Cortical representation of the constituent structure of sentences. Proc. Natl Acad. Sci. USA 108, 2522–2527 (2011).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Fedorenko, E. et al. Neural correlate of the construction of sentence meaning. Proc. Natl Acad. Sci. USA 113, E6256–E6262 (2016).

Article

CAS

PubMed

PubMed Central

Google Scholar

Glasser, M. F. et al. A multi-modal parcellation of human cerebral cortex. Nature 536, 171–178 (2016).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Molinaro, N., Paz-Alonso, P. M., Dunabeitia, J. A. & Carreiras, M. Combinatorial semantics strengthens angular-anterior temporal coupling. Cortex 65, 113–127 (2015).

Article

PubMed

Google Scholar

Pylkkänen, L. The neural basis of combinatory syntax and semantics. Science 366, 62–66 (2019).

Article

ADS

PubMed

Google Scholar

Giglio, L., Ostarek, M., Sharoh, D. & Hagoort, P. Diverging neural dynamics for syntactic structure building in naturalistic speaking and listening. Proc. Natl Acad. Sci. USA 121, e2310766121 (2024).

Article

CAS

PubMed

PubMed Central

Google Scholar

Goldstein, A. et al. Alignment of brain embeddings and artificial contextual embeddings in natural language points to common geometric patterns. Nat. Commun. 15, 2768 (2024).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Schrimpf, M. et al. The neural architecture of language: integrative modeling converges on predictive processing. Proc. Natl Acad. Sci. USA 118, e2105646118 (2021).

Article

CAS

PubMed

PubMed Central

Google Scholar

Leonard, M. K. et al. Large-scale single-neuron speech sound encoding across the depth of human cortex. Nature 626, 593–602 (2024).

Article

ADS

CAS

PubMed

Google Scholar

Khanna, A. R. et al. Single-neuronal elements of speech production in humans. Nature 626, 603–610 (2024).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Hagoort, P. On Broca, brain, and binding: a new framework. Trends Cogn. Sci. 9, 416–423 (2005).

Article

PubMed

Google Scholar

Jamali, M. et al. Semantic encoding during language comprehension at single-cell resolution. Nature 631, 610–616 (2024).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Fedorenko, E., Ivanova, A. A. & Regev, T. I. The language network as a natural kind within the broader landscape of the human brain. Nat. Rev. Neurosci. 25, 289–312 (2024).

Article

CAS

PubMed

PubMed Central

Google Scholar

Huth, A. G., Nishimoto, S., Vu, A. T. & Gallant, J. L. A continuous semantic space describes the representation of thousands of object and action categories across the human brain. Neuron 76, 1210–1224 (2012).

Article

CAS

PubMed

PubMed Central

Google Scholar

Chan, A. M. et al. Speech-specific tuning of neurons in human superior temporal gyrus. Cereb. Cortex 24, 2679–2693 (2014).

Article

PubMed

Google Scholar

Rubin, D. B. et al. Interim safety profile from the feasibility study of the braingate neural interface system. Neurology 100, e1177–e1192 (2023).

Article

PubMed

PubMed Central

Google Scholar

Nicolelis, M. A. L. Methods For Neural Ensemble Recordings (CRC Press, 2008).

Chapeton, J. I., Wittig, J. H. Jr., Inati, S. K. & Zaghloul, K. A. Micro-scale functional modules in the human temporal lobe. Nat. Commun. 13, 6263 (2022).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Truccolo, W. et al. Single-neuron dynamics in human focal epilepsy. Nat. Neurosci. 14, 635–641 (2011).

Article

CAS

PubMed

PubMed Central

Google Scholar

Indefrey, P., Hagoort, P., Herzog, H., Seitz, R. J. & Brown, C. M. Syntactic processing in left prefrontal cortex is independent of lexical meaning. Neuroimage 14, 546–555 (2001).

Article

CAS

PubMed

Google Scholar

Klaus, J. & Schutter, D. J. The role of left dorsolateral prefrontal cortex in language processing. Neuroscience 377, 197–205 (2018).

Article

CAS

PubMed

Google Scholar

Holland, R. & Lambon Ralph, M. A. The anterior temporal lobe semantic hub is a part of the language neural network: selective disruption of irregular past tense verbs by rTMS. Cereb. Cortex 20, 2771–2775 (2010).

Article

PubMed

Google Scholar

Matchin, W. & Wood, E. Syntax-sensitive regions of the posterior inferior frontal gyrus and the posterior temporal lobe are differentially recruited by production and perception. Cereb. Cortex Commun. 1, tgaa029 (2020).

Article

PubMed

PubMed Central

Google Scholar

Murphy, E., Rollo, P. S., Segaert, K., Hagoort, P. & Tandon, N. Multiple dimensions of syntactic structure are resolved earliest in posterior temporal cortex. Prog. Neurobiol. 241, 102669 (2024).

Article

PubMed

Google Scholar

Buchsbaum, B. R., Hickok, G. & Humphries, C. Role of left posterior superior temporal gyrus in phonological processing for speech perception and production. Cogn. Sci. 25, 663–678 (2001).

Article

Google Scholar

Wolna, A., Wright, A., Casto, C., Lipkin, B. & Fedorenko, E. The extended language network: language selective brain areas whose contributions to language remain to be discovered. Preprint at bioRxiv https://doi.org/10.1101/2025.04.02.646835 (2025).

Cai, J. et al. Natural language processing models reveal neural dynamics of human conversation. Nat. Commun. 16, 3376 (2025).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Castellucci, G. A., Kovach, C. K., Howard, M. A. III, Greenlee, J. D. & Long, M. A. A speech planning network for interactive language use. Nature 602, 117–122 (2022).

Article

ADS

CAS

PubMed

PubMed Central

Google Scholar

Pylkkänen, L., Bemis, D. K. & Elorrieta, E. B. Building phrases in language production: an MEG study of simple composition. Cognition 133, 371–384 (2014).

Article

PubMed

Google Scholar

Lipkin, B. et al. Probabilistic atlas for the language network based on precision fMRI data from >800 individuals. Sci. Data 9, 529 (2022).

Article

PubMed

PubMed Central

Google Scholar

Zada, Z. et al. A shared model-based linguistic space for transmitting our thoughts from brain to brain in natural conversations. Neuron 112, 3211–3222

View original source — Nature ↗

ShareShare on X Share on Facebook

How the brain builds sentences, neuron by neuron

Nature

TechnologyJun 17, 2026 · 1 min

Mapping the neuronal building blocks of human language with language models

Related stories

How the brain builds sentences, neuron by neuron

Daily briefing: The brain builds a sentence neuron by neuron

A mosaic of whole-body representations on the human precentral gyrus

Inside MisoTTS, an 8B Speech Model Built for Voice Continuation