
The Cyborg Foundation Model

Upgrading Neuralink with our pretrained multimodal ensemble models



The landscape of Motor BCI technology is currently fragmented, with various models and systems developed in isolation. This disjointed approach often yields solutions that, while innovative, fail to address the comprehensive needs of users, particularly those with disabilities. The lack of integration among these disparate systems hampers the user experience and limits the potential for widespread adoption and practical utility of Motor BCIs in enhancing daily life.


A critical issue at the heart of this fragmentation is the entanglement of neural signals, often referred to as the “compositional code.” This phenomenon, where the same neural pathways are implicated in multiple and diverse motor tasks, complicates the decoding process, making it challenging to isolate specific intended movements from the cacophony of neural activity. As discussed in Integrated Decoding, while this entanglement renders simple predictive models ineffectual, there are two major causes for optimism.


First, dimensionality reduction techniques work well to differentiate body-part activations from movement trajectory codes. Second, increasing the diversity of body parts and decoding tasks paradoxically improves integrated decoding performance by enriching learned neural manifolds.


Even so, decoding diverse tasks across contexts poses a major challenge to the Motor BCI community. The high performance of existing specialized models depends on leveraging assumptions about the type of task being executed in an experimental context. For decoders “in the wild,” the most plausible way forward is to identify task contexts in the real world and leverage that knowledge just as we do in the lab. We have identified several promising ways to identify task context, which can potentially bring BCIs out of the lab and into the hands of those who need them.


Our computational ambition is to harness the potential of deep learning models, especially those in the Generative AI domain, to develop context-aware multimodal ensemble decoders that integrate seamlessly with VR, delivering highly valuable outcomes for Users and opening new avenues for Motor BCI applications.


Context Identification


As seen in Disentangling Contextual Neural Dynamics, it is often possible to recover task context directly from neural data. However, there are limitations to this approach. The human motor cortex is just one part of a large system that orchestrates movements across the body, and many aspects of this system involve control signals derived from the experienced context. This subsection discusses several theoretical possibilities that could substantially bridge the gap between lab Motor BCIs and those that restore Users' lifestyles.


First and foremost, any clinically viable Motor BCI should leverage all contextual information that it can feasibly collect. Across settings, it is possible to collect behavioral data that could provide context to an integrated Motor BCI. For example, eye-tracking technologies in VR environments could be leveraged. By determining the focal point of a User’s gaze, Motor BCIs can infer the User’s current task or intention. In situations where eye tracking is infeasible, recognizing objects within the User’s immediate environment through physical or virtual cameras can provide crucial contextual clues. This object recognition can help delineate the setting in which the User operates, thus offering a more nuanced understanding of their intended actions. Task-specific Motor BCI decoders can then be deployed.
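As a rough illustration of this idea, the sketch below maps a recognized gaze target or nearby object to a task context and a matching specialist decoder. The mapping tables and function names are hypothetical placeholders for illustration, not an actual uCat or OpenXR interface.

```python
from typing import Optional

# Hypothetical sketch: mapping behavioral cues (gaze target, recognized
# objects) to a task context and a matching specialist decoder.
GAZE_CONTEXT_MAP = {"keyboard": "typing", "doorway": "locomotion", "cup": "reach_and_grasp"}
DECODER_REGISTRY = {"typing": "dexterous_hand_decoder",
                    "locomotion": "full_body_locomotion_decoder",
                    "reach_and_grasp": "arm_and_hand_decoder"}

def infer_context(gaze_target: Optional[str], nearby_objects: list[str]) -> Optional[str]:
    """Prefer the fixated object; fall back to any recognized nearby object."""
    if gaze_target in GAZE_CONTEXT_MAP:
        return GAZE_CONTEXT_MAP[gaze_target]
    for obj in nearby_objects:
        if obj in GAZE_CONTEXT_MAP:
            return GAZE_CONTEXT_MAP[obj]
    return None  # context ambiguous; caller may fall back to another strategy

def select_decoder(context: Optional[str]) -> Optional[str]:
    """Return the specialist decoder registered for the inferred context."""
    return DECODER_REGISTRY.get(context)

print(select_decoder(infer_context("keyboard", ["cup"])))  # -> dexterous_hand_decoder
```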


While machine learning approaches offer broad applicability, Rules-Based Context Recognition within UI design presents a predictable and structured approach to estimating User needs, particularly in VR. Video games and interactive applications often employ contextual rules (such as geometric proximity) to intuitively switch control mechanisms, enhancing user interaction without the need for explicit commands. Well-designed contextual rules within UIs can enable users to seamlessly transition between tasks, leveraging the intuitive layout and design cues to guide their actions. In OpenXR terminology, these may be the ActionSets that VR application developers create to match specific available actions with particular application modes (i.e., what actions the User can or cannot take). Specialist BCI decoders can then be deployed at the behest of a User. For example, the User might approach a desk and the UI presents an option to sit. At this point, the User switches from a full-body locomotion decoder to a dexterous hand and arm decoder while in the explicitly defined context of sitting at the desk (Photo 4).



Photo 4; Kiwihug, 2024: Recognizing objects uncovers potential movement intentions.
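To make the desk example concrete, here is a minimal sketch of a geometric-proximity rule in the spirit of an ActionSet switch. The threshold, decoder names, and function are assumptions for illustration, not part of OpenXR or uCat's implementation.

```python
import numpy as np

SIT_PROXIMITY_M = 0.8  # assumed threshold in meters for "near the desk"

def active_decoder(user_pos: np.ndarray, desk_pos: np.ndarray, user_confirms_sit: bool) -> str:
    """Switch decoders when the User approaches the desk and confirms sitting."""
    if np.linalg.norm(user_pos - desk_pos) < SIT_PROXIMITY_M and user_confirms_sit:
        # Explicitly defined "seated at desk" context.
        return "dexterous_hand_arm_decoder"
    # Default context: navigating the environment.
    return "full_body_locomotion_decoder"

print(active_decoder(np.array([0.5, 0.0, 0.3]), np.array([0.6, 0.0, 0.4]), True))
```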


In VR, any 3rd party applications the User wishes to use must expose this contextual information for this decoding strategy to succeed. To our delight, other virtual-world companies increasingly desire this scene- or context-understanding data, motivating application developers to expose this information as standard. For physical space understanding, Augmented or Mixed Reality hardware vendors such as Microsoft (XR_MSFT_scene_understanding) and Meta (XR_FB_scene) have extended the OpenXR Spaces specification, which tracks objects in the physical world around the user, with additional object recognition capabilities that VR applications can leverage at runtime. Should 3rd party VR applications not expose virtual contexts, one can always record the rendered frames and use a custom computer vision model to derive context. Custom (1st party) VR applications such as the uCat Client App (v0) can easily expose such virtual scene metadata.


So far: UI rules are fast and reliable but not always accessible, while behavioral cues are fast and accessible but less reliable.


As a fallback, continuous attempted speech decoding represents a direct channel for deciphering User intent, serving as a flexible and reliable, but slow, control when behavioral cues are ambiguous and UI rules are inaccessible. The ability to detect and interpret commands articulated internally by the User can provide a clear indication of their desired action, bypassing the complexities of physical movement, gaze direction, and GUI interpretation mechanisms. We implemented this mechanic in a proof-of-concept VR application introduced in the following section, Integrating VR with Motor BCI.
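The arbitration this paragraph describes could be sketched as a simple priority cascade. The function below only illustrates the ordering (UI rules, then behavioral cues, then attempted speech); it is not our actual control logic.

```python
def resolve_context(ui_rule_context, behavioral_context, speech_command):
    """Priority cascade over the three context sources discussed above."""
    if ui_rule_context is not None:      # fast and reliable, when the app exposes it
        return ui_rule_context
    if behavioral_context is not None:   # fast and accessible, but less reliable
        return behavioral_context
    if speech_command is not None:       # slow but flexible and reliable fallback
        return speech_command
    return "default"                     # no context recovered

print(resolve_context(None, None, "open keyboard"))  # -> "open keyboard"
```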


Lastly, Sequence Learning can help integrate insights gained from any or all of the above methods to anticipate User actions. By understanding the typical sequence of tasks the User engages in, a Motor BCI could predict subsequent actions with greater accuracy. This anticipatory capability, grounded in the User’s habitual patterns, can significantly enhance the predictive power of Motor BCIs, making interactions smoother and more intuitive.
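As a minimal sketch of this idea, a first-order transition model over task contexts captures habitual orderings; a recurrent or transformer sequence model could play the same role. The context labels here are hypothetical.

```python
from collections import Counter, defaultdict

class ContextTransitionModel:
    """First-order transition model over task-context labels."""
    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, history):
        """history: task-context labels in the order they occurred."""
        for prev, nxt in zip(history, history[1:]):
            self.counts[prev][nxt] += 1

    def predict(self, current):
        """Return the most likely next context, or None if unseen."""
        nxt = self.counts.get(current)
        return nxt.most_common(1)[0][0] if nxt else None

# Usage: after habitual "sit" the model anticipates "typing".
model = ContextTransitionModel()
model.update(["approach_desk", "sit", "typing", "stand", "walk"])
print(model.predict("sit"))  # -> typing
```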


Together, these approaches offer a clear strategy to overcome the challenges of signal entanglement in Motor BCIs, promising a future where Users' intentions are understood with greater clarity and responsiveness, ultimately leading to more adaptive and user-centric Motor BCI systems.


Inter-Subject Generalization

A major barrier to widespread invasive BCI adoption among qualified users is the high cost of training and updating models for each individual. This customization is essential because every User's neural patterns and responses are unique, but it demands substantial time, data, and computational resources, making the technology less accessible.


Transfer Learning, a technique widely used in industries such as natural language processing and computer vision, offers a promising solution to this challenge. Transfer Learning applies knowledge gained from one task to related problems by using the lower layers of a pre-trained model to initialize a new model for a new task. This technique can drastically reduce the need for extensive data collection and model training from scratch, although it still requires significant training time (i.e., on the order of a day or two of data). This approach is particularly beneficial for Motor BCIs, where the underlying neural mechanisms of movement and intention share commonalities across individuals.
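A minimal PyTorch sketch of this Transfer Learning recipe follows, assuming a pretrained decoder whose lower layers capture features shared across Users. The architecture, layer names, and checkpoint path are illustrative assumptions, not any published Motor BCI model.

```python
import torch
import torch.nn as nn

class MotorDecoder(nn.Module):
    """Toy decoder: shared 'lower layers' (feature_extractor) plus a task head."""
    def __init__(self, n_channels=256, hidden=128, n_outputs=2):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Linear(n_channels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, n_outputs)

    def forward(self, x):
        return self.head(self.feature_extractor(x))

pretrained = MotorDecoder()
# pretrained.load_state_dict(torch.load("pretrained_decoder.pt"))  # prior Users' data

new_user_model = MotorDecoder()
new_user_model.feature_extractor.load_state_dict(pretrained.feature_extractor.state_dict())
for p in new_user_model.feature_extractor.parameters():
    p.requires_grad = False  # freeze the shared lower layers

# Only the task head is optimized on the new User's (much smaller) dataset.
optimizer = torch.optim.Adam(new_user_model.head.parameters(), lr=1e-3)
```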


In several domains, models are now pre-trained on vast datasets, allowing for rapid fine-tuning with relatively few new examples. Unlike Transfer Learning, fine-tuning trains an unmodified, fully pre-trained model on a small number of examples, sometimes as few as two or three samples (i.e., a couple of calibration exercises' worth of data). For Motor BCIs, adopting pre-trained “few-shot learner” models could significantly lower the barriers to entry, allowing personalized calibration with minimal data and reducing both the time and cost involved in making Motor BCIs accessible to new Users.
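By contrast, a few-shot fine-tuning pass updates the whole pretrained model on a handful of calibration trials, typically with a small learning rate and few epochs. This sketch reuses the hypothetical MotorDecoder from the Transfer Learning example and random stand-in calibration data.

```python
import torch

model = MotorDecoder()  # assume weights loaded from a pretrained checkpoint
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small LR, all layers trainable
loss_fn = torch.nn.CrossEntropyLoss()

calib_x = torch.randn(3, 256)       # e.g., three calibration trials (stand-in data)
calib_y = torch.tensor([0, 1, 0])   # stand-in labels

for _ in range(20):                 # a few passes over the tiny calibration set
    optimizer.zero_grad()
    loss = loss_fn(model(calib_x), calib_y)
    loss.backward()
    optimizer.step()
```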


For instance, Card et al. (2023) fine-tuned the pretrained model of Willett et al. (2023) for another User, implanted with 256 electrodes (four Utah arrays), who was subsequently able to use the speech decoder with greater than 90% word accuracy on only the second day of use (after just two hours of training). On the third day, accuracy rose above 95% with that same 125k-word vocabulary (Vid. 21).



Video 21; g.tec, 2023: Presentation and demo of the BCI Award 2023 1st Place Winner, Card et al. (2023).


Peterson et al. (2021), working with the previously outlined dataset from Singh et al. (2020), used separate encoders for each subject to infer generalized features suitable for Transfer Learning. Their “HTNet” method produced trained weights that relied primarily on physiologically relevant features at low frequencies near the motor cortex. During decoding, HTNet outperformed the state of the art on unseen participants.



Figure 45; Peterson et al., 2021: “Overview of HTNet architecture, experimental design, and electrode locations — (A) HTNet is a convolutional neural network architecture that extends EEGNet [55] (differences shown in yellow) by handling cross-participant variations in electrode placement and frequency content. The temporal convolution and Hilbert transform layers generate data-driven spectral features that can then be projected from electrodes (Elec) onto common regions of interest (ROI) using a predefined weight matrix. (B) Using electrocorticography (ECoG) data, we trained both tailored within-participant and generalized multi-participant models to decode arm movement vs. rest. Multi-participant decoders were tested separately on held-out data from unseen participants recorded with either the same modality as the train set (ECoG) or an unseen modality (EEG). We then fine-tuned these pretrained decoders using data from the test participant. (C) Electrode placement varies widely among the 12 ECoG participants. Electrode coverage is sparser for the 15 EEG participants compared to ECoG, but both modalities overlap in coverage of sensorimotor cortices. Asterisks denote five participants whose electrodes were mirrored from the right hemisphere.” Reprinted from Figure 1.
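The figure's core pipeline can be approximated in a few lines: a temporal convolution, a Hilbert-transform amplitude envelope, and a projection from electrodes onto common ROIs via a predefined weight matrix. The shapes, kernel sizes, and random projection matrix below are stand-ins, not Peterson et al.'s actual configuration.

```python
import torch
import torch.nn as nn

def hilbert_envelope(x):
    """Amplitude envelope via the analytic signal (FFT-based Hilbert transform)."""
    n = x.shape[-1]
    Xf = torch.fft.fft(x, dim=-1)
    h = torch.zeros(n, dtype=Xf.dtype, device=x.device)
    h[0] = 1
    h[1:(n + 1) // 2] = 2
    if n % 2 == 0:
        h[n // 2] = 1
    return torch.fft.ifft(Xf * h, dim=-1).abs()

batch, n_elec, n_roi, t = 4, 64, 8, 500
x = torch.randn(batch, 1, n_elec, t)              # (batch, 1, electrodes, time)

temporal_conv = nn.Conv2d(1, 8, kernel_size=(1, 65), padding=(0, 32))
proj = torch.rand(n_roi, n_elec)                  # predefined electrode-to-ROI weights
proj = proj / proj.sum(dim=1, keepdim=True)       # normalize each ROI's weights

feats = hilbert_envelope(temporal_conv(x))        # data-driven spectral features
roi_feats = torch.einsum("bfet,re->bfrt", feats, proj)  # project electrodes onto ROIs
print(roi_feats.shape)                            # torch.Size([4, 8, 8, 500])
```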



HTNet and other convolutional neural network encoders (Lomtev et al., 2023; Petrosyan et al., 2021; Joo et al., 2023) do not prioritize modeling population dynamics, so their ability to account for the intrinsic and input dynamics of the motor cortex falls far short of what a clinical Motor BCI system needs: reliably disentangling the exact movement-generating dynamics from those induced by other sources of nonlinearity. Combining large pre-trained datasets with models that disentangle contextual neural dynamics (see Disentangling Contextual Neural Dynamics) may offer a new breed of ready-to-go decoders.

Such decoders, underpinned by foundation models trained on broad data at scale to acquire a wide range of capabilities, hold great promise for Motor BCIs. These models have the potential to support inference across different Users without the need for fine-tuning for each individual. Such models could learn the general principles of neural activity related to motor functions and then apply this understanding to decode intentions across a diverse user base. This advancement would represent a paradigm shift in BCI technology, making it more plug-and-play and significantly lowering the threshold for adoption.


While far from ideal, the first generation of such models is coming to light. For instance, the Neural Data Transformer 2 by Ye et al. (2023) and the POYO-1 Transformer by Azabou et al. (2023) have established the first multi-million-parameter transformer models of neural activity spanning many modalities and subjects.


Integrating Transfer Learning, few-shot learning, and foundation models into the Motor BCI domain could thus catalyze a transformative change, making these advanced technologies more accessible, affordable, and adaptable to a diverse array of users. By leveraging the advancements from other fields, Motor BCIs can become more user-friendly and inclusive, paving the way for their integration into everyday life.


Loudmouths (v0)


uCat is proud to support a thriving open-source community that focuses on developing high-performance neural decoders.


Our project “Loudmouths,” soon to be publicly open-sourced, is focused on developing a high-performance neural speech decoder. We are training our model on the dataset from Willett et al. (2023) and we designed our architecture to address the areas for improvement identified in the Speech and Facial Movements section.


Our model leverages PyTorch Modules adapted from ResNet and Whisper. We account for affine spatial equivariances in the time series by applying 2D convolutions to each of the electrode grids, allowing spatial relationships to be elucidated more easily than with a flattened intake block.
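A rough sketch of what per-grid 2D convolution looks like in PyTorch is shown below. The grid dimensions, channel counts, and class name are assumptions for illustration, not the actual Loudmouths architecture.

```python
import torch
import torch.nn as nn

class GridFrameEncoder(nn.Module):
    """Encode one time frame by convolving each electrode grid as a 1-channel image."""
    def __init__(self, n_grids=4, grid_h=8, grid_w=8, emb_dim=128):
        super().__init__()
        # Small convolutional stack shared across grids; preserves neighbor relations
        # between electrodes instead of flattening the grid into a vector.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32 * n_grids, emb_dim)

    def forward(self, frames):
        # frames: (batch, n_grids, grid_h, grid_w) — one time frame per sample
        b, g, h, w = frames.shape
        per_grid = self.conv(frames.reshape(b * g, 1, h, w)).reshape(b, g * 32)
        return self.proj(per_grid)

encoder = GridFrameEncoder()
print(encoder(torch.randn(2, 4, 8, 8)).shape)  # torch.Size([2, 128])
```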


To account for representational drift, we pre-train the frame encoder using contrastive learning with simulated drift. We further stabilize the representation with a second contrastive objective that extracts intrinsic temporal manifold embeddings before sequence learning.
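The drift-robust pre-training idea can be sketched as an InfoNCE-style objective between clean and drift-augmented views of the same frame. The drift model (per-channel gain and offset), temperature, and stand-in encoder below are illustrative assumptions, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def simulate_drift(x, gain_std=0.1, offset_std=0.05):
    """Apply random per-channel gain and offset, mimicking recording drift."""
    gain = 1.0 + gain_std * torch.randn(1, x.shape[1])
    offset = offset_std * torch.randn(1, x.shape[1])
    return x * gain + offset

def drift_contrastive_loss(encoder, frames, temperature=0.1):
    """InfoNCE between clean and drift-augmented views of the same frames."""
    z1 = F.normalize(encoder(frames), dim=-1)
    z2 = F.normalize(encoder(simulate_drift(frames)), dim=-1)
    logits = z1 @ z2.T / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(frames.shape[0])     # matching views lie on the diagonal
    return F.cross_entropy(logits, targets)

# Usage with any frame encoder mapping (batch, channels) -> (batch, emb_dim):
encoder = torch.nn.Linear(256, 64)
loss = drift_contrastive_loss(encoder, torch.randn(32, 256))
loss.backward()
```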


During initial small training experiments, we obtained promising results — observing convergence on all objectives with only 50 to 100 sentence samples — and we are deploying our model on hardware that can handle complete training runs.


To facilitate rapid iteration and experimentation, we have implemented the aforementioned as part of a Dynamic Modeling framework where model components can be easily changed. We have members researching the efficacy of Bayesian Neural Networks as an alternative to our initial design.


More information about Loudmouths (v0) will be revealed in future revisions and upcoming publications.


Our upcoming project will build on the current one to implement a Motor BCI-to-audio reconstruction with the same dataset. Moreover, we will explore the potential of sequence learning transformer models to generalize to multimodal neural datasets as a first step in building a framework for training neural foundation models.


 

Part 11 of a series of unedited excerpts from uCat: Transcend the Limits of Body, Time, and Space by Sam Hosovsky, Oliver Shetler*, Luke Turner, and Cai Kinnaird. First published on Feb 29th, 2024, and licensed under CC BY-NC-SA 4.0.



uCat is a community of entrepreneurs, transhumanists, techno-optimists, and many others who recognize the alignment of the technological frontiers described in this work. Join us!


*Oliver was the primary author of this excerpt.


