The Bridge Between Artificial and Biological Vision
In the ongoing quest to unravel the mysteries of both human and machine intelligence, researchers face a fascinating problem: how do you peer inside the "mind" of an artificial neural network? As these models have reached near-human performance on object recognition, they have become invaluable tools for neuroscientists and cognitive psychologists seeking to understand our own visual system [1]. The challenge, however, has been accessing the complex patterns of activation, the "thoughts" of the network, that arise in response to images. THINGSvision is a Python toolbox designed to solve exactly this problem, acting as a universal remote control for deep neural networks and streamlining the extraction of these digital brainwaves [2, 3].
To appreciate what THINGSvision does, it's essential to understand what researchers mean by "activations" or "features."
A deep neural network is composed of layers, or "modules," each responsible for detecting increasingly complex features in an image [1]. The initial layers might respond to simple edges or colors, while deeper layers activate in response to intricate patterns like faces or entire objects.
The activation pattern of a specific layer is a unique numerical representation of an image from the network's perspective. It is a dense mathematical signature that captures the essence of the image as the network understands it [2].
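
To make "activation pattern" concrete, here is a short, generic PyTorch sketch (deliberately not using THINGSvision) that registers a forward hook on one layer of a pretrained AlexNet and captures that layer's response to a single image; the image path is a placeholder.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Load a pretrained AlexNet and switch it to evaluation mode.
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

# The standard ImageNet preprocessing this model expects.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Capture the output of one intermediate layer via a forward hook.
activations = {}

def hook(module, inputs, output):
    activations["features.10"] = output.detach()

model.features[10].register_forward_hook(hook)  # last convolutional layer

# "image.jpg" is a placeholder; any RGB image works.
image = preprocess(Image.open("image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(image)

# The flattened tensor is the image's numerical signature at this layer.
print(activations["features.10"].flatten().shape)
```

Doing this by hand for every model, layer, and image is exactly the boilerplate that THINGSvision removes.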
Researchers in computational neuroscience have found that these activation patterns can be surprisingly similar to the neural activity recorded from the primate brain in response to the same image [1]. This discovery has made feature extraction a cornerstone of modern AI research, but the process has historically been fraught with complexity.
Before tools like THINGSvision, extracting features was a manual, error-prone, and model-specific task.
For researchers without deep programming expertise, ensuring correct image preprocessing, proper layer selection, and accurate alignment of images with their corresponding activations was non-trivial [2]. This complexity invited errors and hindered the adoption of deep neural networks (DNNs) in interdisciplinary fields like cognitive science.
THINGSvision was born from the need to close this gap. It provides a simple, unified interface for extracting layer activations from a vast collection of models, making this powerful analysis accessible to users with little to no programming experience while also benefiting computer scientists through its efficiency and reliability [1, 3].
To illustrate its utility, let's explore a typical experiment powered by THINGSvision, designed to test the correspondence between artificial and biological vision.
The experimental procedure is elegantly straightforward, requiring just a few lines of code [2, 6]:

1. Define the key variables: the path to the images, the model of choice, the output path for the extracted features, and the computational device [1].
2. THINGSvision automatically preprocesses the images and feeds them through the network, extracting an activation pattern for every image [2] (see the code sketch after this list).
3. Using Representational Similarity Analysis (RSA), compare the Representational Dissimilarity Matrices (RDMs) of the different systems, for example a network layer and a brain region [1].
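
In code, this workflow might look like the minimal sketch below. It assumes the extractor-based interface described in the THINGSvision documentation; names such as `get_extractor`, `ImageDataset`, and `extract_features` may differ between releases, and the paths, batch size, and module name are placeholders.

```python
import torch

from thingsvision import get_extractor
from thingsvision.utils.data import DataLoader, ImageDataset
from thingsvision.utils.storing import save_features

# Step 1: define the key variables (all paths are placeholders).
image_path = "path/to/images"
out_path = "path/to/features"
model_name = "alexnet"
module_name = "features.10"  # the layer whose activations we want
device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 2: load a pretrained model; the toolbox supplies the matching preprocessing.
extractor = get_extractor(
    model_name=model_name,
    source="torchvision",
    device=device,
    pretrained=True,
)
dataset = ImageDataset(
    root=image_path,
    out_path=out_path,
    backend=extractor.get_backend(),
    transforms=extractor.get_transformations(),
)
batches = DataLoader(dataset=dataset, batch_size=32, backend=extractor.get_backend())

# Extract one flattened activation vector per image and store them to disk.
features = extractor.extract_features(
    batches=batches,
    module_name=module_name,
    flatten_acts=True,
)
save_features(features, out_path=out_path, file_format="npy")
```

The saved feature matrix, one row per image, is the input to the RSA comparison in step 3.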
The outcome of such an experiment is a direct, quantitative measure of the alignment between the artificial model and biological vision; the illustrative values below show what such results typically look like:
| Model | Brain Region | Similarity Score (r) |
|---|---|---|
| CORnet-S | Inferior Temporal (IT) Cortex | 0.78 |
| AlexNet | Inferior Temporal (IT) Cortex | 0.45 |
| CLIP (ViT) | Inferior Temporal (IT) Cortex | 0.72 |
| Randomly Initialized Model | Inferior Temporal (IT) Cortex | 0.15 |

A layer-by-layer comparison for a brain-inspired model such as CORnet-S, whose modules are named after the visual areas they are meant to model, pairs each network layer with its biological counterpart:

| Network Layer | Corresponding Primate Brain Area | Similarity Score (r) |
|---|---|---|
| V1 | Primary Visual Cortex (V1) | 0.82 |
| V2 | Secondary Visual Cortex (V2) | 0.79 |
| V4 | Visual Area V4 | 0.75 |
| IT | Inferior Temporal (IT) Cortex | 0.78 |
The data would typically show that deep, high-level layers in models like CORnet-S have representations that closely mirror those in high-level visual areas of the primate brain, such as the inferior temporal (IT) cortex [1]. This is a profound finding, suggesting that the artificial network has learned to process visual information in a way that is functionally analogous to our own visual system. Furthermore, as the first table shows, models with pretrained weights (which have "learned" from data) align with the brain far more strongly than randomly initialized models, highlighting the role of learning in developing brain-like representations [2].
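
To see what such a similarity score boils down to, here is a minimal RSA recipe in plain NumPy/SciPy on made-up data (THINGSvision ships its own RSA helpers, listed in the table below; this sketch only illustrates the computation): build one Representational Dissimilarity Matrix per system, then correlate the two matrices' upper triangles.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy stand-ins: rows are images, columns are units or voxels (made-up data).
model_features = rng.normal(size=(50, 4096))   # e.g. one DNN layer
brain_responses = rng.normal(size=(50, 200))   # e.g. recordings from IT cortex

def rdm(features: np.ndarray) -> np.ndarray:
    """Representational Dissimilarity Matrix: 1 - Pearson r for every image pair."""
    return squareform(pdist(features, metric="correlation"))

rdm_model = rdm(model_features)
rdm_brain = rdm(brain_responses)

# Compare only the upper triangles; the matrices are symmetric with zero diagonals.
upper = np.triu_indices_from(rdm_model, k=1)
rho, _ = spearmanr(rdm_model[upper], rdm_brain[upper])
print(f"model-brain representational similarity: rho = {rho:.2f}")
```

With real activations and real neural recordings in place of the random arrays, rho plays the role of the similarity scores in the tables above.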
THINGSvision democratizes access to a powerful suite of resources. The following table details the key resources it provides, the "reagents" for the experimental study of deep neural networks.
| Tool / Resource | Function in the "Experiment" |
|---|---|
| Model Zoo | Provides a vast library of pretrained models (AlexNet, ResNet, CLIP, CORnet), saving researchers the immense time and computational cost of training their own [1, 3]. |
| Standardized Preprocessing | Automatically handles the specific image transformations (resizing, cropping, normalization) required by each model, eliminating a major source of error [3, 6]. |
| Module/Layer Selector | Allows precise targeting of any layer within a network for activation extraction, from simple edge detectors in early layers to complex object detectors in final layers [1]. |
| Backend Flexibility | Works seamlessly with both PyTorch and TensorFlow, the two leading deep-learning frameworks, offering flexibility regardless of a researcher's preference [2]. |
| RSA & CKA Integration | Includes built-in functions for Representational Similarity Analysis and Centered Kernel Alignment, the key statistical methods for comparing representations across systems [3, 6] (a brief CKA sketch follows this table). |
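
As a companion to RSA, linear Centered Kernel Alignment (CKA) compares two feature matrices directly. The sketch below implements the standard linear-CKA formula on made-up data; THINGSvision provides its own CKA routines, so this is only meant to show what the comparison computes.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with one row per image."""
    # Center each feature dimension (column) across images.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(y.T @ x, ord="fro") ** 2
    denominator = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    return float(numerator / denominator)

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(50, 512))             # made-up activations for 50 images
q, _ = np.linalg.qr(rng.normal(size=(512, 512)))
layer_b = layer_a @ q                            # the same representation, rotated

# Linear CKA ignores rotations of the feature space, so this prints 1.00.
print(f"CKA(layer_a, rotated copy) = {linear_cka(layer_a, layer_b):.2f}")
```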
By simplifying the technical hurdles, THINGSvision does more than just save time: it promotes reproducibility and rigorous science. Its well-documented, standardized framework ensures that different research groups can easily replicate and build upon each other's work [1]. This is crucial for a field progressing as rapidly as AI and computational neuroscience.
The toolbox continues to evolve, incorporating state-of-the-art models and new analysis techniques. Its ability to handle multimodal models like CLIP, which can understand both images and text, opens up new frontiers for studying the relationship between language and vision in both machines and humans [1, 2].
THINGSvision connects the fields of artificial intelligence, neuroscience, and psychology, allowing us to ask and answer fundamental questions about intelligence in both silicon and biology. By giving us a standardized lens through which to view the inner workings of AI, it helps us not only to build better machines but also to better understand ourselves.