The Bridge Between Artificial and Biological Vision
In the ongoing quest to unravel the mysteries of both human and machine intelligence, researchers face a fascinating problem: how do you peer inside the "mind" of an artificial neural network? As these models have reached near-human performance on object recognition, they have become invaluable tools for neuroscientists and cognitive psychologists seeking to understand our own visual system [1]. The challenge, however, has been accessing the complex patterns of activation, the "thoughts" of the network, that arise in response to images. THINGSvision is a Python toolbox designed to solve exactly this problem, acting as a universal remote control for deep neural networks and streamlining the extraction of these digital brainwaves [2, 3].
To appreciate what THINGSvision does, it's essential to understand what researchers mean by "activations" or "features."
A deep neural network is composed of layers, or "modules," each responsible for detecting increasingly complex features in an image [1]. The initial layers might respond to simple edges or colors, while deeper layers activate in response to intricate patterns like faces or entire objects.
The activation pattern of a specific layer is a unique numerical representation of an image from the network's perspective. It is a dense mathematical signature that captures the essence of the image as the network understands it [2].
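
To make "activation pattern" concrete, here is a short, generic PyTorch sketch (deliberately not using THINGSvision) that registers a forward hook on one layer of a pretrained AlexNet and captures that layer's response to a single image; the image path is a placeholder.

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Load a pretrained AlexNet and switch it to evaluation mode.
model = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()

# The standard ImageNet preprocessing this model expects.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Capture the output of one intermediate layer via a forward hook.
activations = {}

def hook(module, inputs, output):
    activations["features.10"] = output.detach()

model.features[10].register_forward_hook(hook)  # last convolutional layer

# "image.jpg" is a placeholder; any RGB image works.
image = preprocess(Image.open("image.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    model(image)

# The flattened tensor is the image's numerical signature at this layer.
print(activations["features.10"].flatten().shape)
```

Doing this by hand for every model, layer, and image is exactly the boilerplate that THINGSvision removes.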
Researchers in computational neuroscience have found that these activation patterns can be surprisingly similar to the neural activity recorded from the primate brain in response to the same image [1]. This discovery has made feature extraction a cornerstone of modern AI research, but the process has historically been fraught with complexity.
Before tools like THINGSvision, extracting features was a manual, error-prone, and model-specific task.
For researchers without deep programming expertise, ensuring correct image preprocessing, proper layer selection, and accurate alignment of images with their corresponding activations was non-trivial [2]. This complexity invited errors and hindered the adoption of deep neural networks (DNNs) in interdisciplinary fields like cognitive science.
THINGSvision was born from the need to close this gap. It provides a simple, unified interface for extracting layer activations from a vast collection of models, making this powerful analysis accessible to users with little to no programming experience while also benefiting computer scientists through its efficiency and reliability [1, 3].
To illustrate its utility, let's explore a typical experiment powered by THINGSvision, designed to test the correspondence between artificial and biological vision.
The experimental procedure is elegantly straightforward, requiring just a few lines of code [2, 6]:

1. Define the key variables: the path to the images, the model of choice, the output path for the extracted features, and the computational device [1].
2. THINGSvision automatically preprocesses the images and feeds them through the network, extracting an activation pattern for every image [2] (see the code sketch after this list).
3. Using Representational Similarity Analysis (RSA), compare the Representational Dissimilarity Matrices (RDMs) of the different systems, for example a network layer and a brain region [1].
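
In code, this workflow might look like the minimal sketch below. It assumes the extractor-based interface described in the THINGSvision documentation; names such as `get_extractor`, `ImageDataset`, and `extract_features` may differ between releases, and the paths, batch size, and module name are placeholders.

```python
import torch

from thingsvision import get_extractor
from thingsvision.utils.data import DataLoader, ImageDataset
from thingsvision.utils.storing import save_features

# Step 1: define the key variables (all paths are placeholders).
image_path = "path/to/images"
out_path = "path/to/features"
model_name = "alexnet"
module_name = "features.10"  # the layer whose activations we want
device = "cuda" if torch.cuda.is_available() else "cpu"

# Step 2: load a pretrained model; the toolbox supplies the matching preprocessing.
extractor = get_extractor(
    model_name=model_name,
    source="torchvision",
    device=device,
    pretrained=True,
)
dataset = ImageDataset(
    root=image_path,
    out_path=out_path,
    backend=extractor.get_backend(),
    transforms=extractor.get_transformations(),
)
batches = DataLoader(dataset=dataset, batch_size=32, backend=extractor.get_backend())

# Extract one flattened activation vector per image and store them to disk.
features = extractor.extract_features(
    batches=batches,
    module_name=module_name,
    flatten_acts=True,
)
save_features(features, out_path=out_path, file_format="npy")
```

The saved feature matrix, one row per image, is the input to the RSA comparison in step 3.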
The outcome of such an experiment is a direct, quantitative measure of the alignment between the artificial model and biological vision; the illustrative values below show what such results typically look like:
| Model | Brain Region | Similarity Score (r) |
|---|---|---|
| CORnet-S | Inferior Temporal (IT) Cortex | 0.78 |
| AlexNet | Inferior Temporal (IT) Cortex | 0.45 |
| CLIP (ViT) | Inferior Temporal (IT) Cortex | 0.72 |
| Randomly Initialized Model | Inferior Temporal (IT) Cortex | 0.15 |

A layer-by-layer comparison for a brain-inspired model such as CORnet-S, whose modules are named after the visual areas they are meant to model, pairs each network layer with its biological counterpart:

| Network Layer | Corresponding Primate Brain Area | Similarity Score (r) |
|---|---|---|
| V1 | Primary Visual Cortex (V1) | 0.82 |
| V2 | Secondary Visual Cortex (V2) | 0.79 |
| V4 | Visual Area V4 | 0.75 |
| IT | Inferior Temporal (IT) Cortex | 0.78 |
The data would typically show that deep, high-level layers in models like CORnet-S have representations that closely mirror those in high-level visual areas of the primate brain, such as the inferior temporal (IT) cortex [1]. This is a profound finding, suggesting that the artificial network has learned to process visual information in a way that is functionally analogous to our own visual system. Furthermore, as the first table shows, models with pretrained weights (which have "learned" from data) align with the brain far more strongly than randomly initialized models, highlighting the role of learning in developing brain-like representations [2].
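
To see what such a similarity score boils down to, here is a minimal RSA recipe in plain NumPy/SciPy on made-up data (THINGSvision ships its own RSA helpers, listed in the table below; this sketch only illustrates the computation): build one Representational Dissimilarity Matrix per system, then correlate the two matrices' upper triangles.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy stand-ins: rows are images, columns are units or voxels (made-up data).
model_features = rng.normal(size=(50, 4096))   # e.g. one DNN layer
brain_responses = rng.normal(size=(50, 200))   # e.g. recordings from IT cortex

def rdm(features: np.ndarray) -> np.ndarray:
    """Representational Dissimilarity Matrix: 1 - Pearson r for every image pair."""
    return squareform(pdist(features, metric="correlation"))

rdm_model = rdm(model_features)
rdm_brain = rdm(brain_responses)

# Compare only the upper triangles; the matrices are symmetric with zero diagonals.
upper = np.triu_indices_from(rdm_model, k=1)
rho, _ = spearmanr(rdm_model[upper], rdm_brain[upper])
print(f"model-brain representational similarity: rho = {rho:.2f}")
```

With real activations and real neural recordings in place of the random arrays, rho plays the role of the similarity scores in the tables above.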
THINGSvision democratizes access to a powerful suite of resources. The following table details the key resources it provides, the "reagents" for the experimental study of deep neural networks.
| Tool / Resource | Function in the "Experiment" |
|---|---|
| Model Zoo | Provides a vast library of pretrained models (AlexNet, ResNet, CLIP, CORnet), saving researchers the immense time and computational cost of training their own [1, 3]. |
| Standardized Preprocessing | Automatically handles the specific image transformations (resizing, cropping, normalization) required by each model, eliminating a major source of error [3, 6]. |
| Module/Layer Selector | Allows precise targeting of any layer within a network for activation extraction, from simple edge detectors in early layers to complex object detectors in final layers [1]. |
| Backend Flexibility | Works seamlessly with both PyTorch and TensorFlow, the two leading deep-learning frameworks, offering flexibility regardless of a researcher's preference [2]. |
| RSA & CKA Integration | Includes built-in functions for Representational Similarity Analysis and Centered Kernel Alignment, the key statistical methods for comparing representations across systems [3, 6] (a brief CKA sketch follows this table). |
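
As a companion to RSA, linear Centered Kernel Alignment (CKA) compares two feature matrices directly. The sketch below implements the standard linear-CKA formula on made-up data; THINGSvision provides its own CKA routines, so this is only meant to show what the comparison computes.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices with one row per image."""
    # Center each feature dimension (column) across images.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    numerator = np.linalg.norm(y.T @ x, ord="fro") ** 2
    denominator = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    return float(numerator / denominator)

rng = np.random.default_rng(0)
layer_a = rng.normal(size=(50, 512))             # made-up activations for 50 images
q, _ = np.linalg.qr(rng.normal(size=(512, 512)))
layer_b = layer_a @ q                            # the same representation, rotated

# Linear CKA ignores rotations of the feature space, so this prints 1.00.
print(f"CKA(layer_a, rotated copy) = {linear_cka(layer_a, layer_b):.2f}")
```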
By simplifying the technical hurdles, THINGSvision does more than just save time: it promotes reproducibility and rigorous science. Its well-documented, standardized framework ensures that different research groups can easily replicate and build upon each other's work [1]. This is crucial for a field progressing as rapidly as AI and computational neuroscience.
The toolbox continues to evolve, incorporating state-of-the-art models and new analysis techniques. Its ability to handle multimodal models like CLIP, which can understand both images and text, opens up new frontiers for studying the relationship between language and vision in both machines and humans [1, 2].
THINGSvision connects the fields of artificial intelligence, neuroscience, and psychology, allowing us to ask and answer fundamental questions about intelligence in both silicon and biology. By giving us a standardized lens through which to view the inner workings of AI, it helps us not only to build better machines but also to better understand ourselves.