Why Scientists Are Opening Access to Computational Models of Biological Macromolecules
Imagine a library where most books are locked away, with no card catalog to indicate what exists or where to find them. This isn't a hypothetical scenario; it's the current state of computational models of biological macromolecules, the digital blueprints that scientists use to understand life at its most fundamental level.
Computational models simulate everything from protein folding and drug binding to complex cellular processes, yet they remain largely inaccessible, hidden on lab servers and in unpublished datasets.
The movement to make these computational models available and discoverable represents a quiet revolution in biological research that could accelerate the pace of scientific discovery.
"It was really exciting to come together to push forward the capabilities available to humanity." 5
Biological macromolecules—proteins, DNA, RNA, and complex molecular machines—are the fundamental components that perform the work of life.
Computational models serve as virtual laboratories where scientists can test hypotheses about these molecules' behavior without the time and expense of wet-lab experiments.
These models range from atomic-level simulations that track the movement of individual atoms to coarse-grained models that simplify molecular structures to focus on specific properties.
Molecular Snapshots
[Figure: molecular snapshots in leading databases like Open Molecules 2025]
Traditional experimental methods like X-ray crystallography provide static snapshots of molecules, but life is anything but static. Computational models capture the dynamic nature of biological molecules, simulating how they move, interact, and change shape over time—critical information for understanding their function.
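To make that concrete, here is a minimal sketch of the kind of update loop that drives atomic-level simulations: a velocity Verlet integrator advancing a toy two-particle "molecule" held together by a harmonic spring. The spring force is a stand-in for a real molecular force field, so this is illustrative rather than production code.

```python
import numpy as np

# Toy sketch of the core loop behind atomic-level simulations: a velocity
# Verlet integrator advancing two particles joined by a harmonic spring.
# The spring is a stand-in for a real molecular force field.

def spring_forces(pos, k=1.0, r0=1.0):
    """Force on each particle from a harmonic bond of rest length r0."""
    d = pos[1] - pos[0]
    r = np.linalg.norm(d)
    f1 = -k * (r - r0) * d / r      # force on particle 1
    return np.array([-f1, f1])

def velocity_verlet_step(pos, vel, dt, mass=1.0):
    """One velocity Verlet update: the same rule real MD engines use."""
    f = spring_forces(pos)
    vel_half = vel + 0.5 * dt * f / mass
    pos = pos + dt * vel_half
    vel = vel_half + 0.5 * dt * spring_forces(pos) / mass
    return pos, vel

pos = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])  # bond stretched past r0
vel = np.zeros_like(pos)
for _ in range(1000):
    pos, vel = velocity_verlet_step(pos, vel, dt=0.01)
print("bond length after 1000 steps:", np.linalg.norm(pos[1] - pos[0]))
```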
Recent advances in artificial intelligence have dramatically enhanced these capabilities. For instance, researchers recently developed "a hybrid AI approach that can generate realistic images with the same or better quality than state-of-the-art diffusion models, but that runs about nine times faster and uses significantly less memory." 1 Although that work targeted image generation, the same kinds of efficiency gains are making complex molecular simulations accessible to researchers without specialized computing resources.
When Digital Blueprints Stay Locked Away
Dr. Elena Rodriguez (a composite of several researchers' experiences) spent six months recreating a computational model that another lab had already built—a model she discovered only after presenting her work at a conference.
The original researchers had developed a sophisticated simulation of a protein implicated in Alzheimer's disease but had only published a summary of their findings, not the model itself. Their paper contained impressive visualizations but none of the underlying code or parameters needed to reproduce and build upon their work.
This scenario plays out daily in labs around the world, creating substantial inefficiencies that slow scientific progress.
The consequences of this accessibility gap are multifaceted:
Without access to existing models, scientists "who want to reuse prior results [are left with] three options: Re-run the analysis if the code and original source data are accessible; Re-do the analysis if only the original source data is accessible; [or] Manually or (pseudo-)automatically extract information from the data products." 3
When the code and data are not accessible, researchers fall into a slow, duplicative workflow:

1. Search for existing models, often finding only partial information.
2. Attempt to recreate models from published descriptions, often with incomplete parameters.
3. Test the recreated models against known results to verify accuracy.
4. Finally begin the new research that was the original goal.
A Blueprint for Change
In May 2025, a collaboration co-led by Meta and the Department of Energy's Lawrence Berkeley National Laboratory released Open Molecules 2025 (OMol25), an unprecedented dataset of molecular simulations that exemplifies the power of open science. 5
This vast resource contains more than 100 million 3D molecular snapshots whose properties have been calculated with density functional theory (DFT), a powerful tool for modeling precise details of atomic interactions.
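What might one of those snapshots look like in practice? Below is a minimal sketch of the kind of record such a dataset could store. The field names and values are assumptions for illustration, not the actual OMol25 schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MolecularSnapshot:
    """One 3D structure plus DFT-computed properties (illustrative fields)."""
    elements: list[str]           # chemical symbol per atom
    positions: np.ndarray         # (n_atoms, 3) Cartesian coordinates, Å
    charge: int                   # total molecular charge
    spin_multiplicity: int        # 2S + 1
    energy: float                 # DFT total energy (e.g., in eV)
    forces: np.ndarray | None = None  # optional (n_atoms, 3) per-atom forces

water = MolecularSnapshot(
    elements=["O", "H", "H"],
    positions=np.array([[0.00, 0.00, 0.00],
                        [0.96, 0.00, 0.00],
                        [-0.24, 0.93, 0.00]]),
    charge=0,
    spin_multiplicity=1,
    energy=-2080.0,               # placeholder, not a real DFT value
)
```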
The scale of this project is difficult to overstate. "OMol25 cost six billion CPU hours, over ten times more than any previous dataset. To put that computational demand in perspective, it would take you over 50 years to run these calculations with 1,000 typical laptops," said project co-lead Samuel Blau. 5
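That comparison is easy to sanity-check with back-of-envelope arithmetic. The eight-cores-per-laptop figure below is an assumption for illustration; the source does not specify it.

```python
# Back-of-envelope check of the quoted "over 50 years" comparison.
cpu_hours = 6e9                    # total compute reported for OMol25
laptops = 1_000
cores_per_laptop = 8               # assumed; not given in the source
hours_per_year = 24 * 365          # running around the clock

years = cpu_hours / (laptops * cores_per_laptop * hours_per_year)
print(f"about {years:.0f} years")  # ~86 years, consistent with "over 50 years"
```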
Building the dataset involved several coordinated steps:

1. Beginning with existing datasets from various chemistry specialties
2. Using Meta's computing infrastructure for sophisticated simulations
3. Identifying major types of chemistry not captured in existing collections
4. Developing thorough evaluations to measure and track model performance
The release of OMol25 has created what researchers call a "calculate once, use many times" ecosystem 3, dramatically reducing duplication of effort. The dataset serves as a training ground for machine learning models that can predict molecular behavior with DFT-level accuracy but thousands of times faster. 5
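In practice, machine-learned potentials are typically exposed through a standard calculator interface, as in the Atomic Simulation Environment (ASE). The sketch below uses ASE's built-in toy EMT potential purely as a stand-in; a potential trained on OMol25-style data would plug into the same pattern.

```python
from ase.build import molecule
from ase.calculators.emt import EMT  # toy potential bundled with ASE
from ase.optimize import BFGS

# EMT stands in here for a machine-learned potential; a model trained on
# OMol25-style data would expose the same Calculator interface.
atoms = molecule("N2")
atoms.calc = EMT()
print("energy before relaxation (eV):", atoms.get_potential_energy())

BFGS(atoms).run(fmax=0.05)  # relax until forces drop below 0.05 eV/Å
print("energy after relaxation (eV):", atoms.get_potential_energy())
```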
How OMol25 compares with earlier open datasets:

| Metric | OMol25 Dataset | Previous Benchmarks |
|---|---|---|
| Number of molecular snapshots | 100+ million | 5-10 million |
| Computational cost | 6 billion CPU hours | ~500 million CPU hours |
| System size | Up to 350 atoms | 20-30 atoms |
| Element coverage | Most of the periodic table, including heavy elements and metals | Handful of well-behaved elements |
| Primary focus areas | Biomolecules, electrolytes, metal complexes | Limited to specific molecule types |
Machine learning models trained on datasets like OMol25 trade a little accuracy for enormous gains in speed:

| Model Type | Accuracy Relative to DFT | Speed vs. DFT | Best Application |
|---|---|---|---|
| Universal baseline model | ~90% | 10,000x faster | General screening |
| Specialized biomolecular | >95% | 5,000x faster | Protein-ligand interactions |
| Electrolyte-focused | ~92% | 8,000x faster | Battery development |
| Metal complex model | ~88% | 6,000x faster | Catalyst design |
Essential Resources for Molecular Modeling
The movement toward open computational models extends beyond just the models themselves to include the entire ecosystem of tools and resources needed to work with them effectively. This includes both computational tools and experimental reagents that bridge the digital and physical realms of biological research.
| Reagent Type | Function | Research Application |
|---|---|---|
| Gene editing and modulation tools 4 | Precisely alter genetic sequences | Experimental validation of predicted genetic interactions |
| Primary and secondary antibodies 4 | Detect specific proteins in complex mixtures | Confirm protein expression and localization predicted by models |
| Cell isolation and culture reagents | Maintain and propagate specific cell types | Provide biological context for molecular simulations |
| Stem cell research products | Generate specialized cell types | Create disease models for testing computational predictions |
| Protein isolation kits | Purify proteins from cellular environments | Obtain samples for experimental structure determination |
These experimental tools are essential for what computational biologists call the "independent approach" to combining computation and experimentation, where "experimental and computational protocols are performed independently, and then the results of both methods are compared." 8
Even as computational models become more sophisticated, they remain firmly rooted in experimental validation. The integration of computational predictions with experimental verification creates a powerful feedback loop that improves both approaches.
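One simple form of that comparison is quantitative: line predicted values up against measured ones and compute the error. A minimal sketch, using entirely hypothetical numbers for illustration:

```python
import numpy as np

# Hypothetical numbers for illustration: predicted vs. experimentally
# measured binding free energies (kcal/mol) for five compounds.
predicted = np.array([-7.2, -8.1, -6.5, -9.0, -7.8])
measured = np.array([-6.9, -8.4, -6.0, -9.3, -7.5])

rmse = np.sqrt(np.mean((predicted - measured) ** 2))
r = np.corrcoef(predicted, measured)[0, 1]
print(f"RMSE: {rmse:.2f} kcal/mol, Pearson r: {r:.2f}")
```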
Solutions for Sharing Digital Blueprints
The solution to the accessibility problem requires both cultural and technical changes. On the technical side, researchers have proposed an analysis results data model (ARDM) that reframes "the target of analyses from static representations of the results (e.g., tables and figures) to a data model with applications in various contexts, including knowledge discovery." 3
This approach aligns with the FAIR principles—making data Findable, Accessible, Interoperable, and Reusable—that have become the gold standard for scientific data management.
- Findable: models are indexed with rich metadata and persistent identifiers so others can locate them.
- Accessible: models can be retrieved through standard, open protocols.
- Interoperable: models use shared formats and vocabularies so different tools can exchange them.
- Reusable: models carry clear licenses and provenance so others can build on them.
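Concretely, much of FAIR comes down to shipping a model alongside a structured metadata record. A minimal, illustrative example follows; every field name and value here is an assumption, not a requirement of any real repository.

```python
# Illustrative metadata record for a shared model. All fields and values
# are assumptions for the sake of example.
model_record = {
    # Findable: persistent identifier plus searchable description
    "doi": "10.5281/zenodo.0000000",   # placeholder DOI
    "title": "Coarse-grained model of amyloid-beta aggregation",
    "keywords": ["protein aggregation", "coarse-grained", "Alzheimer's"],
    # Accessible: where and how to retrieve it
    "download_url": "https://example.org/models/abeta-cg.tar.gz",
    # Interoperable: standard formats and declared dependencies
    "formats": ["PDB", "GROMACS topology"],
    "software": {"gromacs": ">=2023"},
    # Reusable: license and provenance
    "license": "CC-BY-4.0",
    "documentation": "README with parameters and validation results",
}
```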
Implementing these solutions requires both infrastructure and education. The ARDM framework, for instance, can be implemented using various programming languages and databases, making it adaptable to different research contexts. The proof of concept for this approach utilized "the R programming language and a relational SQLite database" 3, common tools in computational biology.
Typical building blocks include the following; a minimal sketch follows the list.

- R, Python, and other common computational tools
- SQLite, PostgreSQL, and other database systems
- Comprehensive metadata and process documentation
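Here is a minimal sketch of the ARDM idea: storing analysis results as queryable records rather than static figures. The cited proof of concept used R with SQLite; 3 the same pattern in Python's built-in sqlite3 module might look like this, where the schema is an assumption rather than the paper's actual design.

```python
import sqlite3

# Sketch of an analysis-results store: structured, queryable records
# instead of static tables and figures. Schema is illustrative only.
con = sqlite3.connect("results.db")
con.execute("""CREATE TABLE IF NOT EXISTS analysis_results (
    analysis_id TEXT,
    quantity    TEXT,   -- what was computed, e.g. 'binding_energy'
    value       REAL,
    units       TEXT,
    method      TEXT,   -- provenance: how the value was produced
    source_data TEXT    -- pointer to the raw inputs
)""")
con.execute(
    "INSERT INTO analysis_results VALUES (?, ?, ?, ?, ?, ?)",
    ("run-042", "binding_energy", -7.8, "kcal/mol",
     "DFT (hypothetical)", "inputs/run-042/"),
)
con.commit()

# Later studies can query the stored result instead of re-running the analysis.
for row in con.execute("SELECT quantity, value, units FROM analysis_results"):
    print(row)
```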
The movement to make computational models of biological macromolecules available and discoverable represents more than just a technical improvement to how scientists share their work—it's a fundamental shift toward more collaborative, efficient, and cumulative science.
Just as the invention of the printing press democratized access to knowledge, systematic sharing of computational models democratizes access to scientific discovery itself.
The benefits extend beyond academic science to patients waiting for new therapies, communities addressing environmental challenges, and societies seeking to understand the fundamental processes of life. As the OMol25 team demonstrates, when researchers come together to share resources, they create something far more valuable than any single lab could produce alone.
"I think it's going to revolutionize how people do atomistic simulations for chemistry." 5
This revolution isn't just about doing faster simulations—it's about building a comprehensive digital library of life where every discovery becomes a foundation for future breakthroughs, rather than a hidden treasure waiting to be rediscovered.