Flywheel Delivers Reproducibility

Flywheel is committed to supporting reproducible research computations.  We make many software design decisions guided by this commitment. This document explains some key reproducibility challenges and our decisions. 

Reproducibility challenges

Flywheel’s scientific advisory board member, Victoria Stodden, writes that reproducible research must enable people to check each other's work. In simpler times, research articles could provide enough information so that scientists skilled in the art could check published results by repeating the experiments and computations. But the increased complexity of modern research and software makes the methods section of a published article insufficient to support such checking. The recognition of this problem has motivated the development of many tools.

Reproducibility and data

A first requirement of reproducibility is a clear and well-defined system for sharing data and critical metadata. Data management tools are a strength of the Flywheel software. The tools go far beyond file formats and directory trees, advancing data management for reproducible research and the FAIR principles.

Through experience working with many labs, Flywheel recognized the limitations of existing tools and identified new technologies that might help. Many customers wanted to begin managing data the moment they were acquired rather than waiting until they were ready to upload fully analyzed results. Flywheel built tools that acquire data directly from imaging instruments - from the scanner to the database. At some MRI sites, Flywheel even acquires the raw scanner data and implements site-specific image reconstruction. The system can also store and search through an enormous range of metadata, including DICOM tags as well as project-specific custom annotations and tags.

Reproducibility and containers

A second requirement of reproducibility is sharing open-source software in a repository, such as GitHub or BitBucket. Researchers, or reviewers, can read the source code and in some cases they can download, install and run it. 

Based on customer feedback, Flywheel learned that (a) downloading and installing software - even from freely available open-source code on GitHub! - can be daunting, (b) customers often had difficulty versioning and maintaining software, as students and postdocs come and go, and (c) they would run the software many times, often changing key parameters, and have difficulty keeping track of the work they had done and the work that remained to be done. 

To respond to these challenges, Flywheel implemented computational tools based on container technology (Docker and Singularity). Implementing mature algorithms in a container nearly eliminates the burden of downloading, compiling, and installing critical pieces of software. A container packages the compiled code along with all of its dependencies, such as libraries, in a small, isolated environment that behaves much like a lightweight virtual machine and can be run on many operating systems (PC, Mac, Linux, each with different variants). Containers can be run on a local machine or on a cloud system, so users no longer have to find the code, update all the dependencies, and compile.

Reproducibility and analyses: Introducing Gears

Once an algorithm is implemented in a container, Flywheel users run it. A lot. They wanted ways to record the precise input data as well as the algorithm version and parameters that were used as they explored the data. The outputs also needed to be recorded. Such a complete record is difficult for individuals to implement, yet it is necessary for reproducibility.

Flywheel solves these problems by creating a computational system for managed application containers, which we call Gears. The Gear is structured to record every parameter needed to perform an analysis. When the user runs a Gear, the input data, specific version of the container, all the parameters needed to run the container, and the output data are all recorded in the database. This record is called an ‘Analysis’, and users can perform and store hundreds of Analyses on a data set.

Because all the information about an Analysis is stored in the database associated with the study, people can re-run precisely the same Gear. It is also straightforward to run the same Gear using different data, or to explore the consequences of re-running the Gear after selecting slightly different parameters. Making Analyses searchable also helps people keep track of which Gears were run and which still need to be run. 
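The record described above can be pictured as a small data structure. The following Python sketch is illustrative only: the class AnalysisRecord, its fields, the rerun_with helper, and the Gear name are hypothetical names chosen for this example, not part of the Flywheel software.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class AnalysisRecord:
    """Hypothetical record of one Gear run (not the Flywheel schema)."""
    gear_name: str
    gear_version: str               # exact container version that was run
    inputs: Tuple[str, ...]         # identifiers of the input files
    parameters: Dict[str, object]   # every parameter passed to the container
    outputs: Tuple[str, ...] = ()   # identifiers of the files produced


def rerun_with(record: AnalysisRecord, **changes: object) -> AnalysisRecord:
    """Describe a new run that reuses the recorded setup, overriding
    only the named parameters. Outputs are cleared because the new
    run has not produced any yet."""
    new_params = {**record.parameters, **changes}
    return replace(record, parameters=new_params, outputs=())


original = AnalysisRecord(
    gear_name="example-motion-correction",   # hypothetical Gear name
    gear_version="1.2.0",
    inputs=("sub-01/ses-01/bold.nii.gz",),
    parameters={"smoothing_mm": 4, "reference_volume": 0},
    outputs=("sub-01/ses-01/bold_corrected.nii.gz",),
)

# Same Gear, same version, same inputs; only one parameter differs.
variant = rerun_with(original, smoothing_mm=6)
```

Because the original record is immutable, exploring a new parameter setting produces a second record and leaves the first one intact, which is what makes an exact re-run possible later.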

Reproducibility and documentation

Clear writing is vitally important to making scientific work reproducible. Tools that support clear and organized notes during the experiments are also very valuable. During the initial development, Flywheel partnered with Fernando Perez and the Jupyter (then IPython) team to implement tools that built on shared software. Flywheel continues to find ways to support these tools. Flywheel tools permit users to link their data to published papers, write documentation about projects and sessions, and add notes. This documentation is part of the searchable database, and Flywheel will continue to help users incorporate clean and thorough documentation.

 


Flywheel Delivers Data Management

Persistently storing data is the critical first step in planning for reproducible science. Defining file formats and organizing directories is a good start; in our experience this is where most researchers focus their efforts. But modern computer science provides many technologies that improve data storage, making data FAIR, i.e., findable, accessible, interoperable, and reusable (see Flywheel Delivers FAIR Principles). Flywheel uses these tools in order to support reproducible science.

Metadata are important

The value of raw data, for example the numerical data of an image, is vastly increased when we know more about the data. This information - called the metadata - can tell us many important things: the instrument parameters used to acquire the data, information about the subject (demographics, medical conditions, etc.), time and place of the acquisition, and facts about the experimental context; for example, that the subject fell asleep during the resting state MR scan.  

The biomedical imaging community recognizes the importance of metadata in two ways. First, standard file formats (DICOM and NIfTI) embed metadata in the file header. Second, the BIDS system stores useful metadata in the file name or in an accompanying ‘sidecar’ file.

Storing metadata within a file header, or an accompanying file, is a good start. But using an extensible database offers many advantages. Here is why:

Databases are efficient

Nearly all modern computer operating systems use databases to store files and their metadata. For example, on Apple systems the Cmd-I command returns metadata (‘Info’) about the file from the operating system’s database (comments, preview, kind of file) as well as standard POSIX information like file size and date of access. The Apple Spotlight search uses the database to identify files.

There are many advantages to storing information about a file in a database compared to putting the information in the file header or an accompanying file. For example, we have seen many cases in which people fail to keep the data file and its accompanying metadata file together; sometimes they rename one of the files and lose the association between the two. Putting the information in the file header avoids these problems but has others. Files are distributed across the disk, making searches through file headers very inefficient. Also, files arise from many different sources, and it is virtually impossible to guarantee that every vendor keeps up to date with changes to the header formats. Headers are most useful for a particular type of file, but not for a large system.

Databases solve these problems by having the user interact with files through a unified interface that includes the name of the raw data file on disk as well as the associated metadata. To read the raw data, one consults the database for the location of the file containing the raw data. To read the metadata, one consults only the database. Typically, the database itself is small, and updates to its format or additions to its content are possible. 
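The pattern described above, where the database holds both the file location and the metadata, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: SQLite is used only because it ships with Python, and the table layout, file paths, and function names are hypothetical.

```python
import json
import sqlite3

# In-memory database standing in for the metadata store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    "id INTEGER PRIMARY KEY, path TEXT, modality TEXT, metadata TEXT)"
)

# Hypothetical entries: location on disk plus metadata stored as JSON.
rows = [
    ("/data/raw/scan_001.dcm", "MR", {"TR": 2.0, "TE": 0.03}),
    ("/data/raw/scan_002.dcm", "MR", {"TR": 0.8, "TE": 0.03}),
    ("/data/raw/scan_003.dcm", "CT", {"kVp": 120}),
]
conn.executemany(
    "INSERT INTO files (path, modality, metadata) VALUES (?, ?, ?)",
    [(path, modality, json.dumps(meta)) for path, modality, meta in rows],
)


def find_paths(modality):
    """Return file locations for a modality by querying only the database."""
    cursor = conn.execute(
        "SELECT path FROM files WHERE modality = ? ORDER BY path", (modality,)
    )
    return [row[0] for row in cursor]


def read_metadata(path):
    """Return the metadata for a file without opening the file itself."""
    row = conn.execute(
        "SELECT metadata FROM files WHERE path = ?", (path,)
    ).fetchone()
    return json.loads(row[0]) if row else None


mr_paths = find_paths("MR")
```

Neither function touches the raw data files: a search reads only the database, and the raw file is opened only after the database has supplied its location.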

Flywheel uses a document database (MongoDB) to manage user interactions with data and metadata. In the Flywheel system, you can read metadata via the web-browser interface. When programming, you can access metadata using the software development kits (SDKs) or REST API. 

Metadata can be attached to any object in the system hierarchy

The Flywheel data are organized in a hierarchy: Group, Project, Subject, Session, Acquisition, Files and Analyses. This hierarchy can incorporate virtually any file type and associated metadata. Most of our customers store files containing medical imaging data in the hierarchy, including MRI, PET, CT, OCT, and pathology images. But some customers store other types of files, such as computer graphics files that are useful for machine learning. All of the objects, both the files and the organizational containers (Project, Subject, Session, Acquisition, Analyses), are described in the database, each with its own metadata. Users can search, annotate and reuse the files and containers from any level in the Flywheel system.
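One way to picture metadata attached at every level is a tree in which each node carries its own dictionary. The sketch below is an assumption made for illustration: the Node class, the walk and find helpers, and the sample labels are hypothetical and do not reflect how Flywheel stores its hierarchy internally.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Node:
    """One object in the hierarchy; every node carries its own metadata."""
    level: str                      # e.g. "project", "session", "file"
    label: str
    metadata: Dict[str, object] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from walk(child)


def find(root: Node, key: str, value: object) -> List[Node]:
    """Return every node, at any level, whose metadata has key == value."""
    return [n for n in walk(root) if n.metadata.get(key) == value]


# Hypothetical project with one subject, session, acquisition, and file.
scan = Node("file", "dwi.dcm", {"modality": "MR", "reviewed": True})
acquisition = Node("acquisition", "DWI 64 dir", {"reviewed": True}, [scan])
session = Node("session", "visit-1", {"site": "A"}, [acquisition])
subject = Node("subject", "sub-01", {"sex": "F"}, [session])
project = Node("project", "example-study", {"reviewed": False}, [subject])

# The same query matches a container and a file, because both hold metadata.
reviewed = find(project, "reviewed", True)
```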

Metadata are flexible

By using a general database, Flywheel can be complete and flexible. For MRI DICOM files, the database includes all of the header information in the file, such as TR, TE, voxel size, and diffusion directions. In addition, the Flywheel database includes fields for users to place searchable notes, say, about the experiment. The database can also include links to additional experimental information about the subject and auxiliary measures (often behavioral data).

The Flywheel database can add fields without needing to rebuild the entire database. For example, as new MRI technologies developed, we were able to add additional fields that describe the new acquisition parameters. Similarly, Flywheel regularly expands to manage new types of data; as we do so, we add new database fields.
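The ability to add fields without a rebuild is a general property of document-style storage. In the Python sketch below, a list of dictionaries stands in for a document collection; the field name multiband_factor and the helper functions are hypothetical examples, not actual Flywheel fields.

```python
# A list of dictionaries stands in for a document collection.
acquisitions = [
    {"label": "T1w", "TR": 2.3, "TE": 0.003},
    {"label": "rest", "TR": 0.8, "TE": 0.03},
]

# A new acquisition parameter appears. Only the new document carries it;
# the existing documents are left untouched and no migration is run.
acquisitions.append(
    {"label": "rest-mb", "TR": 0.8, "TE": 0.03, "multiband_factor": 8}
)


def with_field(documents, name):
    """Return the labels of documents that define the given field."""
    return [doc["label"] for doc in documents if name in doc]


def value_or_default(document, name, default=None):
    """Read a field that older documents may not have."""
    return document.get(name, default)
```

Queries on the new field simply skip older documents, and code that reads the field supplies a default for entries created before the field existed.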

Data reuse

Flywheel helps users to reuse data by (a) helping them find data sets and (b) using the search results to create a new project in their database. Adding a database entry eliminates the need for data copying - we simply copy database entries to specify the new project’s sessions, acquisitions, and files.  Flywheel calls such a virtual project a 'Collection'. 
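The idea of a virtual project that points to existing entries, rather than copying files, can be sketched as follows. The FileEntry and Collection classes, the catalog, and the storage paths are hypothetical and serve only to illustrate the reference-not-copy design.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileEntry:
    """Database entry for one stored file (hypothetical structure)."""
    file_id: str
    storage_path: str     # where the bytes live; never duplicated
    project: str


@dataclass
class Collection:
    """A virtual project: holds ids of existing entries, not file data."""
    label: str
    file_ids: List[str] = field(default_factory=list)


catalog: Dict[str, FileEntry] = {
    "f1": FileEntry("f1", "/store/a/dwi_01.dcm", "study-A"),
    "f2": FileEntry("f2", "/store/a/t1_01.dcm", "study-A"),
    "f3": FileEntry("f3", "/store/b/dwi_07.dcm", "study-B"),
}

# Search result: every diffusion file, regardless of source project.
hits = [fid for fid, entry in catalog.items() if "dwi" in entry.storage_path]
pooled = Collection("pooled-dwi", hits)


def resolve(collection: Collection) -> List[str]:
    """Look up the storage paths the collection points to."""
    return [catalog[fid].storage_path for fid in collection.file_ids]
```

Creating the collection adds one small entry that lists identifiers. The catalog still holds three files, and the files from both source projects stay where they are.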

Reproducible science 

Data management and the ability to search across all types of objects enhance the value of the data. Carefully storing and managing metadata supports finding and reusing data, two pillars of FAIR and reproducible research.

Contact us here for a demonstration to see how Flywheel’s database and further computing features can be the backbone of your research.


Flywheel Delivers FAIR Principles

The FAIR acronym is a nice way to summarize four important aspirations of modern research practice: scholarly data should be Findable, Accessible, Interoperable, and Reusable. The article describing the FAIR aspirations is excellent, and we recommend reading it. Some limitations of current practice are described here. Our company was founded to advance research and we embrace these principles.

Flywheel, software used by thousands of researchers, embodies tools and technology that deliver on the FAIR principles.

About Flywheel

Flywheel is an integrated suite of software tools that (a) stores data and metadata in a searchable database, (b) includes computational tools to analyze the data, and (c) provides users with both browser-based and command line tools to manage data and perform analyses. Our customers use these tools on a range of hardware platforms: cloud systems, on-premise clusters and servers, and laptops.

Flywheel supports users throughout a project’s life cycle. The software can import data directly from the instrument (like an MR scanner), extract metadata from the instrument files, and store that metadata in the database. Auxiliary data from other sources can also be imported into the database. The user can view, annotate, and analyze the data, keeping track of all the scientific activities. Finally, the data and analyses can be shared widely when it is time to publish the results.

FAIR Data Principles Implemented

Findable

Flywheel makes data ‘Findable’ by search and browsing. The Flywheel search tools address the entire site’s dataset, looking for data with particular features. It is straightforward, for example, to find the diffusion-weighted imaging data for female subjects between the ages of 30 and 45. The user can contact the owners of the data for access, and the data returned by a search can be placed in a virtual project (Collection) for reuse and further analysis.
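The example query above (diffusion-weighted data for female subjects between 30 and 45) amounts to a filter over metadata fields. The following Python sketch shows the logic only; the record layout, field names, and search function are assumptions for this illustration and are not the Flywheel search syntax.

```python
# Hypothetical session records as they might appear in a metadata index.
sessions = [
    {"subject": "s01", "sex": "F", "age": 34, "classification": "diffusion"},
    {"subject": "s02", "sex": "M", "age": 41, "classification": "diffusion"},
    {"subject": "s03", "sex": "F", "age": 52, "classification": "diffusion"},
    {"subject": "s04", "sex": "F", "age": 45, "classification": "T1"},
    {"subject": "s05", "sex": "F", "age": 30, "classification": "diffusion"},
]


def search(records, classification, sex, min_age, max_age):
    """Return subjects whose records satisfy every criterion.
    The age range is inclusive at both ends."""
    return [
        r["subject"]
        for r in records
        if r["classification"] == classification
        and r["sex"] == sex
        and min_age <= r["age"] <= max_age
    ]


matches = search(sessions, "diffusion", "F", 30, 45)
```

A query like this depends entirely on the metadata being present and consistent, which is why the quality of the metadata matters so much for findability.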

Search is most effective when there are high-quality metadata associated with the data and analyses. Flywheel creates a deep set of metadata by scanning and classifying the image data. Users can attach specific searchable keywords and add data-specific notes at many places: the overall project level, the session level, the specific data file, or the analyses. Users can find data by searching based on these descriptions.

Accessible

Our customers frequently observe that there is a conflict between making data accessible (sharing) while complying with health privacy rules. We live in a world with privacy officers on the one hand and open data advocates on the other.

Flywheel delivers an accessible solution that is respectful of both principles. We implemented a rigorous user-rights management system that is easy to use. Access to the data and analyses is controlled through a simple web-based interface. The system implements the different roles that are needed during a project’s life cycle. At first perhaps only the principal investigator and close collaborators have access; later, additional people (reviewers, other scientists) might be granted access to check the data and analyses. When ready, the anonymized data and full descriptions of the analyses can be made publicly viewable. An effective system that manages a project through these stages is complicated to write, but Flywheel makes the system easy-to-use through its browser interface.
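The staged access described above can be sketched as a role check combined with a publication flag. The role names, the actions, and the policy in this Python example are hypothetical; they illustrate the general idea and do not describe the actual Flywheel permission model.

```python
from enum import IntEnum


class Role(IntEnum):
    """Hypothetical roles, ordered from least to most privileged."""
    PUBLIC = 0
    REVIEWER = 1
    COLLABORATOR = 2
    OWNER = 3


# Minimum role needed for each action (illustrative policy only).
REQUIRED = {
    "view_anonymized": Role.PUBLIC,
    "view_identified": Role.COLLABORATOR,
    "run_analysis": Role.COLLABORATOR,
    "grant_access": Role.OWNER,
}


def is_allowed(role: Role, action: str, published: bool) -> bool:
    """Check an action against the policy. Before publication,
    nothing is visible to the public, including anonymized data."""
    if role is Role.PUBLIC and not published:
        return False
    return role >= REQUIRED[action]
```

With this policy, a reviewer can inspect anonymized data before publication but cannot run analyses, and the public gains read access to anonymized data only once the project is marked as published.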

Interoperable

Most scientists have felt the frustration of learning that a dataset is available, but the file format or organization of the data files requires substantial effort to decode and use. The medical imaging community has worked to reduce this burden by defining standardized file and directory organizations. Flywheel is committed to using and promoting these standards.

Our experience teaches us that well intentioned file formats and directory organizations are not enough. Flywheel stores far more information than what one finds in the header of a DICOM or NIfTI file or the BIDS directory structure. Our commitment to interoperability includes reading in files and directories in these standards and even writing Flywheel data into these formats. Beyond this, we are committed to tools that import and export data and metadata between Flywheel and other database systems.

Flywheel is further committed to supporting the interoperability of computational tools. We have opened our infrastructure so that users can analyze data using Flywheel-defined containerized algorithms, their own containers, or their own custom software. The Flywheel standards are clearly defined based on industry-standard formats (e.g., JSON, Docker, Singularity) so that other groups can use them and in this way support computational interoperability.

Reusable

From its inception, Flywheel was designed to make data reusable. Users at a center can share data within their group or across groups, reuse data by combining it from different groups, and create and share computational tools. The user can select data from any project and merge it into a new project. Such reused data is called a Collection in Flywheel. The original data remain securely in place, and the user can analyze the Collection as a new virtual project. All the analyses, notes, and metadata of the original data remain attached to the data as they are reused.

Equally important, the computational methods are carefully managed and reusable. Each container for algorithms is accompanied by a precise definition of its control parameters and how they were set at execution time. This combination of container and parameters is called a Flywheel Gear, and the specific Gear that was executed can be reused and shared.
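A parameter definition of this kind can be thought of as a small specification that every run is checked against. The Python sketch below is an assumption for illustration: PARAMETER_SPEC, its parameter names, and resolve_parameters are hypothetical and do not represent the Flywheel Gear definition format.

```python
# Hypothetical parameter definition for a Gear (not the Flywheel format).
PARAMETER_SPEC = {
    "smoothing_mm": {"type": float, "default": 4.0},
    "iterations": {"type": int, "default": 10},
    "save_intermediate": {"type": bool, "default": False},
}


def resolve_parameters(spec, user_values):
    """Combine defaults with user settings and reject anything undeclared
    or of the wrong type, so the stored settings are always complete."""
    unknown = set(user_values) - set(spec)
    if unknown:
        raise ValueError(f"undeclared parameters: {sorted(unknown)}")
    resolved = {}
    for name, rule in spec.items():
        value = user_values.get(name, rule["default"])
        if not isinstance(value, rule["type"]):
            raise TypeError(f"{name} must be {rule['type'].__name__}")
        resolved[name] = value
    return resolved


# The user sets one parameter; the rest are filled in from the definition.
settings = resolve_parameters(PARAMETER_SPEC, {"iterations": 25})
```

Storing the fully resolved settings, rather than only the values the user typed, means a later run does not depend on whatever the defaults happen to be at that time.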

More

The FAIR principles are an important part of the Flywheel system. We have also been able to design in additional functionality that supports these principles.

  • Security and data backup are fundamental. The ability to import older data into the modern system has also been valuable to many of our customers.
  • The visualization tools built into Flywheel help our customers check for accuracy and data quality as soon as the data are part of the system.
  • The programming interface, supported by endpoints accessible in three different scientific programming languages, permits users to test their ideas in a way that gracefully leads to shared data and code.