Incorporating R&D Workflows into Life Science Digital Transformation

Digital Transformation in the Age of AI Requires New Infrastructure

A digital transformation is underway as life science organizations work to reduce costs, increase operational efficiency, and accelerate drug development. Data is the key. Consequently, these organizations are focused on integrating scalable analytics, adopting artificial intelligence (AI) and machine learning (ML), and fully migrating their operations to the cloud. The ultimate objective of these initiatives is to create a culture of collaboration and experimentation that drives innovation and meets the demands of a rapidly evolving healthcare landscape, especially in the face of unpredictable events such as the COVID-19 pandemic.

Medical imaging is an important component of this vision: it is a rich source of patient information that can accelerate drug discovery and development by helping to diagnose and assess disease, define and quantify biomarkers, and optimize the clinical trial process. With megapixel upon megapixel of sub-millimeter-resolution data packed into the outputs of X-rays, CT scans, MRIs, and other modalities, medical imaging is ripe for artificial intelligence applications, especially when optimizing drug development and clinical trials is the ultimate goal.

However, the incorporation of medical imaging workflows into a digital transformation R&D ecosystem is not trivial. Domain-specific tools are necessary to access and curate large volumes of imaging data and manage complex computational algorithms and AI workflows, all while maintaining data quality, privacy, and compliance. Additionally, standardization of data and analytical workflows is critical to enable collaboration. If integrated effectively, this powerful technology can greatly accelerate innovation and enable teams to meet their R&D objectives.

In our experience working with life science organizations, there are common challenges they face when taking on this type of digital transformation. What follows is not an exhaustive list of problems and solutions but rather a few infrastructure-related guidelines that are important as life science companies incorporate medical imaging into their digital transformation ecosystem.

Data Management is the Key to Life Sciences R&D

Problem: Consolidating data from disparate sources into a single repository

High-quality data can drive better patient recruitment and engagement and lead to more efficient trials and higher-quality results. With AI and modern image processing techniques, there are new opportunities to gain insights from medical imaging data. Imaging data (mostly DICOM) originates from many disparate sources and partners, including CROs, research institutions, internal stores, and external real-world data, each hosted on its own system. Life science companies need to not only bring together data from these sources but also make the data easily accessible for consolidation, labeling, and conversion to desired formats. In fact, a recent IBM study reported that more than 80% of the effort in AI and big data projects is linked to data preparation.1 The diversity and complexity of medical data types add further difficulty and expense to data management.

Solution: A robust database and workflow to handle large volumes of data

Organizations must first validate their data to ensure that it was completely ingested and that the received data is appropriate for the research purpose. Next, the data needs to be examined to confirm it is of adequate quality for processing and analysis. Since healthcare data tends to be large, complex, and diverse, enterprise-level scaling requires significant stress testing to ensure that the platform can onboard large numbers of active researchers. Additionally, every data access, curation, and processing action, whether manual or automated, needs to be logged and tracked to establish reproducibility and audit readiness.

Automated workflows are also mandatory, since ingested data is on the order of terabytes or petabytes and manual processes are inefficient, time consuming, and prone to human error. At the point of entry, data (and metadata) needs to be de-identified and classified, and quality control algorithms need to be triggered to “prep” the data for larger-scale, complex analysis. Associated non-DICOM (non-imaging) data also needs to be handled with care, as this data is needed for analysis. All ingested data ultimately requires a flexible, robust, searchable framework in which all metadata and processes are automatically indexed and immediately available for search within the system.
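As a minimal sketch of what such a point-of-entry step can look like, the following Python fragment de-identifies a DICOM file and returns the metadata an ingest service could push to a searchable index. It uses the open-source pydicom library; the specific tags, hashing scheme, and index fields are illustrative assumptions, not a description of any particular product.

    import hashlib
    import pydicom

    # Tags that commonly carry protected health information (illustrative subset).
    PHI_TAGS = ["PatientName", "PatientBirthDate", "PatientAddress", "OtherPatientIDs"]

    def deidentify_and_index(dicom_path, out_path):
        ds = pydicom.dcmread(dicom_path)

        # Replace the patient identifier with a one-way hash so scans from the same
        # subject can still be grouped without exposing the original ID.
        original_id = str(ds.get("PatientID", ""))
        ds.PatientID = hashlib.sha256(original_id.encode()).hexdigest()[:16]

        # Remove direct identifiers entirely.
        for tag in PHI_TAGS:
            if tag in ds:
                delattr(ds, tag)

        ds.save_as(out_path)

        # Return the metadata an ingest service would index, plus a simple QC flag.
        return {
            "subject": ds.PatientID,
            "modality": str(ds.get("Modality", "")),
            "study_date": str(ds.get("StudyDate", "")),
            "series_description": str(ds.get("SeriesDescription", "")),
            "qc_passed": int(ds.get("Rows", 0)) > 0 and int(ds.get("Columns", 0)) > 0,
        }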

Cloud Scale Computing Enables AI and Complex Analysis

Problem: Medical image processing and AI place high demand on resources 

Large-scale data analysis in medical imaging often revolves around combining multiple complex algorithms into “pipelines,” i.e., data processing elements connected in series, where the output of one element is the input of the next. These pipelines are necessary for image segmentation, biomarker quantification, and synthetic data creation, and in most cases they are applied to hundreds or thousands of datasets. Inevitably, local IT infrastructures struggle to maintain the many algorithms and associated processing workflows, especially when developers want to fully utilize a multitude of CPUs, GPUs, and TPUs.
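In code, such a pipeline is nothing more than a series of functions in which each element’s output becomes the next element’s input. A minimal, self-contained sketch, where the stages are crude placeholders for real segmentation and quantification algorithms:

    import numpy as np

    def skull_strip(volume):
        # Placeholder: zero out voxels below a crude intensity threshold.
        return np.where(volume > volume.mean(), volume, 0.0)

    def segment(volume):
        # Placeholder: binary "lesion" mask from a fixed threshold.
        return (volume > 0.8).astype(np.uint8)

    def quantify_biomarkers(mask, voxel_volume_ml=0.001):
        # Reduce the segmentation to a single numeric biomarker.
        return {"lesion_volume_ml": float(mask.sum()) * voxel_volume_ml}

    def run_pipeline(stages, data):
        # Each element's output becomes the next element's input.
        for stage in stages:
            data = stage(data)
        return data

    volume = np.random.rand(128, 128, 64)   # stand-in for a loaded image volume
    metrics = run_pipeline([skull_strip, segment, quantify_biomarkers], volume)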

Solution: A cloud-scale processing infrastructure integrated with a curated database

Flexible deployment of pipelines can greatly ease the strain of development. Containerizing these pipelines (or pipeline components) reduces the IT burden of maintaining these algorithms over time and promotes reproducible practices. A processing infrastructure that leverages local compute resources for low-volume processing, combined with elastic cloud scaling for large-scale processing, optimizes for both cost and capacity.
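A hedged sketch of this hybrid strategy, assuming the pipeline steps have already been packaged as Docker images; the submit_to_cloud helper and the capacity threshold are hypothetical placeholders for whatever batch service and policy an organization actually uses:

    import subprocess

    def run_containerized_step(image, input_dir, output_dir):
        # Run one containerized pipeline step locally; the container carries the
        # algorithm and its dependencies, so only Docker is needed on the host.
        subprocess.run(
            ["docker", "run", "--rm",
             "-v", f"{input_dir}:/input:ro",
             "-v", f"{output_dir}:/output",
             image],
            check=True,
        )

    def submit_to_cloud(jobs):
        # Hypothetical stand-in for submitting the same container images to an
        # elastic cloud batch service.
        raise NotImplementedError("wire this to your cloud batch service")

    def dispatch(jobs, local_capacity=10):
        # Small batches run on local hardware; larger batches burst to the cloud.
        if len(jobs) <= local_capacity:
            for image, input_dir, output_dir in jobs:
                run_containerized_step(image, input_dir, output_dir)
        else:
            submit_to_cloud(jobs)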

As life science companies look to machine learning to guide the future of their product development, they need machine learning workflows with comprehensive provenance for reproducibility and regulatory approvals. Ideally, organizations want the ability to easily search for and locate cohorts of data, train AI models, and run data conversion and quality assurance locally before scaling the models in the cloud. This workflow has benefited many life science companies working with medical imaging and other associated data sets.

Collaboration Across the Enterprise Drives Innovation

Problem: Not only are data and algorithms siloed, so are the people

Many large life science companies employ a vast array of scientists and engineers located across many geographies. These professionals often need to collaborate with internal and external partners to advance an R&D initiative. Inevitably, their ability to collaborate is closely tied to their ability to share large-scale data and complex processing pipelines.

Solution: Data and processing pipelines should be closely linked and in the cloud

Migrating data from one location to another in the life science industry has its share of complexities, ranging from data transfer bottlenecks to regulatory compliance. Additionally, many of these companies have teams located all over the world, requiring observance of the regulatory requirements of each country or region. Federating databases across disparate regions, with computation resources closely tied to data locality, provides researchers with a seamless resource for accessing data and algorithms and eliminates the need to manage multiple databases. Through web interfaces and software development kits, users can securely access the platform, upload data, and process it. Access privileges to data and algorithms can also be defined with secure controls and in compliance with regulatory constraints.

The Way Forward in Life Sciences R&D

The modern life science company is moving towards a “data-driven” operational model. Medical imaging plays an important role in this new paradigm, as the power of diagnostic tools can greatly enhance R&D discovery and clinical trial outcomes. Additional data types such as digital pathology, microscopy, and genomics are becoming complementary additions to multi-modal research, adding significant value for the diagnosis of complicated diseases while also adding complexity to data management. Integrating all of these data types as part of a digital transformation initiative requires an all-encompassing solution that can curate and organize large volumes of data, enable complex processing and AI pipelines, and provide the tools necessary to enhance collaboration across many teams and partners.

Author: Jim Olson, CEO, Flywheel Exchange, LLC.

To learn more about Flywheel’s enterprise-scale research data management platform and how it  enables digital transformation in the life sciences, please click here or email info@flywheel.io.

1. https://www.ibm.com/cloud/blog/ibm-data-catalog-data-scientists-productivity


Leveraging Flywheel for Deep Learning Model Prediction

Since 2012, the Medical Image Computing and Computer Assisted Intervention Society (MICCAI) has put on the Brain Tumor Segmentation (BraTS) challenge with the Center for Biomedical Image Computing and Analytics (CBICA) at the Perelman School of Medicine at the University of Pennsylvania. The past eight competitions have seen rapid improvements in the automated segmentation of gliomas. This automation promises to address the most labor-intensive process required to accurately assess both the progression and effective treatment of brain tumors.

In this article, we demonstrate the power and potential of coupling the results of this competition with a FAIR (Findable, Accessible, Interoperable, Reusable) framework. Because constructing a well-labeled dataset is the most labor-intensive component of processing raw data, it is essential to automate this process as much as possible. We use Flywheel as our FAIR framework to demonstrate this.

Flywheel (flywheel.io) is a FAIR framework that combines a proprietary core infrastructure with open-source extensions (gears) to collect, curate, compute on, and collaborate on clinical research data. The core infrastructure of a Flywheel instance manages the collection, curation, and collaboration aspects, enabling multi-modal data to be quickly searched across an enterprise-scale collection. Each “gear” of the Flywheel ecosystem is a container-encapsulated open-source algorithm with a standardized interface. This interface enables consistent stand-alone execution or coupling with the Flywheel core infrastructure, complete with provenance of raw data, derived results, and usage records.

For the purposes of this illustration, we wrap into a gear the second-place winner of the MICCAI 2017 BraTS Challenge. This team’s entry is one of the few that has both a Docker Hub image and a well-documented GitHub repository available. Their algorithm is built around the TensorFlow and NiftyNet frameworks for training and testing their deep learning model. As illustrated in our GitHub repository, this “wrapping” consists of providing the data configuration their algorithm expects and launching it for model prediction (*).
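A simplified sketch of what such a wrapper script can look like. The config-file layout, staging paths, and the final command are illustrative assumptions based on the general pattern of a containerized gear (a JSON file describing the inputs plus a designated output directory); they are not taken from the authors’ repository.

    import json
    import shutil
    import subprocess
    from pathlib import Path

    GEAR_ROOT = Path("/flywheel/v0")          # assumed gear working directory
    config = json.loads((GEAR_ROOT / "config.json").read_text())

    # Stage the four co-registered, skull-stripped modalities into the directory
    # structure the segmentation algorithm expects.
    staging = Path("/tmp/brats_input")
    staging.mkdir(parents=True, exist_ok=True)
    for modality in ("t1", "t1ce", "t2", "flair"):
        src = Path(config["inputs"][modality]["location"]["path"])
        shutil.copy(src, staging / f"{modality}.nii.gz")

    # Launch the wrapped algorithm for model prediction; the command and model
    # configuration path below are placeholders for the container's actual entry point.
    subprocess.run(
        ["net_segment", "inference", "-c", "/opt/model/brats_config.ini"],
        check=True,
    )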

As shown in the figure above, Flywheel provides a user-friendly interface to navigate to the MRI images expected for execution. With the required co-registered and skull-stripped MRI modalities (T1-weighted, T1-weighted with contrast, T2-weighted, and Fluid-Attenuated Inversion Recovery), segmentation into distinct tissues (normal, edema, contrast-enhancing, and necrosis) takes twelve minutes on our team’s Flywheel instance (see figure below); manually segmenting the same tumor can take a person over an hour. When performed on a Graphics Processing Unit (GPU), the automated task takes less than three minutes to complete.

Segmentation into normal, edema, contrast-enhancing, and necrosis tissues with the Flywheel-wrapped second-place winner of the 2017 BraTS Challenge.

Although this example predictively segments the tumor of a single patient, modifications to this gear can allow tumor segmentation of multiple patients for multiple imaging sessions over the course of their care. Furthermore, with scalable cloud architecture, these tasks can be deployed in parallel, significantly reducing the overall time required to iterate inference over an entire image repository. Enacting this as a pre-curation strategy could significantly reduce the time necessary for manual labeling of clinical imaging data. 

Therein lies the vast potential benefit of using a strong FAIR framework in an AI-mediated workflow: the ability to pre-curate new data, optimize human input, and retrain on well-labeled data over accelerated time scales. These model design, training, and testing cycles are greatly facilitated by a FAIR framework, which curates the data, results, and their provenance in a searchable interface.

As with this brain tumor challenge example, many other challenge events make their algorithms and pretrained models publicly available to the research community. One nexus of these is Grand Challenges in Biomedical Image Analysis, which hosts over 21,000 submissions across 179 challenges (56 public, 123 hidden). Flywheel’s capacity to quickly package these algorithms to interoperate with its framework makes it a powerful foundation for a data-driven research enterprise.

Two more useful deep learning and GPU-enabled algorithms have recently been incorporated into Flywheel gears. First, quickNAT uses default or user-supplied pre-trained deep learning models to segment neuroanatomy within thirty seconds when deployed on sufficient GPU hardware. We have wrapped a PyTorch implementation of quickNAT in a Flywheel gear. Prediction of brain regions on CPU hardware requires two hours. Although much longer than the thirty seconds needed on a GPU, this is still a fraction of the nearly twelve hours needed for FreeSurfer’s recon-all. Next, we have Nobrainer, a deep learning framework for 3D image processing. The derived Flywheel gear uses a default (or user-supplied) pre-trained model to create a whole-brain mask within two minutes on a CPU. Utilizing a GPU brings this time down to under thirty seconds.

The previous paragraph raises two questions. First, with GPU model prediction significantly faster than CPU, when will GPU-enabled Flywheel instances be available? Second, how can Flywheel be effectively leveraged in training deep learning models? Flywheel is actively developing GPU-deployable gears and the architecture to deliver them. We briefly explore the second question next, leaving a more thorough investigation for another article.

Training on an extensive and diverse dataset is needed for Deep Learning models to generalize effectively and accurately across unseen data. With uncommon conditions, such as gliomas, finding enough high-quality data at a single institution can be daunting. Furthermore, sharing these data across institutional boundaries incurs the risk of exposing protected health information (PHI). With Federated Training, Deep Learning models (and their updates) are communicated across institutional boundaries to acquire the abstracted insight of distributed annotation. This eliminates the risk and requirement of transferring large data repositories while still allowing model access to a diverse dataset. With Federated Search across institutional instances of Flywheel firmly on the roadmap, this type of Federated Training of Deep Learning models will be possible within the Flywheel ecosystem.
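To make the idea concrete, here is a minimal federated-averaging sketch. It is a generic illustration of the technique, not Flywheel’s implementation, and local_gradient is a hypothetical stand-in for a site’s actual training step.

    import numpy as np

    def local_gradient(weights, site_data):
        # Hypothetical stand-in for one institution's training step on its own data.
        features, labels = site_data
        predictions = features @ weights
        return features.T @ (predictions - labels) / len(labels)

    def local_update(global_weights, site_data, learning_rate=0.01):
        # Each site refines the shared model locally and returns only the updated
        # weights; images and PHI never leave the institution.
        return global_weights - learning_rate * local_gradient(global_weights, site_data)

    def federated_round(global_weights, sites):
        # Average the site updates (federated averaging): only model parameters
        # cross institutional boundaries.
        updates = [local_update(global_weights, site_data) for site_data in sites]
        return np.mean(updates, axis=0)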

(*) The authors of this repository and the University College London do not explicitly promote or endorse the use of Flywheel as a FAIR framework. 


Why a Research-First Platform for Imaging Informatics and Machine Learning?

It's no secret that researchers face many challenges that impede the research and development of artificial intelligence (AI) solutions in clinical settings. Machine learning requires large volumes of data for accuracy in most applications. Institutions often have a wealth of data but lack the systems needed to get it into the hands of researchers cost-effectively.

Those data must be of high quality and labeled correctly. Imaging projects often involve complex preprocessing to identify and extract features and biomarkers. To further complicate matters, security and privacy are critical, particularly when involving collaboration outside of the context of clinical care.

Unfortunately, established clinical solutions fail to address six critical needs of researchers, impeding research productivity and slowing innovation.

Multimodality

Imaging offers significant opportunities for machine learning, but imaging is often not enough. Given that so much of today's research is centered around precision medicine and opportunities to revolutionize cost and quality of care, researchers often require a 360° view of patients including EMR, digital pathology, EEG, -omics, and other data. Clinical imaging systems such as PACS and vendor-neutral archives (VNAs) are designed specifically for imaging and typically don't deal well with nonimaging data, particularly in the context of research workflows.

Cohorts, projects, and IRB compliance

Researchers require the ability to organize and analyze data in cohorts while enabling collaboration with others outside of the context of clinical care. Clinical imaging systems are designed for individual patient care, not for cohort or population health studies, and often lack the organizational structures required for research applications such as machine learning. Institutional review boards (IRBs) typically define for a project the scope of allowed data as well as the people authorized to work with that data. Modern research informatics systems must enable productive workflows while enforcing these IRB constraints.

Quality assurance

Machine learning can be highly sensitive to the quality of the data. Researchers must be able to confirm the quality of data, including completeness and consistency with the protocol defined for the study. Quality control and supporting documentation are required for scientific reproducibility and for processes such as U.S. Food and Drug Administration (FDA) approval. Consequently, modern informatics systems must incorporate comprehensive support for quality assurance as part of the workflow.

Integrated labeling and annotation workflows

Machine learning depends on accurately labeled sample datasets in order to effectively train AI models. Real-world data, often originating from multiple sources, generally lack the structure and consistent labels required to directly support training. Modern imaging informatics solutions must provide the ability to efficiently organize and classify data for search and selection into the appropriate projects or machine-learning applications. Labeling workflows must be supported, including the ability to normalize classification of images and other factors such as disease indications. In the context of imaging, this may involve image annotations collected from radiologists or other experts in a consistent, machine-readable manner via blind multireader studies or similar workflows.

Automated computational workflows

Imaging and machine learning are computationally intensive activities. Research informatics platforms must automate and scale computational workflows ranging from basic image preprocessing to analytic pipelines and training AI models. The ability to rapidly define and integrate new processes using modern tools and technologies is critical for productivity and sustainability. These systems must also provide the ability to leverage diverse private cloud, public cloud, and high-performance computing (HPC) infrastructures to achieve the performance required to process large cohorts cost-effectively.

Integrated data privacy

Data privacy is critical. Compliance with regulations such as HIPAA and GDPR is a must, given the potential financial and ethical risks. However, the lack of scalable systems for ensuring data privacy is impeding researcher access to data and, therefore, slowing innovation and the related benefits. Modern research informatics solutions must systematically address data privacy. Regulations require deidentification of protected health information to the minimum level required for the intended use; however, that minimum level may differ by project. Consequently, informatics solutions must integrate deidentification and related data privacy measures in a way that meets the needs of projects with different requirements while maintaining compliance.

Data as a strategic asset with FAIR

Data is the key to clinical research and machine learning. A scalable, systematic approach to research data management should be the foundation of research strategies aimed at machine learning and precision care. Cost-effectively scaling access to clinical data in a manner that supports research workflows while ensuring security and data privacy can improve research productivity, accelerate innovation, and enable research organizations to realize their strategic potential.

Implementing the FAIR principles in your organization helps maximize the strategic value of data that exists in your institution. These principles, developed by academics, agency professionals, and industry members, amplify the value of data by making it Findable, Accessible, Interoperable, and Reusable (FAIR).

  • Findable data are labeled and annotated with rich metadata, and the metadata are searchable.
  • Accessible data are open to researchers with the correct authorization, and the metadata persist even after data are gone.
  • Interoperable data follow standards for storing information and can operate with other metadata and systems.
  • Reusable data are well-described and well-tracked with provenance for computation and processing.

Modern informatics systems should deliver on the FAIR principles while supporting the workflow needs of researchers as described above.

A clinical research platform designed to enhance productivity and accelerate innovation

Flywheel is a new class of informatics platform that addresses the unique needs of researchers involved in imaging and machine learning. Deployed at leading research institutions around the world, Flywheel supports the entire research workflow including capture, curation, computation, and collaboration, plus compliance at each step.

Capture

Flywheel is designed for true multimodality research. While the system specializes in the unique data types and workflows associated with imaging, the platform is capable of managing nonimaging data such as EMR, digital pathology, EEG, genomics, or any other file-based data. Further, Flywheel can automate data capture from imaging modalities and also clinical PACS and VNAs to streamline research workflows as well as translational testing scenarios.

Curate

Flywheel is unique in its ability to organize and curate research data in cohort-centric projects. The platform provides extensive tools for managing metadata including classification and labeling. Quality assurance is supported through project templates and automation rules. Integrated viewers with image annotation and persistent regions of interest (ROIs) are provided to support blind multireader studies and related machine-learning workflows. Powerful search options with access to all standard or custom metadata are provided to support the FAIR principles.

Compute

Flywheel provides comprehensive tools to automate routine processing, ranging from simple preprocessing to full analytic pipelines and training machine-learning models. The platform scales computational workloads using industry-standard "containerized" applications referred to as "Gears." Gears may originate from Flywheel's Gear Exchange containing ready-to-use applications for common workflows or may be user-provided custom applications. The platform supports elastic scaling of workloads to maximize performance and productivity. Gears automate capture of provenance to support scientific reproducibility and regulatory approvals. Further, Flywheel helps you work with existing pipelines external to the system with powerful APIs and tools for leading scientific programming languages, including Python, MATLAB, and R.
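As a hedged sketch of how such automation can be driven programmatically: the endpoints, query syntax, and payloads below are hypothetical placeholders for illustration, not Flywheel's documented API.

    import requests

    BASE = "https://imaging.example.com/api"          # hypothetical instance URL
    HEADERS = {"Authorization": "Bearer <api-key>"}   # placeholder credential

    # 1. Find a cohort by metadata.
    sessions = requests.get(
        f"{BASE}/search",
        headers=HEADERS,
        params={"query": "modality:MR AND label:baseline"},
    ).json()

    # 2. Launch a containerized analysis ("Gear") against each matching session;
    #    provenance is recorded by the platform rather than by hand.
    for session in sessions:
        requests.post(
            f"{BASE}/jobs",
            headers=HEADERS,
            json={"gear": "brain-segmentation", "session": session["id"]},
        )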

Collaborate

Collaboration is enabled through secure, IRB-compliant projects. Collaboration may be within an institution or across the globe for applications such as clinical trials or multicenter studies. Flywheel projects provide role-based access controls to authorize access and control sharing of data and algorithms. Data may be reused across project boundaries for applications such as machine learning, which require as much data as possible.

Compliance

Flywheel helps reduce security and data privacy risks by providing a secure, regulatory-compliant infrastructure for systematically scaling research data management according to HIPAA and GDPR requirements. The platform provides integrated tools for deidentification of research data to ensure the protection of personal healthcare information.

A research-first platform answers the challenges to implementing AI

Flywheel's innovative research informatics platform helps you maximize the value of your data and serves as the backbone of your imaging research and machine learning strategy. Flywheel overcomes the limitations of systems designed for clinical operations to meet the unique needs of researchers. The result is improved collaboration and data sharing and reuse. Ultimately, Flywheel improves research productivity and accelerates innovation.

The original article can be found on Aunt Minnie.

 


Flywheel Delivers Reproducibility

Flywheel is committed to supporting reproducible research computations.  We make many software design decisions guided by this commitment. This document explains some key reproducibility challenges and our decisions. 

Reproducibility challenges

Flywheel’s scientific advisory board member, Victoria Stodden, writes that reproducible research must enable people to check each other's work. In simpler times, research articles could provide enough information so that scientists skilled in the art could check published results by repeating the experiments and computations. But the increased complexity of modern research and software makes the methods section of a published article insufficient to support such checking. The recognition of this problem has motivated the development of many tools.

Reproducibility and data

A first requirement of reproducibility is a clear and well-defined system for sharing data and critical metadata. Data management tools are a strength of the Flywheel software. The tools go far beyond file formats and directory trees, advancing data management for reproducible research and the FAIR principles.

Through experience working with many labs, Flywheel recognized the limitations of modern tools and what new technologies might help. Many customers wanted to begin managing data the moment they were acquired rather than waiting until they were ready to upload fully analyzed results. Flywheel built tools that acquire data directly from imaging instruments - from the scanner to the database. In some MRI sites, Flywheel even acquires the raw scanner data and implements site-specific image reconstruction. The system can also store and search through an enormous range of metadata including DICOM tags as well as project-specific custom annotations and tags.

Reproducibility and containers

A second requirement of reproducibility is sharing open-source software in a repository, such as GitHub or BitBucket. Researchers, or reviewers, can read the source code and in some cases they can download, install and run it. 

Based on customer feedback, Flywheel learned that (a) downloading and installing software - even from freely available open-source code on GitHub! - can be daunting, (b) customers often had difficulty versioning and maintaining software, as students and postdocs come and go, and (c) they would run the software many times, often changing key parameters, and have difficulty keeping track of the work they had done and the work that remained to be done. 

To respond to these challenges, Flywheel implemented computational tools based on container technology (Docker and Singularity). Implementing mature algorithms in a container nearly eliminates the burden of downloading, compiling, and installing critical pieces of software. Containers bundle the compiled code along with all of its dependencies, such as libraries, into small virtual-machine-like environments that run on many operating systems (PC, Mac, Linux, each with different variants). These containers can be run on a local machine or on a cloud system, eliminating the burden of finding the code, updating all the dependencies, and compiling.

Reproducibility and analyses: Introducing Gears

Once an algorithm is implemented in a container, Flywheel users run it. A lot. They wanted ways to record the precise input data as well as the algorithm version and parameters used as they explored the data. The outputs also needed to be recorded. Such a complete record is difficult for individuals to maintain by hand, yet it is necessary for reproducibility.

Flywheel solves these problems by creating a computational system for managed application containers, which we call Gears. The Gear is structured to record every parameter needed to perform an analysis. When the user runs a Gear, the input data, the specific version of the container, all the parameters needed to run the container, and the output data are recorded in the database. This record is called an ‘Analysis’, and users perform and store hundreds of Analyses on a data set.
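An illustrative sketch of the kind of information such an Analysis record holds (the field names and values here are hypothetical, not Flywheel’s actual schema):

    analysis_record = {
        "gear": {
            "name": "brain-segmentation",          # hypothetical gear name
            "version": "1.2.0",
            "container": "example.registry/brain-segmentation:1.2.0",
        },
        "inputs": [
            {"file": "sub-01_ses-01_T1w.nii.gz", "file_id": "<database-id>"},
        ],
        "config": {"threshold": 0.5, "save_probability_maps": True},
        "outputs": ["segmentation.nii.gz", "volumes.csv"],
        "run": {"started": "2020-06-01T14:03:22Z", "finished": "2020-06-01T14:41:05Z"},
    }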

Because all the information about an Analysis is stored in the database associated with the study, people can re-run precisely the same Gear. It is also straightforward to run the same Gear using different data, or to explore the consequences of re-running the Gear after selecting slightly different parameters. Making Analyses searchable also helps people keep track of which Gears were run and which still need to be run. 

Reproducibility and documentation

Clear writing is vitally important to making scientific work reproducible. Tools that support clear and organized notes during the experiments are also very valuable. During the initial development, Flywheel partnered with Fernando Perez and the Jupyter (then IPython) team to implement tools that built on shared software. Flywheel continues to find ways to support these tools. Flywheel tools permit users to link their data to published papers, write documentation about projects and sessions, and add notes. This documentation is part of the searchable database, and Flywheel will continue to support users in incorporating clean and thorough documentation.

 


Flywheel Delivers Data Management

Persistently storing data is the critical first step in planning for reproducible science. Defining file formats and organizing directories is a good start; in our experience this is where most researchers focus their efforts. But modern computer science provides many technologies that improve data storage, making data FAIR, i.e., findable, accessible, interoperable, and reusable (see Flywheel Delivers FAIR). Flywheel uses these tools to support reproducible science.

Metadata are important

The value of raw data, for example the numerical data of an image, is vastly increased when we know more about the data. This information - called the metadata - can tell us many important things: the instrument parameters used to acquire the data, information about the subject (demographics, medical conditions, etc.), time and place of the acquisition, and facts about the experimental context; for example, that the subject fell asleep during the resting state MR scan.  

The biomedical imaging community recognizes the importance of metadata in two important ways. First, standard file formats (DICOM, NIfTI) embed metadata in the file header. Second, the BIDS standard stores useful metadata in the file name or in an accompanying ‘sidecar’ file.

Storing metadata within a file header, or an accompanying file, is a good start. But using an extensible database offers many advantages. Here is why:

Databases are efficient

Nearly all modern computer operating systems use databases to store files and their metadata. For example, on Apple systems the Get Info command (Cmd-I) returns metadata about a file from the operating system’s database (comments, preview, kind of file) as well as standard POSIX information like file size and date of access. Apple’s Spotlight search uses the same database to identify files.

There are many advantages to storing information about a file in a database rather than in the file header or an accompanying file. For example, we have seen many cases in which people fail to keep the two files together, or rename one of the files and lose the association between the data and metadata. Putting the information in the file header avoids these problems but creates others. Files are distributed across the disk, making searches through file headers very inefficient. Also, files arise from many different sources, and it is virtually impossible to guarantee that every vendor keeps its headers up to date with format changes. Headers are most useful for a particular type of file, not for a large system.

Databases solve these problems by having the user interact with files through a unified interface that includes the name of the raw data file on disk as well as the associated metadata. To read the raw data, one consults the database for the location of the file containing the raw data. To read the metadata, one consults only the database. Typically, the database itself is small, and updates to its format or additions to its content are possible. 

Flywheel uses a document database (MongoDB) to manage user interactions with data and metadata. In the Flywheel system, you can read metadata via the web-browser interface. When programming, you can access metadata using the software development kits (SDKs) or REST API. 
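A simplified sketch of this pattern using a document database directly (the schema below is illustrative, not Flywheel’s actual data model):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["imaging_example"]

    # One document per file: the location of the raw bytes plus searchable metadata.
    db.files.insert_one({
        "name": "sub-01_ses-01_T1w.dcm",
        "storage_path": "/data/raw/3f9c1b",      # where the raw data live on disk
        "metadata": {
            "Modality": "MR",
            "TR_ms": 2300,
            "TE_ms": 30,
            "VoxelSize_mm": [1.0, 1.0, 1.0],
            "notes": "subject reported drowsiness during resting-state scan",
        },
    })

    # Reading metadata touches only the database; the raw file is opened only when
    # its storage_path is actually needed.
    t1_scans = db.files.find({"metadata.Modality": "MR", "metadata.TR_ms": {"$lt": 3000}})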

Metadata can be attached to any object in the system hierarchy

The Flywheel data are organized in a hierarchy: Group, Project, Subject, Session, Acquisition, Files, and Analyses. This hierarchy can incorporate virtually any file type and associated metadata. Most of our customers store files containing medical imaging data in the hierarchy, including MRI, PET, CT, OCT, and pathology images. But some customers store other types of files, such as computer graphics files that are useful for machine learning. All of the objects, both the files and the organizational containers (Project, Subject, Session, Acquisition, Analyses), are described in the database, each with its own metadata. Users can search, annotate, and reuse the files and containers from any level in the Flywheel system.

Metadata are flexible

By using a general database, Flywheel can be complete and flexible. For MRI DICOM files, the database includes all of the header information in the file, such as TR, TE, voxel size, and diffusion directions. In addition, the Flywheel database includes fields for users to place searchable notes, say, about the experiment. The database can also include links to additional experimental information about the subject and auxiliary measures (often behavioral data).

The Flywheel database can add fields without needing to rebuild the entire database. For example, as new MRI technologies developed, we were able to add additional fields that describe the new acquisition parameters. Similarly, Flywheel regularly expands to manage new types of data; as we do so, we add new database fields.

Data reuse

Flywheel helps users reuse data by (a) helping them find data sets and (b) using the search results to create a new project in their database. Adding a database entry eliminates the need for data copying: we simply copy database entries to specify the new project’s sessions, acquisitions, and files. Flywheel calls such a virtual project a 'Collection'.
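Continuing the document-database sketch from above, a virtual project of this kind can be represented as a new document that merely references existing entries, so no image data is copied (again, the schema is illustrative):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["imaging_example"]

    # Build a virtual project from a search: store references, not copies.
    matching_files = db.files.find({"metadata.Modality": "MR"})
    db.collections.insert_one({
        "label": "resting-state-review",
        "file_ids": [doc["_id"] for doc in matching_files],
    })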

Reproducible science 

Data management and the ability to search across all types of objects enhance the value of the data. Carefully storing and managing metadata supports finding and reusing data, two pillars of FAIR and reproducible research.

Contact us here for a demonstration to see how Flywheel’s database and further computing features can be the backbone of your research.


Flywheel Delivers FAIR Principles

The FAIR acronym is a nice way to summarize four important aspirations of modern research practice: scholarly data should be Findable, Accessible, Interoperable, and Reusable. The article describing the FAIR aspirations is excellent, and we recommend reading it. Some limitations of current practice are described here. Our company was founded to advance research and we embrace these principles.

Flywheel, software used by thousands of researchers, embodies tools and technology that deliver on the FAIR principles.

About Flywheel

Flywheel is an integrated suite of software tools that (a) stores data and metadata in a searchable database, (b) includes computational tools to analyze the data, and (c) provides users with both browser-based and command line tools to manage data and perform analyses. Our customers use these tools on a range of hardware platforms: cloud systems, on-premise clusters and servers, and laptops.

Flywheel supports users throughout a project’s life cycle. The software can import data directly from the instrument (such as an MR scanner) and extract metadata from the instrument files, storing it in the database. Auxiliary data from other sources can also be imported. The user can view, annotate, and analyze the data, keeping track of all the scientific activities. Finally, the data and analyses can be shared widely when it is time to publish the results.

FAIR Data Principles Implemented

Findable

Flywheel makes data ‘Findable’ by search and browsing. The Flywheel search tools address the entire site’s dataset, looking for data with particular features. It is straightforward, for example, to find the diffusion-weighted imaging data for female subjects between the ages of 30 and 45. The user can contact the owners of the data for access, and the data returned by a search can be placed in a virtual project (Collection) for reuse and further analysis.

Search is most effective when high-quality metadata are associated with the data and analyses. Flywheel creates a deep set of metadata by scanning and classifying the image data. Users can attach searchable keywords and add data-specific notes at many levels: the overall project, the session, the specific data file, or the analyses. Users can then find data by searching these descriptions.
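For example, the diffusion-weighted search described above might translate into a metadata query like the following (the field names and the document-database framing are illustrative, not Flywheel’s actual search syntax):

    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017")["imaging_example"]

    # Diffusion-weighted acquisitions from female subjects aged 30 to 45.
    results = db.acquisitions.find({
        "classification.Measurement": "Diffusion",
        "subject.sex": "female",
        "subject.age": {"$gte": 30, "$lte": 45},
    })
    for acquisition in results:
        print(acquisition["session"], acquisition["label"])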

Accessible

Our customers frequently observe that there is a conflict between making data accessible (sharing) while complying with health privacy rules. We live in a world with privacy officers on the one hand and open data advocates on the other.

Flywheel delivers an accessible solution that is respectful of both principles. We implemented a rigorous user-rights management system that is easy to use. Access to the data and analyses is controlled through a simple web-based interface. The system implements the different roles that are needed during a project’s life cycle. At first perhaps only the principal investigator and close collaborators have access; later, additional people (reviewers, other scientists) might be granted access to check the data and analyses. When ready, the anonymized data and full descriptions of the analyses can be made publicly viewable. An effective system that manages a project through these stages is complicated to write, but Flywheel makes the system easy to use through its browser interface.

Interoperable

Most scientists have felt the frustration of learning that a dataset is available, but the file format or organization of the data files requires substantial effort to decode and use. The medical imaging community has worked to reduce this burden by defining standardized file and directory organizations. Flywheel is committed to using and promoting these standards.

Our experience teaches us that well-intentioned file formats and directory organizations are not enough. Flywheel stores far more information than what one finds in the header of a DICOM or NIfTI file or the BIDS directory structure. Our commitment to interoperability includes reading in files and directories in these standards and even writing Flywheel data into these formats. Beyond this, we are committed to tools that import and export data and metadata between Flywheel and other database systems.

Flywheel is further committed to supporting the interoperability of computational tools. We have opened our infrastructure so that users can analyze data using Flywheel-defined containerized algorithms, their own containers, or their own custom software. The Flywheel standards are clearly defined based on industry-standard formats (e.g., JSON, Docker, Singularity) so that other groups can use them and in this way support computational interoperability.

Reusable

From its inception, Flywheel was designed to make data reusable. Users at a center can share data within their group or across groups, combine data from different groups, and create and share computational tools. The user can select data from any project and merge it into a new project; such reused data is called a Collection in Flywheel. The original data remain securely in place, and the user can analyze the Collection as a new virtual project. All the analyses, notes, and metadata of the original data remain attached to the data as they are reused.

Equally important, the computational methods are carefully managed and reusable. Each container for algorithms is accompanied by a precise definition of its control parameters and how they were set at execution time. This combination of container and parameters is called a Flywheel Gear, and the specific Gear that was executed can be reused and shared.

More

The FAIR principles are an important part of the Flywheel system. We have also been able to design in additional functionality that supports these principles.

  • Security and data backup are fundamental. The ability to import older data into the modern system has been valuable to many of our customers.
  • The visualization tools built into Flywheel help our customers check for accuracy and data quality as soon as the data are part of the system.
  • The programming interface, supported by endpoints accessible in three different scientific programming languages, permits users to test their ideas in a way that gracefully leads to shared data and code.

Flywheel — Next Generation Research Collaboration

Millions of families are struggling with many unresolved diseases and large, complicated healthcare challenges. In neuroscience alone, billions of dollars are spent each year on Alzheimer’s, traumatic brain injury, autism, and other behavioral and neurodegenerative diseases.

The good news is that, every day, advances in technology and analytics are enabling scientists around the world to make amazing discoveries. The bad news is that the increasing complexity of new technologies is making it very difficult to share and build on each other’s work.

Flywheel offers a cloud-scale collaborative science platform to tackle these issues across both commercial and academic research. We were founded through a collaboration with leading universities and are solving problems every researcher faces on a regular basis. The end goal is to accelerate discovery through collaboration and reproducible research.

Let’s face it: computational science is becoming complicated. The size and complexity of the data are increasing. The tools and methods are also getting more complex. There is pressure from regulatory agencies to protect data, from funding agencies to share data, and from publishers to promote open science. We are solving these next-generation problems.

Scientists around the world take for granted the ability to share personal data. Yet sharing scientific data and methods in a scaled and secure way is not as easy, and often not even possible.

We believe we can address these issues with a cloud-scale research platform that fosters collaboration and reproducible results.

A well-known study published in Nature states that “More than 70% of researchers have tried and failed to reproduce another scientist’s experiments, and more than half have failed to reproduce their own experiments”. As science works through corroboration, this is a big problem, and solving it will speed up discoveries and unlock the potential of disruptive innovations.

So how do we do this? We provide an open, extensible platform for organizing and managing diverse medical data. We capture and organize large amounts of data from virtually any source to make it easy to find and use. We leverage the power and scale of cloud infrastructure to dramatically speed up analysis. We enable collaboration and sharing in a secure and controlled way across labs, institutions, and the globe. Basically, instead of moving huge volumes of data and replicating complex infrastructure, we bring the scientist to the data.

Flywheel targets researchers in both academic and commercial pharma/biotech. Reproducibility is critical to pharmaceutical and biotech companies as they validate any compound, biomarker, or diagnostic being considered for commercialization. Fundamentally it is about translational medicine and accelerating time to market.  We have customers today in the pharma space that are using Flywheel to manage distributed multi-site trials with academic collaborators.

Although our initial focus is neuroimaging, we are designed for multi-modality research. Our team has expertise in medical imaging, specifically neuroimaging and MRI. Research is increasingly multi-modal and multi-disciplinary, which provides a rich trajectory of adjacent markets, both within imaging and in non-imaging data.

The product is well received, and our install base is growing. We have 20 sites installed at leading institutions across the globe. We are particularly excited about our expanding work at Stanford University and Columbia University, where we are moving towards institution-wide deployments. At Columbia alone, we have the potential to connect up to 25 MRI systems and hundreds of labs collaborating using Flywheel. We are excited about the viral nature of the product, which has led to new opportunities at the University of Pennsylvania, New York University, and others.

Stay tuned for more updates as we continue to develop Flywheel!