Incorporating R&D Workflows into Life Science Digital Transformation

Digital Transformation in the Age of AI Requires New Infrastructure

A digital transformation is underway as life science organizations work to reduce costs, increase operational efficiency, and accelerate drug development. Data is the key. Consequently, these organizations are focused on integrating scalable analytics, adopting artificial intelligence (AI) and machine learning (ML), and fully migrating their operations to the cloud. The ultimate objective of these initiatives is to create a culture of collaboration and experimentation that drives innovation and meets the demands of a rapidly evolving healthcare landscape, especially when faced with unpredictable events such as the COVID-19 pandemic.

Medical imaging is an important component of this vision: it is a rich source of patient information that can accelerate drug discovery and development by helping diagnose and assess disease, define and quantify biomarkers, and optimize the clinical trial process. With megapixel upon megapixel of sub-millimeter-resolution data packed into the outputs of X-rays, CT scans, MRIs, and other modalities, medical imaging is ripe for artificial intelligence applications, especially when the optimization of drug development and clinical trials is the ultimate goal.

However, incorporating medical imaging workflows into a digital transformation R&D ecosystem is not trivial. Domain-specific tools are necessary to access and curate large volumes of imaging data and to manage complex computational algorithms and AI workflows, all while maintaining data quality, privacy, and compliance. Additionally, standardization of data and analytical workflows is critical to enable collaboration. If integrated effectively, this powerful technology can greatly accelerate innovation and enable teams to meet their R&D objectives.

In our experience working with life science organizations, there are common challenges that arise when taking on this type of digital transformation. What follows is not an exhaustive list of problems and solutions but rather a few infrastructure-related guidelines that matter as life science companies adopt medical imaging into their digital transformation ecosystems.

Data Management is the Key to Life Sciences R&D

Problem: Consolidating data from disparate sources into a single repository

High-quality data can drive better patient recruitment and engagement and lead to more efficient trials and higher-quality results. With AI and modern image processing techniques, there are new opportunities to gain insights from medical imaging data. Imaging data (mostly DICOM) originates from many disparate sources and partners, including CROs, research institutions, internal stores, and external real-world data sources, all hosted on different systems. Life science companies need to not only bring together data from these sources, but also make this data easily accessible for consolidation, labeling, and conversion to desired formats. In fact, a recent IBM study reported that more than 80% of the effort in AI and big data projects is linked to data preparation [1]. The diversity and complexity of medical data types adds further difficulty and expense to data management.

Solution: A robust database and workflow to handle large volumes of data

Organizations must first validate their data to ensure that it was completely ingested and is appropriate for the research purpose. Next, the data needs to be examined for adequate quality for optimal processing and analysis. Since healthcare data tends to be large, complex, and diverse in nature, enterprise-level scaling requires significant stress testing to ensure that the platform can onboard large numbers of active researchers. Additionally, every data access, curation, and processing action, whether manual or automatic, needs to be logged and tracked to establish reproducibility and audit readiness.

Automated workflows are also mandatory: ingested data is on the order of terabytes or petabytes, and manual processes are inefficient, time-consuming, and prone to human error. At the point of entry, data (and metadata) needs to be de-identified and classified, and quality control algorithms need to be triggered to "prep" the data for larger-scale, complex analysis. Associated non-DICOM (non-imaging) data also needs to be handled with care, as this data is needed for analysis. All ingested data ultimately requires a flexible, robust, searchable framework where all metadata and processes are automatically indexed and immediately available for search within the system.
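
To make the point-of-entry step concrete, here is a minimal Python sketch of an ingest hook that de-identifies, classifies, and quality-checks a single DICOM file. It is illustrative only, not Flywheel's implementation: the PHI tag list is a small assumed subset, and production pipelines follow a vetted de-identification profile.

    import pydicom

    # Illustrative subset of PHI tags; real profiles follow DICOM PS3.15.
    PHI_TAGS = ["PatientName", "PatientBirthDate", "PatientAddress"]

    def ingest(path):
        """De-identify, classify, and QC one DICOM file at the point of entry."""
        ds = pydicom.dcmread(path)
        for tag in PHI_TAGS:
            if hasattr(ds, tag):
                setattr(ds, tag, "")                   # empty the PHI element
        modality = getattr(ds, "Modality", "unknown")  # classify by modality
        passed_qc = hasattr(ds, "PixelData")           # trivial QC: pixel data present?
        return ds, modality, passed_qc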

Cloud Scale Computing Enables AI and Complex Analysis

Problem: Medical image processing and AI place high demand on resources 

Large-scale data analysis in medical imaging often revolves around the use of multiple complex algorithms to create "pipelines", i.e., data processing elements connected in series, where the output of one element is the input of the next. These pipelines are necessary for image segmentation, biomarker quantification, and synthetic data creation, and are in most cases applied to hundreds or thousands of data sets. Inevitably, local IT infrastructures struggle to maintain the many algorithms and associated processing workflows, especially when developers want to fully utilize a multitude of CPUs, GPUs, and TPUs.
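
The series composition is simple to picture in code. Below is a minimal, hypothetical Python sketch in which each stage consumes the previous stage's output; the stage names are placeholders for real tools such as skull-stripping, registration, or segmentation.

    def skull_strip(image):
        return image          # placeholder for a real algorithm

    def segment(image):
        return image          # placeholder for a real algorithm

    def run_pipeline(image, stages):
        # Each element's output becomes the next element's input.
        for stage in stages:
            image = stage(image)
        return image

    result = run_pipeline("scan.nii.gz", [skull_strip, segment])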

Solution: A cloud-scale processing infrastructure integrated with a curated database

Flexible deployment of pipelines can greatly ease the strain of development. Containerizing these pipelines (or pipeline components) reduces the IT burden of maintaining these algorithms over time and promotes reproducible practices. A processing infrastructure that can leverage local compute resources for low-volume processing, combined with elastic cloud scaling for large-scale processing, is a strategy that optimizes for both cost and capacity.

As life science companies look to machine learning to guide the future of their product development, they need machine learning workflows with comprehensive provenance for reproducibility and regulatory approvals. Ideally, organizations want the ability to easily search and locate cohorts of data, train AI models, and run data conversion and quality assurance locally, then scale the models in the cloud. This workflow has benefited many life science companies working with medical imaging and other associated data sets.

Collaboration Across the Enterprise Drives Innovation

Problem: Not only are data and algorithms siloed, so are the people

Many large life science companies employ a vast array of scientists and engineers across many geographies. These professionals, in many cases, need to collaborate with internal and external partners to advance an R&D initiative. Inevitably, their ability to collaborate is closely tied to their ability to share large-scale data and complex processing pipelines.

Solution: Data and processing pipelines should be closely linked and in the cloud

Migrating data in the life science industry from one location to another has its share of complexities, ranging from large data transfer bottlenecks to regulatory compliance. Additionally, many of these companies have teams located all over the world, requiring observance of the regulatory requirements of each country or region. Federating databases across disparate regions, with computational resources closely tied to data locality, provides researchers with a seamless resource where data and algorithms can be accessed, eliminating the need to manage multiple databases. Leveraging web interfaces and software development kits, users can securely access the platform, upload data, and process it. Additionally, privileges to access data and algorithms can be granted with secure controls and in compliance with regulatory constraints.

The Way Forward in Life Sciences R&D

The modern life science company is moving towards a "data-driven" operational model. Medical imaging plays an important role in this new paradigm, as the power of diagnostic tools can greatly enhance R&D discovery and clinical trial outcomes. Additional data types such as digital pathology, microscopy, and genomics are becoming complementary additions to multi-modal research, adding significant value for the diagnosis of complicated diseases but also creating additional complexity in the data management process. The integration of all data types as part of a digital transformation initiative requires an all-encompassing solution that can curate and organize large volumes of these data types (and related data), enable complex processing and AI pipelines, and provide the tools necessary to enhance collaboration across many teams and partners.

Author: Jim Olson, CEO, Flywheel Exchange, LLC.

To learn more about Flywheel’s enterprise-scale research data management platform and how it enables digital transformation in the life sciences, please click here or email info@flywheel.io.

[1] https://www.ibm.com/cloud/blog/ibm-data-catalog-data-scientists-productivity


Improved Collaborative Workflows with Custom Roles and Permissions

Flywheel is committed to providing customization tools for a secure, collaborative workflow. Previously, Flywheel offered fixed, predefined roles and permissions for administrators to match to site users and project collaborators. Now, administrators have complete control over defining roles and user permissions for projects using a simple interface.

Tailor Your Workflows With Custom Roles and Permissions

The Custom Roles and Permissions interface enables you to: 

Align the Flywheel system with specific responsibilities of the users. Select user capabilities for project management, access to files and metadata, and computational permissions.

Ensure your workflow is consistent with your organization’s policies. Define roles that ensure your research process follows organizational policies for viewing, modifying, and deleting data.

Implement fine-grained control to prevent unauthorized use and reduce risk. Ensure data integrity by entrusting only specific users with the ability to modify data.

Easily coordinate responsibilities in multi-site collaborations. Assign the permissions collaborators need while observing each institution’s procedures.

Flexible Controls Enable a Variety of Applications

For example, here’s how you might use custom roles and permissions:

  • Data Managers in clinical trials can be restricted from viewing or modifying analyses.
  • A statistician role can be created with permissions to run gears and perform analyses, but restricted from deleting or modifying underlying data.
  • A compliance coordinator role can be created with limited permissions to view metadata and data only to ensure project contents are valid and complete.

Powerful, Easy-to-Use Controls

Custom roles are defined at the site level, enabling consistency in permission sets across the site. Controls cover a user’s level of access to data, which data those permissions apply to, and other key operations such as running analyses or downloading data.

Creating an "Analyst" role with limited project permissions but the ability to work with analyses

Research groups may then select from the site’s defined roles those that fit their workflow. Users are assigned one or more roles at the project level.

Setting roles at the project level - Note that users can be assigned multiple roles


You may find additional information about setting User Roles & Permissions in our documentation.


Advanced Search for Finding and Repurposing Data

Flywheel has released a powerful new Advanced Search capability in version 11.2. This tool extends previous search functionality to allow users to construct complex queries and quickly pinpoint the data they need. 

Key features include:

  • The ability to search any metadata, including ROIs
  • A new SQL-like query language for complex AND/OR queries
  • An easy-to-use visual query builder
  • The ability to save and manage queries

Advanced Search enables a variety of applications including:

Exploring project data to ensure consistency and quality. Search metadata on any object in the Flywheel database to find cases that meet, or fail to meet, required criteria.

Finding and repurposing data from multiple projects for secondary applications and research.  Search for relevant data sets meeting requirements for new applications using standard metadata, custom metadata and attributes of sessions and experiments. 

Creating machine learning data sets. Create training sets from search results using metadata and, for example, ROIs drawn in Flywheel’s integrated DICOM viewer.

Accessing Advanced Search

To access Advanced Search, simply click ‘Advanced Search’ in the left navigation panel from the search results screen.

A Simple, Powerful Visual Query Builder

Flywheel’s new Visual Query Builder makes it easy to construct complex queries combining search terms for projects, subjects, Gears, and file metadata.

Metadata fields can be easily added to search queries from within the Visual Query Builder. As you click to define the field you are searching for, Flywheel offers dropdowns for easy selection. 

Powerful SQL-Like Query Language

Users who are more familiar with SQL can manually construct queries using Flywheel’s simplified query language, FlyQL. FlyQL enables access to all metadata, including DICOM tags and custom metadata.

While clicking through the Visual Query Builder, you will see a FlyQL query being built on the left. You may choose to write your queries in this editor as well. The type-ahead feature, which suggests text to autocomplete the query, allows you to quickly find the data points you need.

Manage and Save Queries

After constructing a query, click Save Query to preview, share, and reuse it. Find your saved queries below the FlyQL Query Editor.

Managing Search Results 

For example, the following query finds male subjects between 40 and 80 years of age:

subject.sex = male AND session.age_in_years >= 40 AND session.age_in_years <= 80

Use the left navigation panel to further filter results. You may choose to see results in Sessions, Acquisitions, Files or Analyses at the top.

After selecting results, users may click the Actions dropdown to choose to download selected results, add them to a collection, or run a batch gear on search results. 

Search Image Annotations and ROIs

AI Developers may search for annotated images, including by regions of interest (ROIs), and create machine learning training sets with ease. 

file.type = dicom AND file.info.roi.label = Lesion AND file.info.roi.area > 300

Additional Examples

project.label IN ["PSY Study", "NIMH Project", "JBM Project"]
AND session.created >= 2020-02-01
AND session.created <= 2020-02-29
AND session.satisfies_template is false


Search for all sessions in three projects created during a given month that do not adhere to their project’s template

analysis.label CONTAINS afq AND file.name CONTAINS analysis_summary
AND subject.cohort = Control


Search for analysis summary reports for all Automated Fiber Quantification analyses performed on a subject cohort.

You may find additional information about Advanced Search in our documentation. Please reach out to our Support Team with any questions about this feature or about other updates in 11.2.



Computing with Flywheel

I am often asked to explain how Flywheel supports a broad range of computational workflows, including:

  • Working with existing pipelines
  • Exploratory development and analysis
  • Automating routine processing

Flywheel offers an open and extensible approach that provides you the flexibility to work in the manner that makes sense for your lab or project.

Working with existing processing pipelines

The simplest approach for working with existing pipelines involves downloading the required data from Flywheel and processing it as usual. Flywheel provides several download options including the web-based UI and command line tools. For more control over selecting and formatting data, Flywheel provides easy-to-use programming interfaces for use with leading scientific languages including Python, MATLAB, and R. These may be used to access, format, and download any data or metadata in the Flywheel database.  

Exploratory Development and Analysis

For developing new algorithms or pipelines, Flywheel’s Python and MATLAB SDKs provide a powerful alternative to downloading to disk. Using the SDKs, a Python or MATLAB user may work with data in Flywheel directly from their preferred scripting language. Full search is available along with simple commands for reading and writing data and metadata.
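
As a minimal sketch of that style of work (assuming an API key already configured for your site, and with a hypothetical group/project path), a Python user might browse a project's hierarchy directly:

    import flywheel

    fw = flywheel.Client()  # connects with your configured credentials

    project = fw.lookup("mygroup/myproject")   # hypothetical path
    for session in project.sessions():
        for acq in session.acquisitions():
            for f in acq.files:
                print(session.label, acq.label, f.name)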

Routine Processing with Plug-In Applications (Gears)

Gears are plug-in applications that automate routine tasks, including metadata extraction, classification, quality assurance, format conversion, and full analytic pipelines.  Here’s how gears work:

Leveraging Standard OCI-Compliant Containers

From a technical perspective, Gears are applications running in standard OCI-compliant (Docker, Singularity, etc.) containers that are managed by Flywheel. A container typically packages application code and all of its dependencies to create a portable, reproducible unit of processing. Containers can be easily made into Gears by adding metadata that explains to Flywheel how to use the containerized application. This metadata is expressed via a simple JSON file that includes descriptive metadata, such as links to source code, authors, etc. It also includes instructions for passing in data, configuration options, and how to execute commands in the container.
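
As an illustration, an abbreviated manifest might look like the following. The gear name, input, configuration option, and command are hypothetical, and the real schema includes additional fields; consult Flywheel's gear specification for the full format.

    {
      "name": "example-converter",
      "label": "Example DICOM Converter",
      "description": "Illustrative gear manifest (abbreviated).",
      "version": "0.1.0",
      "author": "Example Lab",
      "source": "https://github.com/example/example-converter",
      "license": "MIT",
      "inputs": {
        "dicom": { "base": "file", "type": { "enum": ["dicom"] } }
      },
      "config": {
        "compress": { "type": "boolean", "default": true }
      },
      "command": "python run.py"
    }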

Automating and Scaling Gear Execution

Gears may be run in a variety of ways. They may be executed on demand for a given data set. They may also be run in batch mode for a selected collection of data sets. In these cases, the user is prompted for inputs prior to execution. Gears may also be run automatically by rules configured for the project. For example, when a DICOM series is uploaded, it can be classified and, if it is imaging data, converted to NIfTI. Gear rules may be used to automate routine pre-processing as well as trigger complex pipelines. Gears may also be scheduled by tasks outside of Flywheel using the command line tool (CLI) or programming interfaces. Finally, when deployed in cloud or private cloud infrastructures, Flywheel can dynamically scale resources to maximize parallel processing and save you time.
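
For example, launching a gear on demand from the Python SDK might look like the sketch below, which continues the hypothetical "example-converter" manifest above. The gear name and the group, project, and container paths are placeholders, and this assumes credentials already configured (for example, via the Flywheel CLI).

    import flywheel

    fw = flywheel.Client()  # uses the credentials configured for your site

    # Hypothetical gear and destination; substitute your own paths.
    gear = fw.lookup("gears/example-converter")
    session = fw.lookup("mygroup/myproject/subj-01/session-01")
    acq = session.acquisitions()[0]

    # Run on demand; Flywheel records the job and its provenance.
    job_id = gear.run(
        inputs={"dicom": acq.files[0]},
        config={"compress": True},
        destination=session,
    )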

Process Any Level of Data in Your Project

Gears may be designed to process data at different levels of the Flywheel project hierarchy.  Gears may process individual sessions (exams/DICOM studies). For longitudinal studies, Gears may be used to process at the subject (participant/patient) level with the ability to process data from multiple sessions. Finally, project-level Gears may be used to perform group/cohort analyses across all subjects.  

Automated Provenance

A key advantage of using Gears to manage routine processing is the documentation that results. Every time a Gear is run, Flywheel records a great deal of derivative information that supports the consistency and reproducibility of your project. These "Analysis" documents record the Gear version, who ran it, when it ran, success/fail status, inputs, configuration options used, and outputs produced. Further, they may be annotated with notes or structured JSON metadata to meet your project needs. This provenance makes it easy to ensure that all necessary processing steps were performed, and performed consistently.
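
Those records are accessible programmatically as well. The sketch below, with hypothetical paths and assuming the attribute names match the current SDK models, reads back the provenance of each analysis on a session.

    import flywheel

    fw = flywheel.Client()
    session = fw.lookup("mygroup/myproject/subj-01/session-01").reload()

    # Each analysis records which gear ran, on what inputs, and what it produced.
    for analysis in session.analyses or []:
        gi = analysis.gear_info
        print(analysis.label, gi.name if gi else "?", gi.version if gi else "?")
        print("  inputs: ", [f.name for f in (analysis.inputs or [])])
        print("  outputs:", [f.name for f in (analysis.files or [])])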

Flywheel Gear Exchange

To speed project deployment, Flywheel provides a library of commonly used algorithms as Gears via the Flywheel Gear Exchange. The Gear Exchange currently contains roughly 70 Gears contributed by Flywheel or Flywheel users. Examples include DICOM-to-NIfTI conversion, Freesurfer Recon-All, the Human Connectome Pipelines, and commonly used BIDS applications, such as MRIQC and FMRIPrep. The Gear Exchange provides a powerful way to share reproducible units of code that may be used as building blocks for new projects.

User-Developed Custom Gears

Users may easily create their own Gears as well. Gear developers simply get their code running in an OCI-compatible container and provide the gear metadata. Applications may be developed in any language. Flywheel’s APIs and SDKs may be used in a Gear if needed; otherwise, the containerized application need not be Flywheel-aware.

Flywheel streamlines the process of creating the Gear metadata via the CLI Gear Builder tool which prompts the user through the required information and generates most of the metadata automatically. The resulting Gears may be shared with other Flywheel sites via the Flywheel Gear Exchange, or may be kept private by uploading them only to the user’s site. Flywheel does not make any claim on any of the intellectual property in customer Gears.

Conclusion

Flywheel makes it easy to work the way you want. Our open CLI, APIs, and SDKs let you download data and use existing processes. Our Gears framework allows you to automate routine processing consistently, with extensive documentation to support quality and reproducibility.

Read more about our scientific collaborations or send us your questions!


Saving Time and Money with Flywheel HPC Integration 

Flywheel, a comprehensive research data platform for medical imaging, machine learning, and clinical trials, recently rolled out a beta integration feature for High Performance Computing (HPC) clusters, including Slurm and SGE. With this new feature, Flywheel supports computing on HPC clusters, in addition to more traditional virtual machine (VM)-based deployments in the cloud or on-premises.

As capital investments (frequently in the millions of dollars), HPC systems constitute an enormous opportunity as a local, shared resource for an organization. At the same time, these systems are often difficult or confusing to use, due to their specialized nature and older technology base. Access is frequently restricted, and the workflow for running software on an HPC cluster is significantly different from that of a traditional machine, due to the "drop off & pick up" nature of the interaction.

Furthermore, debugging tends to have an extremely long turnaround time, due to the system's fluctuating and inscrutable job queue. This tends to result in idle cluster capacity. Flywheel can increase HPC utilization, by making the system more accessible to a large user community.

During beta testing alone, one Flywheel customer estimated savings in excess of $7,000 over a period of one month by moving some of their compute-intensive workloads from cloud hosting to their university-sponsored HPC. In that time, over four months' worth of single-machine, eight-core work was completed. Much of that capacity would otherwise have sat idle on their cluster.

With Flywheel, scientific algorithms run in OCI-compliant (Docker, etc.) containers, called Gears. When using Flywheel with the new HPC integration, customers work directly with us to whitelist specific Gears for this feature, but still access the same point-and-click experience available for all Gears. The Flywheel system translates the request into the system-specific format, submits the HPC job using the Singularity container runtime, waits for the HPC queue to pick up the work, and marshals input and output data to and from the system.

The result is that all of Flywheel’s computation management features - such as batch jobs, SDK integration, and Gear Rules - work out of the box on HPC systems or local hardware, with great potential for improving productivity and reducing costs.


“The Flywheel integration with the HPC at Penn has been a total game-changer. It allows us to leverage the complementary advantages of two powerful systems. By launching compute jobs as containerized Gears through Flywheel, we can ensure total reproducibility. Furthermore, by integrating Flywheel with the massive computational resources provided by the Penn HPC, run by Christos Davatzikos, we can run computationally demanding jobs at scale across large samples without worrying about cloud compute charges. Throughout, the Flywheel engineering team was incredibly responsive; it was really a model for successful collaboration.”

– Ted Satterthwaite, MD, Assistant Professor in the Department of Psychiatry at the University of Pennsylvania Perelman School of Medicine


Leveraging Flywheel for Deep Learning Model Prediction

Since 2012, the Medical Image Computing and Computer Assisted Intervention Society (MICCAI) has put on the Brain Tumor Segmentation (BraTS) challenge with the Center for Biomedical Image Computing and Analytics (CBICA) at the Perelman School of Medicine at the University of Pennsylvania. The past eight competitions have seen rapid improvements in the automated segmentation of gliomas. This automation promises to address the most labor-intensive process required to accurately assess both the progression and effective treatment of brain tumors.

In this article, we demonstrate the power and potential of coupling the results of this competition with a FAIR (Findable, Accessible, Interoperable, Reusable) framework. Since constructing a well-labeled dataset is the most labor-intensive component of processing raw data, it is essential to automate this process as much as possible. We utilize Flywheel as our FAIR framework to demonstrate this process.

Flywheel (flywheel.io) is a FAIR framework that couples a proprietary core infrastructure with open-source extensions (gears) to collect, curate, compute on, and collaborate on clinical research data. The core infrastructure of a Flywheel instance manages the collection, curation, and collaboration aspects, enabling multi-modal data to be quickly searched across an enterprise-scale collection. Each “gear” of the Flywheel ecosystem is a container-encapsulated open-source algorithm with a standardized interface. This interface enables consistent stand-alone execution or coupling with the Flywheel core infrastructure—complete with provenance of raw data, derived results, and usage records.

For the purposes of this illustration, we wrap into a gear the second-place winner of the MICCAI 2017 BraTS Challenge. This team’s entry is one of the few that has both a Docker Hub image and a well-documented GitHub repository available. Their algorithm is built on the TensorFlow and NiftyNet frameworks for training and testing their deep learning model. As illustrated in our GitHub repository, this “wrapping” consists of providing the data configuration expected by their algorithm and launching their algorithm for model prediction (*).

As shown in the figure above, Flywheel provides a user-friendly interface to navigate to the MRI images expected for execution. With the required co-registered and skull-stripped MRI modalities (T1-weighted, T1-weighted with contrast, T2-weighted, and Fluid Attenuation Inversion Recovery), segmentation into distinct tissues (normal, edema, contrast-enhancing, and necrosis) takes twelve minutes on our team’s Flywheel instance (see figure below); a person can take over an hour to segment the same tumor. When performed on a graphics processing unit (GPU), this task takes less than three minutes to complete.

Segmentation into normal, edema, contrast enhancing, and necrosis tissues with the Flywheel-wrapped second place winner of the 2017 BraTS Challenge.

Although this example predictively segments the tumor of a single patient, modifications to this gear can allow tumor segmentation of multiple patients for multiple imaging sessions over the course of their care. Furthermore, with scalable cloud architecture, these tasks can be deployed in parallel, significantly reducing the overall time required to iterate inference over an entire image repository. Enacting this as a pre-curation strategy could significantly reduce the time necessary for manual labeling of clinical imaging data. 

Therein lies the vast potential benefit of using a strong FAIR framework in an AI-mediated workflow: the ability to pre-curate new data, optimize human input, and retrain on well-labeled data over accelerated time-scales. These model design, train, and test cycles are greatly facilitated by a FAIR framework, which is able to curate the data, results, and their provenance in a searchable interface.

As with this brain tumor challenge example, there are many other similar challenge events that make their algorithms and pretrained models publicly available for the research community.  One nexus of these is the Grand Challenges in Biomedical Image Analysis, hosting over 21,000 submissions in 179 challenges (56 public, 123 hidden).  Flywheel’s capacity to quickly package these algorithms to be interoperable with its framework makes it a powerful foundation for a data-driven research enterprise.

Two more useful deep learning, GPU-enabled algorithms have recently been incorporated into Flywheel gears. First, quickNAT uses default or user-supplied pre-trained deep learning models to segment neuroanatomy within thirty seconds when deployed on sufficient GPU hardware. We have wrapped a PyTorch implementation of quickNAT in a Flywheel gear. Prediction of brain regions on CPU hardware requires two hours; although much longer than the thirty seconds needed on a GPU, it is still a fraction of the nearly twelve hours needed for FreeSurfer’s recon-all. Next, we have Nobrainer, a deep learning framework for 3D image processing. The derived Flywheel gear uses a default (or user-supplied) pre-trained model to create a whole-brain mask within two minutes on a CPU. Utilizing a GPU brings this time down to under thirty seconds.

The previous paragraph raises two questions. First, with GPU model prediction times significantly faster than CPU times, when will GPU-enabled Flywheel instances be available? Second, how can Flywheel be effectively leveraged in training deep learning models? Flywheel is actively developing GPU-deployable gears and the architecture to deliver them. We briefly explore the second question next, leaving a more thorough investigation for another article.

Training on an extensive and diverse dataset is needed for deep learning models to generalize effectively and accurately to unseen data. For uncommon conditions, such as gliomas, finding enough high-quality data at a single institution can be daunting. Furthermore, sharing these data across institutional boundaries incurs the risk of exposing protected health information (PHI). With Federated Training, deep learning models (and their updates) are communicated across institutional boundaries to acquire the abstracted insight of distributed annotation. This eliminates the risk and burden of transferring large data repositories while still giving models access to a diverse dataset. With Federated Search across institutional instances of Flywheel firmly on the roadmap, this type of Federated Training of deep learning models will be possible within the Flywheel ecosystem.

(*) The authors of this repository and the University College London do not explicitly promote or endorse the use of Flywheel as a FAIR framework. 


Why a Research-First Platform for Imaging Informatics and Machine Learning?

It's no secret that researchers face many challenges that impede the research and development of artificial intelligence (AI) solutions in clinical settings. Machine learning requires large volumes of data for accuracy in most applications. Institutions often have a wealth of data but lack the systems needed to get it into the hands of researchers cost-effectively.

Those data must be of high quality and labeled correctly. Imaging projects often involve complex preprocessing to identify and extract features and biomarkers. To further complicate matters, security and privacy are critical, particularly when involving collaboration outside of the context of clinical care.

Unfortunately, established clinical solutions fail to address six critical needs of researchers, impeding research productivity and slowing innovation.

Multimodality

Imaging offers significant opportunities for machine learning, but imaging is often not enough. Given that so much of today's research is centered around precision medicine and opportunities to revolutionize the cost and quality of care, researchers often require a 360° view of patients, including EMR, digital pathology, EEG, -omics, and other data. Clinical imaging systems such as PACS and vendor-neutral archives (VNAs) are designed specifically for imaging and typically don't deal well with nonimaging data, particularly in the context of research workflows.

Cohorts, projects, and IRB compliance

Researchers require the ability to organize and analyze data in cohorts while enabling collaboration with others outside of the context of clinical care. Clinical imaging systems are designed for individual patient care, not for cohort or population health studies, and often lack the organizational structures required for research applications such as machine learning. Institutional review boards (IRBs) typically define for a project the scope of allowed data as well as the people authorized to work with that data. Modern research informatics systems must enable productive workflows while enforcing these IRB constraints.

Quality assurance

Machine learning can be highly sensitive to the quality of the data. Researchers must be able to confirm the quality of data, including completeness and consistency with the protocol defined for the study. Quality control and supporting documentation are required for scientific reproducibility and for processes such as U.S. Food and Drug Administration (FDA) approval. Consequently, modern informatics systems must incorporate comprehensive support for quality assurance as part of the workflow.

Integrated labeling and annotation workflows

Machine learning depends on accurately labeled sample datasets in order to effectively train AI models. Real-world data, often originating from multiple sources, generally lack the structure and consistent labels required to directly support training. Modern imaging informatics solutions must provide the ability to efficiently organize and classify data for search and selection into the appropriate projects or machine-learning applications. Labeling workflows must be supported, including the ability to normalize classification of images and other factors such as disease indications. In the context of imaging, this may involve image annotations collected from radiologists or other experts in a consistent, machine-readable manner via blind multireader studies or similar workflows.

Automated computational workflows

Imaging and machine learning are computationally intensive activities. Research informatics platforms must automate and scale computational workflows ranging from basic image preprocessing to analytic pipelines and training AI models. The ability to rapidly define and integrate new processes using modern tools and technologies is critical for productivity and sustainability. These systems must also provide the ability to leverage diverse private cloud, public cloud, and high-performance computing (HPC) infrastructures to achieve the performance required to process large cohorts cost-effectively.

Integrated data privacy

Data privacy is critical. Compliance with regulations such as HIPAA and GDPR is a must, given the potential financial and ethical risks. However, the lack of scalable systems for ensuring data privacy is impeding researcher access to data and, therefore, slowing innovation and the related benefits. Modern research informatics solutions must systematically address data privacy. Regulations require deidentification of protected health information to the minimum level required for the intended use. However, the minimum level of identification may differ by project. Consequently, informatics solutions must integrate deidentification and related data privacy measures in a way that can meet the needs of projects with different requirements while maintaining compliance.

Data as a strategic asset with FAIR

Data is the key to clinical research and machine learning. A scalable, systematic approach to research data management should be the foundation of research strategies aimed at machine learning and precision care. Cost-effectively scaling access to clinical data in a manner that supports research workflows while ensuring security and data privacy can improve research productivity, accelerate innovation, and enable research organizations to realize their strategic potential.

Implementing the FAIR principles in your organization helps maximize the strategic value of data that exists in your institution. These principles, developed by academics, agency professionals, and industry members, amplify the value of data by making it Findable, Accessible, Interoperable, and Reusable (FAIR).

  • Findable data are labeled and annotated with rich metadata, and the metadata are searchable.
  • Accessible data are open to researchers with the correct authorization, and the metadata persist even after data are gone.
  • Interoperable data follow standards for storing information and can operate with other metadata and systems.
  • Reusable data are well-described and well-tracked with provenance for computation and processing.

Modern informatics systems should deliver on the FAIR principles while supporting the workflow needs of researchers as described above.

A clinical research platform designed to enhance productivity and accelerate innovation

Flywheel is a new class of informatics platform that addresses the unique needs of researchers involved in imaging and machine learning. Deployed at leading research institutions around the world, Flywheel supports the entire research workflow including capture, curation, computation, and collaboration, plus compliance at each step.

Capture

Flywheel is designed for true multimodality research. While the system specializes in the unique data types and workflows associated with imaging, the platform is capable of managing nonimaging data such as EMR, digital pathology, EEG, genomics, or any other file-based data. Further, Flywheel can automate data capture from imaging modalities and also clinical PACS and VNAs to streamline research workflows as well as translational testing scenarios.

Curate

Flywheel is unique in its ability to organize and curate research data in cohort-centric projects. The platform provides extensive tools for managing metadata including classification and labeling. Quality assurance is supported through project templates and automation rules. Integrated viewers with image annotation and persistent regions of interest (ROIs) are provided to support blind multireader studies and related machine-learning workflows. Powerful search options with access to all standard or custom metadata are provided to support the FAIR principles.

Compute

Flywheel provides comprehensive tools to automate routine processing, ranging from simple preprocessing to full analytic pipelines and training machine-learning models. The platform scales computational workloads using industry-standard "containerized" applications referred to as "Gears." Gears may originate from Flywheel's Gear Exchange containing ready-to-use applications for common workflows or may be user-provided custom applications. The platform supports elastic scaling of workloads to maximize performance and productivity. Gears automate capture of provenance to support scientific reproducibility and regulatory approvals. Further, Flywheel helps you work with existing pipelines external to the system with powerful APIs and tools for leading scientific programming languages, including Python, MATLAB, and R.

Collaborate

Collaboration is enabled through secure, IRB-compliant projects. Collaboration may be within an institution or across the globe for applications such as clinical trials or multicenter studies. Flywheel projects provide role-based access controls to authorize access and control sharing of data and algorithms. Data may be reused across project boundaries for applications such as machine learning, which require as much data as possible.

Compliance

Flywheel helps reduce security and data privacy risks by providing a secure, regulatory-compliant infrastructure for systematically scaling research data management according to HIPAA and GDPR requirements. The platform provides integrated tools for deidentification of research data to safeguard protected health information.

A research-first platform answers the challenges to implementing AI

Flywheel's innovative research informatics platform helps you maximize the value of your data and serves as the backbone of your imaging research and machine learning strategy. Flywheel overcomes the limitations of systems designed for clinical operations to meet the unique needs of researchers. The result is improved collaboration and data sharing and reuse. Ultimately, Flywheel improves research productivity and accelerates innovation.

Original article can be found on Aunt Minnie



Four AI Workflow Trends from RSNA 2019

The Biggest Trend: Maturing Implementation of AI

Attendees who visited our booth last year were interested in learning about AI capabilities. This year, they brought questions about implementing the infrastructure needed for AI and how to scale AI research in their organizations. Scaling access to clinical data and interoperability appear to be rising concerns this year. Organizations are also gradually accepting cloud scaling as a secure option.

Radiologists are beginning to plan for AI in their standard workflows. Many radiologists in our booth asked how AI research fits into their current clinical workflows.

Data Curation for Research Still Falls Short

The focus of many workshops and presentations from radiologists was “data wrangling” and data set quality. We received many questions from attendees regarding metadata management and labelling tools. At the same time, there is growing recognition that clinical systems don’t meet the needs of the research and AI development communities, and that an entirely new class of solution that supports the research workflow is needed.

We recommend Dr. Paul Chang’s (University of Chicago) AuntMinnie interview during RSNA: “AI is like a great car … Most cars still need gas and roads. In the context of this analogy, gas is vetted data and the road is workflow orchestration that is AI-enabled... The only way to make a transformative technology real is to do the boring stuff, the infrastructure stuff.”

Everyone Noticed the Busy AI Showcase

The AI Showcase was very active this year. In 2018, there were roughly 70 vendors in the AI Showcase, but this year there were 129, including many international AI vendors. We noticed growth in AI development for cardiac and brain imaging.

It’s Imminent: Equipment Vendors are Integrating AI Workflows

AI is moving beyond the desktop as imaging equipment manufacturers set their sights on supporting research workflows. Leading equipment manufacturers like Philips and Canon displayed developments in their interfaces to support AI and analysis tools in disease-specific applications. Flywheel is expanding partnerships with AI vendors and equipment vendors in addition to supporting clients performing imaging and clinical research.

CEO Travis Richardson presenting at the Google Cloud Booth about Flywheel’s scalable infrastructure for machine learning.

Flywheel Delivers Reproducibility

Flywheel is committed to supporting reproducible research computations.  We make many software design decisions guided by this commitment. This document explains some key reproducibility challenges and our decisions. 

Reproducibility challenges

Flywheel’s scientific advisory board member, Victoria Stodden, writes that reproducible research must enable people to check each other's work. In simpler times, research articles could provide enough information so that scientists skilled in the art could check published results by repeating the experiments and computations. But the increased complexity of modern research and software makes the methods section of a published article insufficient to support such checking. The recognition of this problem has motivated the development of many tools.

Reproducibility and data

A first requirement of reproducibility is a clear and well-defined system for sharing data and critical metadata. Data management tools are a strength of the Flywheel software. The tools go far beyond file formats and directory trees, advancing data management for reproducible research and the FAIR principles.

Through experience working with many labs, Flywheel recognized the limitations of modern tools and what new technologies might help. Many customers wanted to begin managing data the moment they were acquired rather than waiting until they were ready to upload fully analyzed results. Flywheel built tools that acquire data directly from imaging instruments - from the scanner to the database. In some MRI sites, Flywheel even acquires the raw scanner data and implements site-specific image reconstruction. The system can also store and search through an enormous range of metadata including DICOM tags as well as project-specific custom annotations and tags.

Reproducibility and containers

A second requirement of reproducibility is sharing open-source software in a repository, such as GitHub or BitBucket. Researchers, or reviewers, can read the source code and in some cases they can download, install and run it. 

Based on customer feedback, Flywheel learned that (a) downloading and installing software - even from freely available open-source code on GitHub! - can be daunting, (b) customers often had difficulty versioning and maintaining software, as students and postdocs come and go, and (c) they would run the software many times, often changing key parameters, and have difficulty keeping track of the work they had done and the work that remained to be done. 

To respond to these challenges, Flywheel implemented computational tools based on container technology (Docker and Singularity). Implementing mature algorithms in a container nearly eliminates the burden of downloading, compiling, and installing critical pieces of software. Containers bundle the compiled code along with all of its dependencies, such as libraries, into small virtual-machine-like units that can run on many operating systems (PC, Mac, Linux, each with different variants). These units can run on a local machine or on a cloud system, removing the need to find the code, update all the dependencies, and compile.

Reproducibility and analyses: Introducing Gears

Once an algorithm is implemented in a container, Flywheel users run it. A lot. They wanted ways to record the precise input data as well as the algorithm version and parameters used as they explored the data. The outputs also needed to be recorded. Such a complete record is difficult for individuals to maintain by hand, yet it is necessary for reproducibility.

Flywheel solves these problems with a computational system for managed application containers, which we call Gears. A Gear is structured to record every parameter needed to perform an analysis. When the user runs a Gear, the input data, the specific version of the container, all the parameters needed to run the container, and the output data are recorded in the database. This record is called an ‘Analysis’, and users perform and store hundreds of Analyses on a data set.

Because all the information about an Analysis is stored in the database associated with the study, people can re-run precisely the same Gear. It is also straightforward to run the same Gear using different data, or to explore the consequences of re-running the Gear after selecting slightly different parameters. Making Analyses searchable also helps people keep track of which Gears were run and which still need to be run. 

Reproducibility and documentation

Clear writing is vitally important to making scientific work reproducible. Tools that support clear and organized notes during the experiments are also very valuable. During the initial development, Flywheel partnered with Fernando Perez and the Jupyter (then IPython) team to implement tools that built on shared software. Flywheel continues to find ways to support these tools. Flywheel tools permit users to link their data to published papers, write documentation about projects and sessions, and add notes. This documentation is part of the searchable database, and Flywheel will continue to support users in incorporating clean and thorough documentation.



Flywheel Delivers Data Management

Persistently storing data is the critical first step in planning for reproducible science. Defining file formats and organizing directories is a good start; in our experience, this is where most researchers focus their efforts. But modern computer science provides many technologies that improve data storage, making data FAIR, i.e., findable, accessible, interoperable, and reusable (see Flywheel Delivers FAIR). Flywheel uses these tools to support reproducible science.

Metadata are important

The value of raw data, for example the numerical data of an image, is vastly increased when we know more about the data. This information - called the metadata - can tell us many important things: the instrument parameters used to acquire the data, information about the subject (demographics, medical conditions, etc.), time and place of the acquisition, and facts about the experimental context; for example, that the subject fell asleep during the resting state MR scan.  

The biomedical imaging community recognizes the importance of metadata in two important ways. First, standard file formats (DICOM, NIfTI) embed metadata into the file header. Second, the BIDS system stores useful metadata in the file name or in an accompanying ‘sidecar’ file.

Storing metadata within a file header, or an accompanying file, is a good start. But using an extensible database offers many advantages. Here is why:

Databases are efficient

Nearly all modern computer operating systems use databases to store files and their metadata. For example, on Apple systems the Get Info command (Cmd-I) returns metadata (‘Info’) about the file from the operating system’s database (comments, preview, kind of file) as well as standard POSIX information like file size and date of access. Apple’s Spotlight search uses the database to identify files.

There are many advantages to storing information about a file in a database compared to putting the information in the file header or an accompanying file. For example, we have seen many cases in which people fail to keep the two files together, and sometimes they rename one of the files and lose the association between the data and metadata files. Putting the information in the file header avoids these problems but has others. Files are distributed across the disk, making searches through file headers very inefficient. Also, files arise from many different sources, and it is virtually impossible to guarantee that vendors keep up to date with format changes. Headers are most useful for a particular type of file, but not for a large system.

Databases solve these problems by having the user interact with files through a unified interface that includes the name of the raw data file on disk as well as the associated metadata. To read the raw data, one consults the database for the location of the file containing the raw data. To read the metadata, one consults only the database. Typically, the database itself is small, and updates to its format or additions to its content are possible. 

Flywheel uses a document database (MongoDB) to manage user interactions with data and metadata. In the Flywheel system, you can read metadata via the web-browser interface. When programming, you can access metadata using the software development kits (SDKs) or REST API. 
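
For example, reading metadata through the Python SDK might look like the following sketch. The project path is hypothetical, and this assumes credentials configured for your site.

    import flywheel

    fw = flywheel.Client()
    session = fw.lookup("mygroup/myproject/subj-01/session-01")

    # Standard fields live directly on the container...
    print(session.label, session.timestamp)
    # ...while custom, user-defined metadata lives in the `info` dictionary.
    print(session.info)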

Metadata can be attached to any object in the system hierarchy

The Flywheel data are organized in a hierarchy: Group, Project, Subject, Session, Acquisition, Files and Analyses. This hierarchy can incorporate virtually any file type and associated metadata. Most of our customers store files containing medical imaging data in the hierarchy, including MRI, PET, CT, OCT, and pathology images.  But some customers store other types of files, such as computer graphics files that are useful for machine learning. All of the objects, the files and the organizational containers (Project, Subject, Session, Acquisition, Analyses) are described in the database, each with its own metadata. Users can search, annotate and reuse the files and containers from any level in the Flywheel system.

Metadata are flexible

By using a general database, Flywheel can be complete and flexible. For MRI DICOM files, the database includes all of the header information in the file, such as TR, TE, voxel size, and diffusion directions. In addition, the Flywheel database includes fields for users to place searchable notes, say, about the experiment. The database can also include links to additional experimental information about the subject and auxiliary measures (often behavioral data).

The Flywheel database can add fields without needing to rebuild the entire database. For example, as new MRI technologies developed, we were able to add additional fields that describe the new acquisition parameters. Similarly, Flywheel regularly expands to manage new types of data; as we do so, we add new database fields.
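
Adding such a field through the Python SDK might look like this sketch; `update_info` merges new, searchable keys into a container's custom metadata, and the field names here are illustrative.

    import flywheel

    fw = flywheel.Client()
    session = fw.lookup("mygroup/myproject/subj-01/session-01")

    # Record an experimental note, e.g., that the subject fell asleep
    # during the resting-state scan.
    session.update_info({
        "experiment": {"task": "resting-state", "subject_fell_asleep": True}
    })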

Data reuse

Flywheel helps users to reuse data by (a) helping them find data sets and (b) using the search results to create a new project in their database. Adding a database entry eliminates the need for data copying - we simply copy database entries to specify the new project’s sessions, acquisitions, and files.  Flywheel calls such a virtual project a 'Collection'. 

Reproducible science 

Data management and the ability to search across all types of objects enhance the value of the data. Carefully storing and managing metadata supports finding and reusing data, two pillars of FAIR and reproducible research.

Contact us here for a demonstration to see how Flywheel’s database and further computing features can be the backbone of your research.