The Action Collaborative on Neuroscience Data in the Cloud, part of the National Academies of Sciences, Engineering, and Medicine, is working on a document to guide researchers conducting cloud-based research. The “Hitchhiker’s Guide to Using Cloud-Based Resources for Neuroimaging Research” is open for public comment and feedback from now until January 11th, 2021.
Whether you’re a freshly minted Assistant Professor who wants to use the latest methods, or a seasoned P.I. with a history of grant awards and publications who wants to keep up with and share best practices for reproducible research, the report is a resource for you. It was developed to help investigators and administrators at all levels of experience understand, access, and successfully use cloud-based tools in neuroscience research.
The guide provides best practices and links to resources for everything you need to consider, including costs, privacy, security, data size/complexity/scope, access to computational resources and expertise, cloud-compliant tools and analysis pipelines, and sharing data. After all, you want your data to be FAIR (Findable, Accessible, Interoperable, and Reusable), right?
Key Takeaways to Get Started
Some of the insights in the report that are particularly valuable include:
- The Hidden Costs of Cloud Computing – Some of the costs associated with cloud computing are unexpected. Consider “hidden” costs such as long-running computational jobs, ingress/egress fees, and inefficient compute management.
- Storing Data So It Can Be Queried – It is important to structure large and multimodal data so it can be explored. Attaching metadata allows the data to be accessed programmatically and makes it more intuitive for anyone interacting with it. Ideally, this metadata should be generated automatically by processing pipelines.
- Start with Getting Data Organized – Data organization should be built into pipelines from the start instead of saved for a later stage.
  - Raw data should be distinguished from derived products, saved with read-only permissions, and shouldn’t be duplicated for multiple researchers. A raw data repository can help support access controls.
  - Consistent naming conventions like BIDS can also help make projects widely shareable.
- Cutting Down on Data Copies – Being able to explore data without downloading it reduces the number of duplicate copies that need to be stored and tracked.
- De-Identification and Privacy – A lot of participant information is captured via DICOM files, including birthdates, embedded text and even facial structure. Look for a way to de-identify DICOM tags and other multimodal data.
- The Hidden Cost of Curation – Curating and organizing your data to comply with IRB requirements and data standards takes time from team members over the months or years of your study.
- Allocating Compute Costs – Making individual labs and teams responsible for their own computing costs helps them learn more about cloud computing and make smart decisions about resource consumption.
- Software Pipelines That Scale – Researchers can save time and cost by using existing published pipelines, such as containerized software packages. Docker is a helpful tool for developing containerized analytical programs that can then be scaled in parallel.
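To make the takeaways about queryable metadata and consistent naming a little more concrete, here is a minimal Python sketch. It assumes files follow a BIDS-style pattern of underscore-separated `key-value` pairs followed by a suffix. The helper names `parse_bids_entities` and `query` and the file listing are hypothetical, made up for illustration; they are not taken from the guide or from any particular tool.

```python
from pathlib import PurePosixPath


def parse_bids_entities(filename: str) -> dict:
    """Split a BIDS-style filename into its key-value entities and suffix.

    Assumes every underscore-separated chunk except the last is "key-value";
    a chunk without a hyphen raises ValueError in this simple sketch.
    """
    name = PurePosixPath(filename).name
    stem, _, extension = name.partition(".")
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    entities["suffix"] = suffix
    entities["extension"] = "." + extension if extension else ""
    return entities


def query(filenames, **criteria):
    """Return the filenames whose entities match every key=value criterion."""
    return [
        f for f in filenames
        if all(parse_bids_entities(f).get(k) == v for k, v in criteria.items())
    ]


# Hypothetical file listing, for illustration only.
files = [
    "sub-01/ses-01/anat/sub-01_ses-01_T1w.nii.gz",
    "sub-01/ses-01/func/sub-01_ses-01_task-rest_bold.nii.gz",
    "sub-02/ses-01/func/sub-02_ses-01_task-rest_bold.nii.gz",
]

# Find every resting-state functional run without opening or downloading a file.
resting = query(files, task="rest", suffix="bold")
print(resting)
```

Because the metadata lives in the name itself, the same listing can be filtered by subject, session, or task. In practice a dedicated library or a data management platform would likely handle this for you, along with richer metadata than a filename can carry.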
If you are curious how Flywheel can save you time with data organization, privacy, security, analysis pipelines and data sharing, reach out to us at email@example.com. Flywheel is cloud-agnostic and supports on-premises computing.
You can find related case studies on the University of California, Irvine’s multi-institutional cloud collaboration and High Performance Computing at the University of Pennsylvania.
Andrew Worth, Ph.D., is a Senior Scientific Solutions Engineer at Flywheel and the Founder and CTO of Neuromorphometrics, which builds a model of the living human brain from MRI scans for “ground truth” comparisons.