Flywheel leverages Docker heavily for software distribution and for sharing and executing algorithms. I had the pleasure of attending DockerCon ’17, and there are two presentations I’d like to highlight. One validates that containers are an accepted way to achieve shared data processing goals. The other relates to progress on making the Docker image distribution story more consistent in China.
Cool Genes: The Search for a Cure Using Genomics, Big Data and Docker
James Lowey, CIO at the Translational Genomics Research Institute (TGEN), presented the system they designed to handle the genetic data they process, with the goal of providing more effective treatments for patients.
One of James’ starting slides reminds me of one we use: a ceiling-high pyramid of storage media that looks like it is about to topple. Everyone agrees there is value in these troves of unmanaged data. In many cases, though, the cost of using them is too high due to:
- Low confidence that the data of interest can be found, and that its contents match our memory.
- Loss of institutional knowledge of what data is available, or how to access it.
- Effort to retrieve a small bit of data across the whole set for broad analysis.
- Changing standards over time for file formats, organization, compression.
Once you have data management in place, you can leverage the Docker ecosystem for the benefit of healthcare and research. Specifically, TGEN has developed a number of data processing pipelines and has constructed a system to execute them.
The existing ecosystem of Docker orchestration and cluster solutions and patterns means that TGEN and other institutions can invest less in software engineering and more in new ways to analyze genetic data to improve patient outcomes.
Docker images help ensure that execution environments match development and test environments. In the case of TGEN, it is easy to imagine how this creates confidence that the treatment prescribed will not be compromised by such differences. This is one of the core reasons Flywheel has used containerization technology from the very start in the pursuit of Reproducible Research.
How do you bootstrap a collaboration network for data scientists to share not just ideas, but data conversion and analysis building blocks? Similar to the automation story, the Docker platform handles many of the packaging/distribution/execution concerns. Now the primary concern becomes establishing a standard way to represent inputs, outputs, execution semantics, and domain-specific variables. Once that is in place, others can contribute new tools that can easily be executed by your data execution engine.
Flywheel has an open specification fitting this mold, Flywheel Gears (https://github.com/flywheel-io/gears), and manages the Flywheel Exchange (https://github.com/flywheel-io/exchange), where contributors can publish their gears for use across Flywheel environments.
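To make the idea of a standard tool description concrete, here is a minimal sketch in Python. It is not the Flywheel Gears specification. The `ToolManifest` fields, the `/input` and `/output` mount points, and the choice to pass settings as environment variables are all assumptions made for illustration. It shows how an execution engine might check a contributed tool's description and turn it into a `docker run` invocation.

```python
from dataclasses import dataclass, field


@dataclass
class ToolManifest:
    """Hypothetical description of a containerized analysis tool."""
    name: str
    image: str                                   # Docker image reference to run
    inputs: dict = field(default_factory=dict)   # input name -> expected file type
    outputs: list = field(default_factory=list)  # names of files the tool writes
    config: dict = field(default_factory=dict)   # domain-specific variables and defaults


def validate(manifest):
    """Return a list of problems; an empty list means the manifest is usable."""
    problems = []
    if not manifest.name:
        problems.append("name is required")
    if not manifest.image:
        problems.append("image is required")
    if not manifest.inputs:
        problems.append("at least one input is required")
    return problems


def docker_command(manifest, input_dir, output_dir, overrides=None):
    """Build the argument list an engine could pass to `docker run`.

    Assumed convention: inputs are mounted read-only at /input, results are
    written to /output, and settings arrive as upper-cased environment variables.
    """
    settings = {**manifest.config, **(overrides or {})}
    cmd = ["docker", "run", "--rm",
           "-v", f"{input_dir}:/input:ro",
           "-v", f"{output_dir}:/output"]
    for key in sorted(settings):
        cmd += ["-e", f"{key.upper()}={settings[key]}"]
    cmd.append(manifest.image)
    return cmd
```

With a description like this in place, the engine only needs to know the convention, not the tool. A contributor supplies an image and a manifest, and the same `docker_command` call works for every tool that follows the (here, made-up) contract.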
Docker in China
Docker Hub is still coming to China! I had been concerned with the silence on this front since the initial partnership with Alibaba Cloud was announced last October. As a stakeholder in Flywheel’s software distribution strategy, I am excited at the prospect of unifying our process, which would lower complexity and risk and lead to higher customer satisfaction.
The project to offer Docker Hub in China is nearing completion, with availability expected this summer. The free service will be limited to public Docker Hub repositories, replicated from the existing Docker Hub. Two details were missing: 1) how separate this China Docker Hub would be, and 2) whether there would be additional hurdles for Docker image authors, publishers, or consumers.
JFrog reps said they would be offering a private Docker registry service within China that will not require a Mainland China business entity. I’m treating that lowered bar to entry with some skepticism. If JFrog can pull that off and make it easy to use, it will be something I recommend to colleagues.