CAMP: A modular metagenomics analysis system for integrated multi-step data exploration.
Article
Overview
abstract
MOTIVATION: Computational analysis of large-scale metagenomics sequencing datasets have proven to be both incredibly valuable for extracting isolate-level taxonomic, and functional insights from complex microbial communities. However, due to an ever-expanding ecosystem of metagenomics-specific methods and file-formats, designing seamless and scalable end-to-end workflows, and exploring the massive amounts of output data have become studies unto themselves. One-click bioinformatics pipelines have helped to organize these tools into targeted workflows, but they suffer from general compatibility and maintainability issues, and preclude replication. METHODS: To address the gap in easily extensible yet robustly distributable metagenomics workflows, we have developed a module-based metagenomics analysis system "Core Analysis Modular Pipeline" (CAMP), written in Snakemake, a popular workflow management system, along with a standardized module and working directory architecture. Each module can be run independently or conjointly with a series of others to produce the target data format (e.g. short-read preprocessing alone, or short-read preprocessing followed by de novo assembly), and outputs aggregated summary statistics reports and semi-guided Jupyter notebook-based visualizations. RESULTS: We have applied CAMP to a set of ten metagenomics samples to demonstrate how a modular analysis system with built-in data visualization at intermediate steps facilitates rich and seamless inter-communication between output data from different analytic purposes. AVAILABILITY: The CAMP ecosystem (module template and analysis modules) can be found https://github.com/Meta-CAMP.