With great data come great pipelines:

class: center, middle, inverse, title-slide

# With great data come great pipelines:
## Creating flexible standardized pipelines for common<br/>biomedical analysis tasks using Snakemake
### Michael Geuenich
### Lunenfeld Tanenbaum Research Institute
### 2022/02/24

---

class: black-out

# Biomedical research questions

My work is focussed on cancer-immune relationship
--
.pull-left[
<img src="figures/cancer-t-cells.png" width="500" height="330" />
]
--
.pull-right[
## Specific biomedical questions of interest
   - How many immune cells are present in a cancer biopsy?
   - What states are the cancer and immune cells in?
   - What type of variability exists in cancer cells upon perturbation (e.g. chemotherapy treatment)?
]

---

# Exponential increase in sequencing available

.pull-left[
<img src="figures/falling-fast-nature.png" width="500" />
```
Source: Nature.com
```
]

???
Moores law = # of transistors on an integrated circuit doubled every other year
--
.pull-right[
### Publicly deposited sequence data (TB)
<img src="figures/Sra_growth.png" width="700" />
```
Source: ncbiinsights
```
]

### Datasets of this magnitude require robust processing and analysis pipelines

---

# How not to do reproducible research
.oneThirdsPullLeft[
<img src="figures/CR-robot.png" width="300" />
```
Handling robot. Source: protocols.io
```
]

.twoThirdsPullRight[
Obtained ~1,000 sequencing samples from diverse cancer models using this robot.
{{content}}
]
--

- Submitting data cleaning jobs manually.

--
```
    for f in data/exp1/*; do sbatch --wrap="command param1 ..."; done
```
  
{{content}}
???
  - Five months later you will ask "What parameters did I use again?"
--

-  Re-write code and commands for each new experiment

- Rscripts/functions that accomplished individual tasks, but were not linked to the project

### Other issues with this approach

- Changes in upstream steps (e.g. data processing) do not trigger a re-run of subsequent steps
- Code reusability across experiments was poor
{{content}}

---

# General desirable characteristics of a pipeline

- Reproducible

- Setting a seed
  - Relative paths

--
- Only write your code once

- E.g. re-use the same clustering script changing just the clustering parameters

--
- Tie scripts into a single executable workflow

- Create a results report at the end
--

- Automatically re-run workflow sections if there are upstream changes
```
      Download data -> cluster data -> create figure
```

- Re-use the same pipeline on different datasets

- Could be a complex data processing pipeline
--
    - Challenge: 99% of the code remains the same, only parameters change

???
I imagine this list holds quite well outside the biomedical domain too

---

# Concrete biomedical example in single-cell RNA-Sequencing

Single-cell RNA-Sequencing: Sequence individual cells and understand their heterogeneity

--
   - How many immune cells are present in a cancer biopsy?
   - What states are the cancer and immune cells in?
   - What type of variability exists in cancer cells upon perturbation (e.g. chemotherapy treatment)?

.pull-left[
### Concrete biomedical example in single-cell RNA-Sequencing

1. Filter out dead and damaged cells

- Based on the number of overall reads and percentage of reads from mitochondrial genes
  - Threshods are very sample dependent: tumour samples have higher % of mitochondrial reads
  
]

???

Dead and damaged cells depend on sample preparation.

Because of technical reasons, dead/damaged cells have a higher proportion of dead cells

--
.pull-right[
<img src="figures/mito-detected.png" width="357" height="350" />
]

---

# General analysis pipeline steps

.pull-left[
1. Filter out dead/damaged cells
2. Normalize data
{{content}}
]
--
3. Cluster cells
{{content}}
--
4. Assign each cluster to cell types based on the expression of certain markers
{{content}}
--
  - Some clusters are impure and contain multiple cell types
{{content}}
--
  - Analysis can be aided by creating scatterplots of specific markers
{{content}}
--

## Data dependent parameters
1. number of detected reads & % mitochondrial genes
2. clusters that need re-clustering
3. markers to correlate
{{content}}

---

# General analysis pipeline steps

.pull-left[
1. Filter out dead/damaged cells
2. Normalize data
3. Cluster cells
4. Assign each cluster to cell types based on the expression of certain markers
  - Some clusters are impure and contain multiple cell types
  - Analysis can be aided by creating scatterplots of specific markers

## Data dependent parameters
1. number of detected reads & % mitochondrial genes
2. clusters that need re-clustering
3. markers to correlate
]

.pipelineRequirementsBox[
### Summary of requirements
- Reproducible

- Create a single executable workflow

- Re-run the same script over different samples/parameters

- Automatically re-run if there are upstream changes

- Reuse the same pipeline despite data-dependent parameter diffences between datasets
]

---

# Snakemake introduction

**Snakemake is a Python based workflow manager to create reproducible and scalable analyses**

Snakemake is based on rules: 
- Each analysis step can be implemented as a human readable snippet
--

- Workflows can run on local, cluster or cloud environments without needing to modify the workflow

```
rule plot_amazing_figure:
  input:
    data = 'output/data/my-processed-data.csv'
  output:
    figure = 'figures/my-amazing-figure.pdf'
  script:
    'scripts/make-my-figure.R'
```

- Snippets can be tied together to create a complete workflow

???
- Input
- Output
- What generates it

Can list as many input/output files as I like

---
# Snakemake introduction
.pull-left[
```
rule all:
  input:
    'output/data/my-processed-data.csv',
    'figures/my-amazing-figure.pdf'

rule plot_amazing_figure:
  input:
    data = 'output/data/my-processed-data.csv'
  output:
    figure = 'figures/my-amazing-figure.pdf'
  script:
    'scripts/make-my-figure.R'

rule download_my_data:
  params:
    website: 'www.mywebsite.com/my_csvfile.csv'
  output:
    csv = 'output/data/my-processed-data.csv'
  shell:
    'curl {params.website} -o {output.csv}'
```
]

--
.pull-right[
<img src="figures/rules.png" width="316" height="400" />
]

---

# Executing the same script over multiple samples/parameters

```
rule all:
  input:
    expand('figures/my-amazing-figure-{samples_list}.pdf', samples_list = ['s1', 's2', 's3'])
```

--
```
rule plot_amazing_figure:
  input:
    data = 'output/data/my-processed-data-{samples_list}.csv'
  output:
    figure = 'figures/my-amazing-figure-{samples_list}.pdf'
  script:
    'scripts/make-my-figure.R'

rule download_my_data:
  params:
    website: 'www.mywebsite.com/my_csvfile-{samples_list}.csv'
  output:
    csv = 'output/data/my-processed-data-{samples_list}.csv'
  shell:
    'curl {params.website} -o {output.csv}'
```

???

Snakemake will also re-run the sample if there have been changes upstream.

Example, say I download my data again, then plotting would run again too

If the output of a rule is not specified in rule all, the rule itself will not cause any errors but will also not run

---

# Configuration & conditional rule execution

- A configuration file can easily be added to each Snakefile

```
    configfile: 'config.yaml'
```

Good place to keep:
- File paths to data/software binaries
- Data filtering parameters

## Conditional rule execution

.pull-left[
```
checkpoint gene_expression_in_clusters:
    input:
        "output/{cluster}.txt"
    params:
        config['cluster_parameter']
    output:
        "figures/gene1_expression_in_{cluster}.png"
    script:
        "plotGeneExpression.R"
```
]
--

.pull-right[
- Useful when the output is data-dependent and can only be determined after other parts of the workflow have completed

- E.g. Plotting the expression of a gene in all clusters
{{content}}
]
--
- **Does not work when subsequent results depend on _manual_ inspection of plots**

---
class: conditionalRule-code
# Conditional rule execution
How can we build pipelines that can be reused (e.g. to process a dataset) when steps are 
1. data dependent and 
2. require manual inspection of results to determine next steps

.oneThirdsPullLeft[
<img src="figures/mito-detected.png" width="357" height="350" />
]

```
configfile: 'my_config.yaml'

my_results_dict = {
  'downloaded_data': 'output/data/downloaded-data.csv'
}
```
--
```

if config['dead_cells'] is not None:
  my_results_dict['my_figure'] = 'processed-data.csv'
```
--
```

rule all:
  input:
    my_results_dict.values()
```
--
```

rule process_data:
  input: 
...
```

---

# Create an HTML report

```
snakemake --report my-amazing-report.html
```
--

---

# Snakemake run statistics

---

# Snakemake run config
.center[
<img src="figures/Snakemake-report-config.png" width="800" />
]

---

# Rule outputs

.center[
<img src="figures/Snakemake-report-rules.png" width="1533" />
]

---
# Summary

- Reproducible

- Snakemake pipelines can be shared via github

- Create a single executable workflow

- Snakemake ties together individual rules

- Re-run the same script over different samples/parameters

- Can be accomplished with the `expand()` function

- Automatically re-run if there are upstream changes

- Automatically done by Snakemake if files are changed

- Reuse the same pipeline despite data-dependent parameter differences between datasets

- Checkpoints can be used if no human intervention is required
  - If/else statements can be used to add outputs to `rule all` when parameters must be determined manually

---
# Resources

Snakemake has support for:
- Docker containers
- Conda environments
- Wrappers for common tools
- Log outputs
- HPC specifications (e.g. # of threads or memory)

Snakemake tutorial: https://snakemake.readthedocs.io/en/stable/tutorial/tutorial.html

Snakemake documentation: https://snakemake.readthedocs.io/en/stable/index.html

Slides: https://michael-geuenich.github.io/2022_Toronto_workshop_reproducibility/

Contact: mgeuenich@lunenfeld.ca