Reproducible! Scalable! Portable! Analysis with Snakemake - Part 3

This post continues from the one above!

This time we'll look at parallelization.
Using the example from last time, let's experiment to see how much difference running in parallel actually makes.

Note: the command execution logs are a bit long. Feel free to skim past them.

The meaning of threads in Snakemake

Here's the bwa rule we wrote last time ↓

    threads: 8
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

Looking at this, the rule appears to say "I'll use 8 threads," but to be precise it means "I can use at most 8 threads."

In other words, if the number of CPU cores given on the command line (--cores) is lower than the value specified here, only that many cores are used.
The same goes for when the number of logical CPUs is lower than the specified value.
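In effect, the thread count a job actually gets is the minimum of the rule's threads value and the cores provided. A minimal Python sketch of this scaling rule (the function is hypothetical, for illustration only, not Snakemake's actual code):

```python
def effective_threads(rule_threads: int, provided_cores: int) -> int:
    """Threads a job actually receives: rules claiming more threads
    than --cores provides are scaled down to the available budget."""
    return min(rule_threads, provided_cores)

# threads: 8 in the rule, but only --cores 1 provided -> runs with 1 thread
print(effective_threads(8, 1))   # 1
# With --cores 12, the full claim of 8 threads is granted
print(effective_threads(8, 12))  # 8
```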

Running serially

First, let's try running the workflow completely serially.

If you have run it before, the result files will still exist, so delete them first.

$ rm -rf mapped calls plots 

Since we want serial execution, pass 1 to the --cores argument.
We also want to measure elapsed time and CPU usage, so we prefix the command with the GNU version of time.

Reference: the time command comes in two flavors, the shell builtin and the GNU version

$ /usr/bin/time snakemake --use-conda --cores 1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        2       bwa
        1       call
        2       sort
        1       stats
        7

[Sun Dec  8 19:46:27 2019]
rule bwa:
    input: data/genome.fa, data/samples/B.fastq
    output: mapped/B.bam
    jobid: 6
    wildcards: sample=B

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 25000 sequences (2525000 bp)...
[M::mem_process_seqs] Processed 25000 reads in 1.335 CPU sec, 1.348 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 1 data/genome.fa data/samples/B.fastq
[main] Real time: 1.608 sec; CPU: 1.444 sec
[Sun Dec  8 19:46:30 2019]
Finished job 6.
1 of 7 steps (14%) done

[Sun Dec  8 19:46:30 2019]
rule sort:
    input: mapped/B.bam
    output: mapped/B.sorted.bam
    jobid: 4
    wildcards: sample=B

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
Removing temporary output file mapped/B.bam.
[Sun Dec  8 19:46:31 2019]
Finished job 4.
2 of 7 steps (29%) done

[Sun Dec  8 19:46:31 2019]
rule bwa:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped/A.bam
    jobid: 5
    wildcards: sample=A

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 25000 sequences (2525000 bp)...
[M::mem_process_seqs] Processed 25000 reads in 1.246 CPU sec, 1.258 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 1 data/genome.fa data/samples/A.fastq
[main] Real time: 1.493 sec; CPU: 1.354 sec
[Sun Dec  8 19:46:34 2019]
Finished job 5.
3 of 7 steps (43%) done

[Sun Dec  8 19:46:34 2019]
rule sort:
    input: mapped/A.bam
    output: mapped/A.sorted.bam
    jobid: 3
    wildcards: sample=A

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
Removing temporary output file mapped/A.bam.
[Sun Dec  8 19:46:35 2019]
Finished job 3.
4 of 7 steps (57%) done

[Sun Dec  8 19:46:35 2019]
rule call:
    input: data/genome.fa, mapped/A.sorted.bam, mapped/B.sorted.bam
    output: calls/all.vcf
    jobid: 1

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/288ca49b
[warning] samtools mpileup option `g` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[mpileup] 2 samples in 2 input files
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[Sun Dec  8 19:46:37 2019]
Finished job 1.
5 of 7 steps (71%) done

[Sun Dec  8 19:46:37 2019]
rule stats:
    input: calls/all.vcf
    output: plots/quals.svg
    jobid: 2

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/e123f192
[Sun Dec  8 19:46:38 2019]
Finished job 2.
6 of 7 steps (86%) done

[Sun Dec  8 19:46:38 2019]
localrule all:
    input: calls/all.vcf, plots/quals.svg
    jobid: 0

[Sun Dec  8 19:46:38 2019]
Finished job 0.
7 of 7 steps (100%) done
Complete log: /home/kimoton/snakemake-demo/.snakemake/log/2019-12-08T194627.149644.snakemake.log
12.27user 3.69system 0:12.59elapsed 126%CPU (0avgtext+0avgdata 78868maxresident)k
0inputs+17928outputs (0major+271997minor)pagefaults 0swaps

The execution log shows the flow

B mapping → B sort → A mapping → A sort
→ variant calling with A and B as input → plot

which means the workflow ran serially. From the time command output:

Elapsed time    CPU usage
12.59s          126%

Since the whole thing finishes quickly even serially, the prevailing theory is that you can't really feel the benefit of parallelization here.

Running in parallel (going all-in on bwa parallelization)

First, let's check the number of logical CPUs on this machine.

Reference: checking the number of physical CPUs, CPU cores, and logical CPUs

$ grep processor /proc/cpuinfo | wc -l
12

My environment has 12 cores.
Let's make full use of them and run the same pipeline.
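As an aside, grepping /proc/cpuinfo is Linux-specific; on a system with GNU coreutils, nproc gives the same count more directly (a sketch assuming a Linux environment):

```shell
# Three ways to get the logical CPU count on Linux:
grep -c processor /proc/cpuinfo    # count processor entries (Linux only)
nproc                              # GNU coreutils
getconf _NPROCESSORS_ONLN          # number of processors currently online
```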

As before, delete the result files.

$ rm -rf mapped calls plots 

Edit the Snakefile so that the bwa run uses all 12 cores.

rule bwa:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        temp("mapped/{sample}.bam")
    conda:
        "envs/mapping.yaml"
    threads: 12
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

Specify --cores 12 and off we go!

$ /usr/bin/time snakemake --use-conda --cores 12
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 12
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        2       bwa
        1       call
        2       sort
        1       stats
        7

[Sun Dec  8 20:05:10 2019]
rule bwa:
    input: data/genome.fa, data/samples/B.fastq
    output: mapped/B.bam
    jobid: 6
    wildcards: sample=B
    threads: 12

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 25000 sequences (2525000 bp)...
[M::mem_process_seqs] Processed 25000 reads in 4.439 CPU sec, 0.453 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 12 data/genome.fa data/samples/B.fastq
[main] Real time: 0.641 sec; CPU: 4.519 sec
[Sun Dec  8 20:05:12 2019]
Finished job 6.
1 of 7 steps (14%) done

[Sun Dec  8 20:05:12 2019]
rule sort:
    input: mapped/B.bam
    output: mapped/B.sorted.bam
    jobid: 4
    wildcards: sample=B

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
Removing temporary output file mapped/B.bam.
[Sun Dec  8 20:05:14 2019]
Finished job 4.
2 of 7 steps (29%) done

[Sun Dec  8 20:05:14 2019]
rule bwa:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped/A.bam
    jobid: 5
    wildcards: sample=A
    threads: 12

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 25000 sequences (2525000 bp)...
[M::mem_process_seqs] Processed 25000 reads in 4.523 CPU sec, 0.478 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 12 data/genome.fa data/samples/A.fastq
[main] Real time: 0.732 sec; CPU: 4.626 sec
[Sun Dec  8 20:05:15 2019]
Finished job 5.
3 of 7 steps (43%) done

[Sun Dec  8 20:05:15 2019]
rule sort:
    input: mapped/A.bam
    output: mapped/A.sorted.bam
    jobid: 3
    wildcards: sample=A

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
Removing temporary output file mapped/A.bam.
[Sun Dec  8 20:05:16 2019]
Finished job 3.
4 of 7 steps (57%) done

[Sun Dec  8 20:05:16 2019]
rule call:
    input: data/genome.fa, mapped/A.sorted.bam, mapped/B.sorted.bam
    output: calls/all.vcf
    jobid: 1

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/288ca49b
[warning] samtools mpileup option `g` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[mpileup] 2 samples in 2 input files
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[Sun Dec  8 20:05:18 2019]
Finished job 1.
5 of 7 steps (71%) done

[Sun Dec  8 20:05:18 2019]
rule stats:
    input: calls/all.vcf
    output: plots/quals.svg
    jobid: 2

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/e123f192
[Sun Dec  8 20:05:20 2019]
Finished job 2.
6 of 7 steps (86%) done

[Sun Dec  8 20:05:20 2019]
localrule all:
    input: calls/all.vcf, plots/quals.svg
    jobid: 0

[Sun Dec  8 20:05:20 2019]
Finished job 0.
7 of 7 steps (100%) done
Complete log: /home/kimoton/snakemake-demo/.snakemake/log/2019-12-08T200510.793223.snakemake.log
17.77user 3.82system 0:10.25elapsed 210%CPU (0avgtext+0avgdata 76884maxresident)k
0inputs+17928outputs (0major+273156minor)pagefaults 0swaps

As before, the workflow itself ran serially:

B mapping → B sort → A mapping → A sort
→ variant calling with A and B as input → plot

as the execution log shows.
This time, though, bwa itself is maximally parallelized, so it should be faster than before.

The results!

Elapsed time    CPU usage
10.25s          210%

Huh, that's not much of a speedup...
And the CPUs are barely being used...
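This is Amdahl's law at work: only the bwa step was parallelized, and the serial parts (sort, call, plot, conda activation) cap the overall speedup. A rough back-of-the-envelope sketch in Python, using the ~3.1 s the two bwa jobs took out of the 12.59 s serial run (the function name and numbers are illustrative):

```python
def amdahl_speedup(parallel_fraction: float, n_workers: int) -> float:
    """Overall speedup when only a fraction of the work is parallelized."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_workers)

# From the serial log: the two bwa jobs took ~3.1 s of the 12.59 s total,
# so only about a quarter of the run was actually parallelizable.
p = 3.1 / 12.59
print(round(amdahl_speedup(p, 12), 2))  # 1.29 -> ~1.29x speedup at best
# Observed: 12.59 s -> 10.25 s, i.e. ~1.23x, close to that ceiling.
```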

Running in parallel (processing A and B in parallel)

To process A and B in parallel as much as possible, we should allocate 6 cores to each.
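Snakemake only starts a job when its threads claim fits within the remaining --cores budget, so two jobs claiming 12 threads each can never overlap, while two claiming 6 each fit side by side. A simplified Python sketch of that scheduling idea (not Snakemake's actual scheduler):

```python
def runnable_now(job_threads: list[int], total_cores: int) -> list[int]:
    """Greedily pick jobs whose combined thread claims fit the core budget."""
    running, used = [], 0
    for t in job_threads:
        if used + t <= total_cores:
            running.append(t)
            used += t
    return running

# Two bwa jobs on a 12-core budget:
print(runnable_now([12, 12], 12))  # [12]   -> jobs must run one after another
print(runnable_now([6, 6], 12))    # [6, 6] -> A and B run in parallel
```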

Change the number of cores the bwa rule uses to 6.

rule bwa:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        temp("mapped/{sample}.bam")
    conda:
        "envs/mapping.yaml"
    threads: 6
    shell:
        "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"

Specify --cores 12 and off we go!

$ /usr/bin/time snakemake --use-conda --cores 12
Building DAG of jobs...
Nothing to be done.
Complete log: /home/kimoton/snakemake-demo/.snakemake/log/2019-12-08T202007.448808.snakemake.log
0.48user 0.05system 0:00.97elapsed 56%CPU (0avgtext+0avgdata 25240maxresident)k
0inputs+24outputs (0major+8001minor)pagefaults 0swaps

Oops, I forgot to delete the previous results, so nothing ran. Delete them and try again:

$ rm -rf mapped calls plots
$ /usr/bin/time snakemake --use-conda --cores 12
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 12
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       all
        2       bwa
        1       call
        2       sort
        1       stats
        7

[Sun Dec  8 20:20:12 2019]
rule bwa:
    input: data/genome.fa, data/samples/B.fastq
    output: mapped/B.bam
    jobid: 6
    wildcards: sample=B
    threads: 6


[Sun Dec  8 20:20:12 2019]
rule bwa:
    input: data/genome.fa, data/samples/A.fastq
    output: mapped/A.bam
    jobid: 5
    wildcards: sample=A
    threads: 6

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::process] read 25000 sequences (2525000 bp)...
[M::process] read 25000 sequences (2525000 bp)...
[M::mem_process_seqs] Processed 25000 reads in 4.491 CPU sec, 0.836 real sec
[M::mem_process_seqs] Processed 25000 reads in 4.429 CPU sec, 0.828 real sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 6 data/genome.fa data/samples/B.fastq
[main] Real time: 1.122 sec; CPU: 4.608 sec
[Sun Dec  8 20:20:15 2019]
Finished job 6.
1 of 7 steps (14%) done

[Sun Dec  8 20:20:15 2019]
rule sort:
    input: mapped/B.bam
    output: mapped/B.sorted.bam
    jobid: 4
    wildcards: sample=B

[main] Version: 0.7.17-r1188
[main] CMD: bwa mem -t 6 data/genome.fa data/samples/A.fastq
[main] Real time: 1.124 sec; CPU: 4.551 sec
[Sun Dec  8 20:20:15 2019]
Finished job 5.
2 of 7 steps (29%) done

[Sun Dec  8 20:20:15 2019]
rule sort:
    input: mapped/A.bam
    output: mapped/A.sorted.bam
    jobid: 3
    wildcards: sample=A

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607
Removing temporary output file mapped/A.bam.
[Sun Dec  8 20:20:16 2019]
Finished job 3.
3 of 7 steps (43%) done
Removing temporary output file mapped/B.bam.
[Sun Dec  8 20:20:16 2019]
Finished job 4.
4 of 7 steps (57%) done

[Sun Dec  8 20:20:16 2019]
rule call:
    input: data/genome.fa, mapped/A.sorted.bam, mapped/B.sorted.bam
    output: calls/all.vcf
    jobid: 1

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/288ca49b
[warning] samtools mpileup option `g` is functional, but deprecated. Please switch to using bcftools mpileup in future.
[mpileup] 2 samples in 2 input files
Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid
[Sun Dec  8 20:20:18 2019]
Finished job 1.
5 of 7 steps (71%) done

[Sun Dec  8 20:20:18 2019]
rule stats:
    input: calls/all.vcf
    output: plots/quals.svg
    jobid: 2

Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/e123f192
[Sun Dec  8 20:20:19 2019]
Finished job 2.
6 of 7 steps (86%) done

[Sun Dec  8 20:20:19 2019]
localrule all:
    input: calls/all.vcf, plots/quals.svg
    jobid: 0

[Sun Dec  8 20:20:19 2019]
Finished job 0.
7 of 7 steps (100%) done
Complete log: /home/kimoton/snakemake-demo/.snakemake/log/2019-12-08T202012.421181.snakemake.log
18.99user 3.87system 0:08.17elapsed 279%CPU (0avgtext+0avgdata 78760maxresident)k
0inputs+17952outputs (0major+303308minor)pagefaults 0swaps

This time the workflow itself is parallelized:

B mapping / A mapping → B sort / A sort
→ variant calling with A and B as input → plot

a beautiful flow in which even the sort steps run in parallel.

The results!

Elapsed time    CPU usage
8.17s           279%

Nice, that's faster.

Summary

  • Snakemake's threads directive specifies the maximum number of threads a rule may use.
  • Parallelize across samples where possible (with exceptions).
  • The example was a poor choice; a longer-running job would have shown the benefit better.