前回の続き!
今回は並列化について見ていきます。
前回の例を使って、並列化した場合としてない場合とでどの程度変わるのか実験してみましょう。
※コマンド実行ログが少々長めです。すっ飛ばしてサクッと読んでください。
Snakemakeにおけるthreadsの意味
前回記述したbwa
の部分↓
threads: 8 shell: "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
ここを見ると、このruleは「8thread使用しますよー」って書いてありますが、 正確には、「最大で8thread使用できますよー」って意味になります。
つまり、コマンドライン引数(--core
)で指定するCPU core数がここで指定されていた値より低い場合は、そのcore数しか使用しません。
論理CPU数が指定した値より低い場合も同様です。
直列で実行
ではまず、完全に直列で実行する場合を試してみましょう。
一度実行済みの場合、すでに結果ファイルが存在するはずなので削除します。
$ rm -rf mapped calls plots
直列で実行したいので、--cores
引数で1を指定します。
また、実行時間及びCPUの使用率を測定したいのでGNU版のtime
コマンドを頭に付けます。
参考: timeコマンドにはシェルの組み込みコマンドとGNU版の2種類ある
$ /usr/bin/time snakemake --use-conda --cores 1 (snakemake) Building DAG of jobs... Using shell: /bin/bash Provided cores: 1 (use --cores to define parallelism) Rules claiming more threads will be scaled down. Job counts: count jobs 1 all 2 bwa 1 call 2 sort 1 stats 7 [Sun Dec 8 19:46:27 2019] rule bwa: input: data/genome.fa, data/samples/B.fastq output: mapped/B.bam jobid: 6 wildcards: sample=B Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 25000 sequences (2525000 bp)... [M::mem_process_seqs] Processed 25000 reads in 1.335 CPU sec, 1.348 real sec [main] Version: 0.7.17-r1188 [main] CMD: bwa mem -t 1 data/genome.fa data/samples/B.fastq [main] Real time: 1.608 sec; CPU: 1.444 sec [Sun Dec 8 19:46:30 2019] Finished job 6. 1 of 7 steps (14%) done [Sun Dec 8 19:46:30 2019] rule sort: input: mapped/B.bam output: mapped/B.sorted.bam jobid: 4 wildcards: sample=B Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 Removing temporary output file mapped/B.bam. [Sun Dec 8 19:46:31 2019] Finished job 4. 2 of 7 steps (29%) done [Sun Dec 8 19:46:31 2019] rule bwa: input: data/genome.fa, data/samples/A.fastq output: mapped/A.bam jobid: 5 wildcards: sample=A Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 25000 sequences (2525000 bp)... [M::mem_process_seqs] Processed 25000 reads in 1.246 CPU sec, 1.258 real sec [main] Version: 0.7.17-r1188 [main] CMD: bwa mem -t 1 data/genome.fa data/samples/A.fastq [main] Real time: 1.493 sec; CPU: 1.354 sec [Sun Dec 8 19:46:34 2019] Finished job 5. 3 of 7 steps (43%) done [Sun Dec 8 19:46:34 2019] rule sort: input: mapped/A.bam output: mapped/A.sorted.bam jobid: 3 wildcards: sample=A Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 Removing temporary output file mapped/A.bam. [Sun Dec 8 19:46:35 2019] Finished job 3. 4 of 7 steps (57%) done [Sun Dec 8 19:46:35 2019] rule call: input: data/genome.fa, mapped/A.sorted.bam, mapped/B.sorted.bam output: calls/all.vcf jobid: 1 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/288ca49b [warning] samtools mpileup option `g` is functional, but deprecated. Please switch to using bcftools mpileup in future. [mpileup] 2 samples in 2 input files Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid [Sun Dec 8 19:46:37 2019] Finished job 1. 5 of 7 steps (71%) done [Sun Dec 8 19:46:37 2019] rule stats: input: calls/all.vcf output: plots/quals.svg jobid: 2 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/e123f192 [Sun Dec 8 19:46:38 2019] Finished job 2. 6 of 7 steps (86%) done [Sun Dec 8 19:46:38 2019] localrule all: input: calls/all.vcf, plots/quals.svg jobid: 0 [Sun Dec 8 19:46:38 2019] Finished job 0. 7 of 7 steps (100%) done Complete log: /home/kimoton/snakemake-demo/.snakemake/log/2019-12-08T194627.149644.snakemake.log 12.27user 3.69system 0:12.59elapsed 126%CPU (0avgtext+0avgdata 78868maxresident)k 0inputs+17928outputs (0major+271997minor)pagefaults 0swaps
実行ログから、
Bのmapping → Bのsort → Aのmapping → Aのsort
→ A, Bをinputとした変異call → plot
という流れ、つまりワークフローが直列で実行されていることがわかります。
time
コマンドの結果から、
実行時間 | 使用CPU |
---|---|
12.59s | 126% |
直列でもすぐ終わる処理なので並列化のありがたみがわからない説が濃厚ですね。
並列で実行(bwa並列化に全力)
まずは手持ちPCの論理CPU数を確認しましょう。
参考: 物理 CPU、CPU コア、および論理 CPU の数を確認する
$ grep processor /proc/cpuinfo | wc -l (snakemake) 12
自分の環境では12 coreでした。
ではこれをフル活用して同じパイプラインを流してみましょう。
先ほど同様に、結果ファイルを削除します。
$ rm -rf mapped calls plots
Snakefile
を編集して、bwa実行に12coreフルで使用するように変更します。
rule bwa: input: "data/genome.fa", "data/samples/{sample}.fastq" output: temp("mapped/{sample}.bam") conda: "envs/mapping.yaml" threads: 12 shell: "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
--cores 12
を指定していざ実行!
$ /usr/bin/time snakemake --use-conda --cores 12 (snakemake) Building DAG of jobs... Using shell: /bin/bash Provided cores: 12 Rules claiming more threads will be scaled down. Job counts: count jobs 1 all 2 bwa 1 call 2 sort 1 stats 7 [Sun Dec 8 20:05:10 2019] rule bwa: input: data/genome.fa, data/samples/B.fastq output: mapped/B.bam jobid: 6 wildcards: sample=B threads: 12 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 25000 sequences (2525000 bp)... [M::mem_process_seqs] Processed 25000 reads in 4.439 CPU sec, 0.453 real sec [main] Version: 0.7.17-r1188 [main] CMD: bwa mem -t 12 data/genome.fa data/samples/B.fastq [main] Real time: 0.641 sec; CPU: 4.519 sec [Sun Dec 8 20:05:12 2019] Finished job 6. 1 of 7 steps (14%) done [Sun Dec 8 20:05:12 2019] rule sort: input: mapped/B.bam output: mapped/B.sorted.bam jobid: 4 wildcards: sample=B Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 Removing temporary output file mapped/B.bam. [Sun Dec 8 20:05:14 2019] Finished job 4. 2 of 7 steps (29%) done [Sun Dec 8 20:05:14 2019] rule bwa: input: data/genome.fa, data/samples/A.fastq output: mapped/A.bam jobid: 5 wildcards: sample=A threads: 12 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 25000 sequences (2525000 bp)... [M::mem_process_seqs] Processed 25000 reads in 4.523 CPU sec, 0.478 real sec [main] Version: 0.7.17-r1188 [main] CMD: bwa mem -t 12 data/genome.fa data/samples/A.fastq [main] Real time: 0.732 sec; CPU: 4.626 sec [Sun Dec 8 20:05:15 2019] Finished job 5. 3 of 7 steps (43%) done [Sun Dec 8 20:05:15 2019] rule sort: input: mapped/A.bam output: mapped/A.sorted.bam jobid: 3 wildcards: sample=A Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 Removing temporary output file mapped/A.bam. [Sun Dec 8 20:05:16 2019] Finished job 3. 4 of 7 steps (57%) done [Sun Dec 8 20:05:16 2019] rule call: input: data/genome.fa, mapped/A.sorted.bam, mapped/B.sorted.bam output: calls/all.vcf jobid: 1 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/288ca49b [warning] samtools mpileup option `g` is functional, but deprecated. Please switch to using bcftools mpileup in future. [mpileup] 2 samples in 2 input files Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid [Sun Dec 8 20:05:18 2019] Finished job 1. 5 of 7 steps (71%) done [Sun Dec 8 20:05:18 2019] rule stats: input: calls/all.vcf output: plots/quals.svg jobid: 2 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/e123f192 [Sun Dec 8 20:05:20 2019] Finished job 2. 6 of 7 steps (86%) done [Sun Dec 8 20:05:20 2019] localrule all: input: calls/all.vcf, plots/quals.svg jobid: 0 [Sun Dec 8 20:05:20 2019] Finished job 0. 7 of 7 steps (100%) done Complete log: /home/kimoton/snakemake-demo/.snakemake/log/2019-12-08T200510.793223.snakemake.log 17.77user 3.82system 0:10.25elapsed 210%CPU (0avgtext+0avgdata 76884maxresident)k 0inputs+17928outputs (0major+273156minor)pagefaults 0swaps
先ほど同様、ワークフロー自体は直列に実行されているため、
Bのmapping → Bのsort → Aのmapping → Aのsort
→ A, Bをinputとした変異call → plot
となっていることが実行ログで分かります。
一方でbwaの処理が最大限並列化されているため先ほどよりは早くなるはず。
結果は!
実行時間 | 使用CPU |
---|---|
10.25s | 210% |
あれ、そんな高速化してない、、
CPUも全然使われてない、、
並列で実行(A, Bを並列処理)
A, Bを最大限並列処理するには、6coreずつ割り当てればよいことになります。
bwa
ruleで使用するcore数を6に変更します。
rule bwa: input: "data/genome.fa", "data/samples/{sample}.fastq" output: temp("mapped/{sample}.bam") conda: "envs/mapping.yaml" threads: 6 shell: "bwa mem -t {threads} {input} | samtools view -Sb - > {output}"
--cores 12
を指定していざ実行!
$ /usr/bin/time snakemake --use-conda --cores 12 (snakemake) Building DAG of jobs... Nothing to be done. Complete log: /home/kimoton/snakemake-demo/.snakemake/log/2019-12-08T202007.448808.snakemake.log 0.48user 0.05system 0:00.97elapsed 56%CPU (0avgtext+0avgdata 25240maxresident)k 0inputs+24outputs (0major+8001minor)pagefaults 0swaps ~/snakemake-demo rm -rf mapped calls plots (snakemake) ~/snakemake-demo /usr/bin/time snakemake --use-conda --cores 12 (snakemake) Building DAG of jobs... Using shell: /bin/bash Provided cores: 12 Rules claiming more threads will be scaled down. Job counts: count jobs 1 all 2 bwa 1 call 2 sort 1 stats 7 [Sun Dec 8 20:20:12 2019] rule bwa: input: data/genome.fa, data/samples/B.fastq output: mapped/B.bam jobid: 6 wildcards: sample=B threads: 6 [Sun Dec 8 20:20:12 2019] rule bwa: input: data/genome.fa, data/samples/A.fastq output: mapped/A.bam jobid: 5 wildcards: sample=A threads: 6 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::bwa_idx_load_from_disk] read 0 ALT contigs [M::process] read 25000 sequences (2525000 bp)... [M::process] read 25000 sequences (2525000 bp)... [M::mem_process_seqs] Processed 25000 reads in 4.491 CPU sec, 0.836 real sec [M::mem_process_seqs] Processed 25000 reads in 4.429 CPU sec, 0.828 real sec [main] Version: 0.7.17-r1188 [main] CMD: bwa mem -t 6 data/genome.fa data/samples/B.fastq [main] Real time: 1.122 sec; CPU: 4.608 sec [Sun Dec 8 20:20:15 2019] Finished job 6. 1 of 7 steps (14%) done [Sun Dec 8 20:20:15 2019] rule sort: input: mapped/B.bam output: mapped/B.sorted.bam jobid: 4 wildcards: sample=B [main] Version: 0.7.17-r1188 [main] CMD: bwa mem -t 6 data/genome.fa data/samples/A.fastq [main] Real time: 1.124 sec; CPU: 4.551 sec [Sun Dec 8 20:20:15 2019] Finished job 5. 2 of 7 steps (29%) done [Sun Dec 8 20:20:15 2019] rule sort: input: mapped/A.bam output: mapped/A.sorted.bam jobid: 3 wildcards: sample=A Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/22d67607 Removing temporary output file mapped/A.bam. [Sun Dec 8 20:20:16 2019] Finished job 3. 3 of 7 steps (43%) done Removing temporary output file mapped/B.bam. [Sun Dec 8 20:20:16 2019] Finished job 4. 4 of 7 steps (57%) done [Sun Dec 8 20:20:16 2019] rule call: input: data/genome.fa, mapped/A.sorted.bam, mapped/B.sorted.bam output: calls/all.vcf jobid: 1 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/288ca49b [warning] samtools mpileup option `g` is functional, but deprecated. Please switch to using bcftools mpileup in future. [mpileup] 2 samples in 2 input files Note: none of --samples-file, --ploidy or --ploidy-file given, assuming all sites are diploid [Sun Dec 8 20:20:18 2019] Finished job 1. 5 of 7 steps (71%) done [Sun Dec 8 20:20:18 2019] rule stats: input: calls/all.vcf output: plots/quals.svg jobid: 2 Activating conda environment: /home/kimoton/snakemake-demo/.snakemake/conda/e123f192 [Sun Dec 8 20:20:19 2019] Finished job 2. 6 of 7 steps (86%) done [Sun Dec 8 20:20:19 2019] localrule all: input: calls/all.vcf, plots/quals.svg jobid: 0 [Sun Dec 8 20:20:19 2019] Finished job 0. 7 of 7 steps (100%) done Complete log: /home/kimoton/snakemake-demo/.snakemake/log/2019-12-08T202012.421181.snakemake.log 18.99user 3.87system 0:08.17elapsed 279%CPU (0avgtext+0avgdata 78760maxresident)k 0inputs+17952outputs (0major+303308minor)pagefaults 0swaps
今度はワークフロー自体が並列化されており、
Bのmapping / Aのmapping → Bのsort / Aのsort
→ A, Bをinputとした変異call → plot
という美しい流れでsort部分も含め並列に処理できていることがわかります。
結果は!
実行時間 | 使用CPU |
---|---|
8.17s | 279% |
おー早くなった。
まとめ
- SnakemakeのThreadsディレクティブは最大何thread使うかを意味している。
- 並列化はサンプル間の処理を並列化すべし(例外あり)。
- 例が悪い。もっと時間のかかる処理で試すべきだった。