代码地址
https://jihulab.com/BioQuest/smkhss
https://github.com/BioQuestX/smkhss
Mutect
与HaplotypeCaller类似,Mutect2通过active region单倍型的局部从头组装,同时Call SNV和INDEL变异。
Mutect有三种模式,这里使用肿瘤与正常配对模式;
Mutect为了节省计算不再考虑配对样本中的 germline 变异;
panel-of-normals
:可选输入,作为Mutect Call 变异时的噪音参考;
germline-resource
:可选输入,包含等位基因部分的种系测序群体vcf;
f1r2
:F1R2 counts 可以提供给LearnReadOrientationModel作为输入
rule Mutect:
input:
ref="path/to/Homo_sapiens_assembly38.fasta",
sam=["results/prepared/{s}.T.cram","results/prepared/{s}.N.cram"],
sam_idx=["results/prepared/{s}.T.cram.crai","results/prepared/{s}.N.cram.crai"],
intervals="path/to/captured_regions.bed",
germline="path/to/af-only-gnomad.hg38.vcf.gz",
germline_idx="path/to/af-only-gnomad.hg38.vcf.gz.tbi",
pon=gatk_dict"path/to/1000g_pon.hg38.vcf.gz",
pon_idx="path/to/1000g_pon.hg38.vcf.gz.tbi",
output:
vcf="results/called/{s}.vcf.gz",
f1r2="results/called/{s}.f1r2.tar.gz"
threads: 32
resources:
mem_mb=2048
params:
normal_sample_name="{s}.N"
log:
"logs/call/Mutect2/{s}.log"
conda: "../envs/Mutect.yaml"
script:
"../scripts/Mutect.py"
Mutect.py
主要的代码
官方的wrapper有错误,所以自己改了下
with tempfile.TemporaryDirectory() as tmpdir:
shell(
"gatk --java-options '{java_opts}' Mutect2" # Tool and its subprocess
" --native-pair-hmm-threads {snakemake.threads}"
" {sam}" # Path to input mapping file
" --reference {snakemake.input.ref}" # Path to reference fasta file
" {normal}"
" --f1r2-tar-gz {snakemake.output.f1r2}" # Path to output f1r2 count file
" {germline_resource}" # Optional path to optional germline resource VCF
" {intervals}" # Optional path to optional bed intervals
" {pon} " # Optional path to panel of normals
" {extra}" # Extra parameters
" --tmp-dir {tmpdir}"
" --output {snakemake.output.vcf}" # Path to output vcf file
" {bam_output}" # Path to output bam file, optional
" {log}" # Logging behaviour
)
估计肿瘤样本样本之间交叉污染的 reads 数,估计每个肿瘤样本拷贝数片段。
输入cram文件需要提供参考基因组,但是wrapper没有提供输入参数,所以写在额外的参数extra中。
rule GetPileupSummaries:
input:
bam="results/prepared/{s}.{g}.cram",
intervals="path/to/captured_regions.bed",
variants="path/to/small_exac_common_3.hg38.vcf.gz",
output:
"results/called/{s}.{g}.getpileupsummaries.table"
threads: 32
resources:
mem_mb=1024
params:
extra="-R path/to/Homo_sapiens_assembly38.fasta"
log:
"logs/call/GetPileupSummaries/{s}.{g}.log"
wrapper:
"bio/gatk/getpileupsummaries"
rule CalculateContamination:
input:
getpileupsummaries_table="results/called/{s}.T.getpileupsummaries.table",
matched_normal="results/called/{s}.N.getpileupsummaries.table"
output:
contamination="results/called/{s}.contamination.table",
segmentation="results/called/{s}.segmentation.table",
conda:
"../envs/CalculateContamination.yaml"
log:
"logs/call/CalculateContamination/{s}.log",
script:
"../scripts/CalculateContamination.py"
从Mutect2的F1R2 count输出结果,来学习测序偏好,产生的结果可以用于FilterMutectCalls过滤。
rule LearnReadOrientationModel:
input:
f1r2="results/called/{s}.f1r2.tar.gz",
output:
"results/called/{s}.artifacts_prior.f1r2.tar.gz"
resources:
mem_mb=1024
log:
"logs/call/LearnReadOrientationModel/{s}.T.log"
wrapper:
"bio/gatk/learnreadorientationmodel"
contamination、segmentation:可选输入,过滤污染产生的变异
f1r2:可选输入,过滤测序偏好产生的变异
--max-alt-allele-count:一个位点最多有几种变异类型
rule FilterMutectCalls:
input:
vcf="results/called/{s}.vcf.gz",
ref="path/to/Homo_sapiens_assembly38.fasta",
intervals="path/to/captured_regions.bed",
contamination="results/called/{s}.contamination.table",
segmentation="results/called/{s}.segmentation.table",
f1r2="results/called/{s}.artifacts_prior.f1r2.tar.gz"
output:
vcf="results/called/{s}.filtered.vcf.gz"
log:
"logs/call/FilterMutectCalls/{s}.log"
params:
extra="--max-alt-allele-count 3"
resources:
mem_mb=1024
wrapper:
"bio/gatk/filtermutectcalls"
最后使用 picard 合并所有样本的 vcf 文件
rule merge_filtered:
input:
vcfs=expand("results/called/{s}.filtered.vcf.gz",s=samples.Sample),
output:
"results/called/all.vcf.gz",
log:
"logs/call/merge_filtered.log",
resources:
mem_mb=2048
wrapper:
"bio/picard/mergevcfs"
https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-
https://www.jianshu.com/p/4e994a171555
联系客服