打开APP
userphoto
未登录

开通VIP,畅享免费电子书等14项超值服

开通VIP
02.GATK肿瘤基因变异最佳实践SnakeMake流程:Call 变异

代码地址

https://jihulab.com/BioQuest/smkhss
https://github.com/BioQuestX/smkhss

Mutect

与HaplotypeCaller类似,Mutect2通过active region单倍型的局部从头组装,同时Call SNV和INDEL变异。

Mutect有三种模式,这里使用肿瘤与正常配对模式;

Mutect为了节省计算不再考虑配对样本中的 germline 变异;

panel-of-normals :可选输入,作为Mutect Call 变异时的噪音参考;

germline-resource:可选输入,包含等位基因部分的种系测序群体vcf;

f1r2:F1R2 counts 可以提供给LearnReadOrientationModel作为输入

rule Mutect:
    input:
        ref="path/to/Homo_sapiens_assembly38.fasta",
        sam=["results/prepared/{s}.T.cram","results/prepared/{s}.N.cram"],
        sam_idx=["results/prepared/{s}.T.cram.crai","results/prepared/{s}.N.cram.crai"],
        intervals="path/to/captured_regions.bed",
        germline="path/to/af-only-gnomad.hg38.vcf.gz",
        germline_idx="path/to/af-only-gnomad.hg38.vcf.gz.tbi",
        pon=gatk_dict"path/to/1000g_pon.hg38.vcf.gz",
        pon_idx="path/to/1000g_pon.hg38.vcf.gz.tbi",
    output:
        vcf="results/called/{s}.vcf.gz",
        f1r2="results/called/{s}.f1r2.tar.gz"
    threads: 32
    resources:
        mem_mb=2048
    params:
        normal_sample_name="{s}.N"
    log:
        "logs/call/Mutect2/{s}.log"
    conda: "../envs/Mutect.yaml"
    script:
        "../scripts/Mutect.py"

Mutect.py主要的代码

官方的wrapper有错误,所以自己改了下

with tempfile.TemporaryDirectory() as tmpdir:
    shell(
        "gatk --java-options '{java_opts}' Mutect2"  # Tool and its subprocess
        " --native-pair-hmm-threads {snakemake.threads}"
        " {sam}"  # Path to input mapping file
        " --reference {snakemake.input.ref}"  # Path to reference fasta file
        " {normal}"
        " --f1r2-tar-gz {snakemake.output.f1r2}"  # Path to output f1r2 count file
        " {germline_resource}"  # Optional path to optional germline resource VCF
        " {intervals}"  # Optional path to optional bed intervals
        " {pon} "  # Optional path to panel of normals
        " {extra}"  # Extra parameters
        " --tmp-dir {tmpdir}"
        " --output {snakemake.output.vcf}"  # Path to output vcf file
        " {bam_output}"  # Path to output bam file, optional
        " {log}"  # Logging behaviour
    )

计算污染信息

估计肿瘤样本样本之间交叉污染的 reads 数,估计每个肿瘤样本拷贝数片段。

输入cram文件需要提供参考基因组,但是wrapper没有提供输入参数,所以写在额外的参数extra中。

rule GetPileupSummaries:
    input:
        bam="results/prepared/{s}.{g}.cram",
        intervals="path/to/captured_regions.bed",
        variants="path/to/small_exac_common_3.hg38.vcf.gz",
    output:
        "results/called/{s}.{g}.getpileupsummaries.table"
    threads: 32
    resources:
        mem_mb=1024
    params:
        extra="-R path/to/Homo_sapiens_assembly38.fasta"
    log:
        "logs/call/GetPileupSummaries/{s}.{g}.log"
    wrapper:
        "bio/gatk/getpileupsummaries"

rule CalculateContamination:
    input:
        getpileupsummaries_table="results/called/{s}.T.getpileupsummaries.table",
        matched_normal="results/called/{s}.N.getpileupsummaries.table"
    output:
        contamination="results/called/{s}.contamination.table",
        segmentation="results/called/{s}.segmentation.table",
    conda:
        "../envs/CalculateContamination.yaml"
    log:
        "logs/call/CalculateContamination/{s}.log",
    script:
        "../scripts/CalculateContamination.py"

学习测序偏好

从Mutect2的F1R2 count输出结果,来学习测序偏好,产生的结果可以用于FilterMutectCalls过滤。

rule LearnReadOrientationModel:
    input:
        f1r2="results/called/{s}.f1r2.tar.gz",
    output:
        "results/called/{s}.artifacts_prior.f1r2.tar.gz"
    resources:
        mem_mb=1024
    log:
        "logs/call/LearnReadOrientationModel/{s}.T.log"
    wrapper:
        "bio/gatk/learnreadorientationmodel"

过滤变异

contamination、segmentation:可选输入,过滤污染产生的变异

f1r2:可选输入,过滤测序偏好产生的变异

--max-alt-allele-count:一个位点最多有几种变异类型

rule FilterMutectCalls:
    input:
        vcf="results/called/{s}.vcf.gz",
        ref="path/to/Homo_sapiens_assembly38.fasta",
        intervals="path/to/captured_regions.bed",
        contamination="results/called/{s}.contamination.table",
        segmentation="results/called/{s}.segmentation.table",
        f1r2="results/called/{s}.artifacts_prior.f1r2.tar.gz"
    output:
        vcf="results/called/{s}.filtered.vcf.gz"
    log:
        "logs/call/FilterMutectCalls/{s}.log"
    params:
        extra="--max-alt-allele-count 3"
    resources:
        mem_mb=1024
    wrapper:
        "bio/gatk/filtermutectcalls"

合并VCF

最后使用 picard 合并所有样本的 vcf 文件

rule merge_filtered:
    input:
        vcfs=expand("results/called/{s}.filtered.vcf.gz",s=samples.Sample),
    output:
        "results/called/all.vcf.gz",
    log:
        "logs/call/merge_filtered.log",
    resources:
        mem_mb=2048
    wrapper:
        "bio/picard/mergevcfs"

reference

https://gatk.broadinstitute.org/hc/en-us/articles/360035894731-Somatic-short-variant-discovery-SNVs-Indels-
https://www.jianshu.com/p/4e994a171555
本站仅提供存储服务,所有内容均由用户发布,如发现有害或侵权内容,请点击举报
打开APP,阅读全文并永久保存 查看更多类似文章
猜你喜欢
类似文章
【热】打开小程序,算一算2024你的财运
一步一步用Snakemake搭建gatk4生成正常样本的germline突变数据库的流程
肿瘤WES项目实战演练分享及学习班通知
GATK4最佳实践-体细胞突变的检测与识别
vcf文件合并
变异信息那些事(中)~初级找变异
从零开始完整学习全基因组测序(WGS)数据分析:第4节 构建WGS主流程 | Public Library of Bioinformatics
更多类似文章 >>
生活服务
热点新闻
分享 收藏 导长图 关注 下载文章
绑定账号成功
后续可登录账号畅享VIP特权!
如果VIP功能使用有故障,
可点击这里联系客服!

联系客服