Uniform reprocessing of more than 20,000 RNA-seq samples (TCGA + GTEx + TARGET)

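The RNA-seq data produced by large-scale projects are already abundant, but anyone who wants to analyze several of these databases together has to deal with batch effects. The UCSC team therefore reprocessed all of the data with a single uniform pipeline on Amazon's cloud, at a cost of 1.3 US dollars per sample.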
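To put the per-sample price in perspective, here is a minimal back-of-the-envelope sketch. The sample count of 20,000 is an assumption taken from the "more than 20,000 samples" in the title, not an exact figure, and `total_cost_usd` is just an illustrative helper name.

```python
# Rough cost estimate for the whole recompute.
# ASSUMPTION: 20,000 samples (the title only says "more than 20,000").
COST_PER_SAMPLE_USD = 1.3  # per-sample cost quoted in the article


def total_cost_usd(n_samples, cost_per_sample=COST_PER_SAMPLE_USD):
    """Return the total compute cost in USD for n_samples samples."""
    return n_samples * cost_per_sample


print(f"about {total_cost_usd(20_000):,.0f} USD for 20,000 samples")
```

The real total depends on the exact number of samples processed, so treat the result as an order-of-magnitude estimate only.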

Published in Nature Biotechnology: https://doi.org/10.1038/nbt.3772

The three databases are:

  1. The Cancer Genome Atlas (TCGA)

  2. Genotype-Tissue Expression (GTEx)

  3. Therapeutically Applicable Research To Generate Effective Treatments (TARGET)

A web tool is also provided for querying the data, for example:

Differential gene and isoform expression of FOXM1 transcription factor in TCGA vs. GTEx

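Data-processing pipeline

As shown in the figure: CutAdapt was used for adapter trimming, STAR was used for alignment, and RSEM and Kallisto were used as quantifiers.

Pipeline overview

If you have questions about the RNA-seq data-processing pipeline, go straight to my complete 74-hour introductory bioinformatics video series, the 生信技能树 video course learning path. The videos are this good and they are free!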

Choice of reference genome

  • STAR, RSEM, and Kallisto indexes were all built with the same reference genome. HG38 (no alt analysis) with overlapping genes from the PAR locus removed (chrY:10,000-2,781,479 and chrY:56,887,902-57,217,415).

  • ftp://ftp.ncbi.nlm.nih.gov/genomes/archive/old_genbank/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines
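The PAR exclusion described above can be sketched as a simple interval filter. The article does not say how the authors removed these genes, so this is only an illustration: it assumes 1-based inclusive coordinates (as in GTF files) and drops every GTF record on chrY that overlaps one of the two quoted intervals. The names `overlaps_par` and `drop_par_features` are hypothetical.

```python
# PAR intervals on chrY exactly as quoted in the article.
PAR_REGIONS = [
    ("chrY", 10_000, 2_781_479),
    ("chrY", 56_887_902, 57_217_415),
]


def overlaps_par(chrom, start, end, regions=PAR_REGIONS):
    """True if the inclusive interval [start, end] overlaps any PAR region."""
    return any(
        chrom == r_chrom and start <= r_end and end >= r_start
        for r_chrom, r_start, r_end in regions
    )


def drop_par_features(gtf_lines):
    """Keep comment lines and every GTF record that does not overlap a PAR region.

    ASSUMPTION: tab-separated GTF with chromosome in column 1 and
    start/end in columns 4 and 5.
    """
    kept = []
    for line in gtf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, start, end = fields[0], int(fields[3]), int(fields[4])
        if not overlaps_par(chrom, start, end):
            kept.append(line)
    return kept


# Toy records with made-up gene IDs, for illustration only.
toy = [
    'chrY\tHAVANA\tgene\t15000\t16000\t.\t+\t.\tgene_id "toy1";\n',
    'chrY\tHAVANA\tgene\t3000000\t3000500\t.\t+\t.\tgene_id "toy2";\n',
]
print(len(drop_par_features(toy)), "of", len(toy), "toy records kept")
```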

Choice of annotation files

  • RSEM: Gencode V23 comprehensive annotation (CHR)

  • http://www.gencodegenes.org/releases/23.html first row

  • Kallisto: Gencode V23 comprehensive annotation (ALL)

  • http://www.gencodegenes.org/releases/23.html second row

Choice of software parameters

  • STAR

  • sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/star --runThreadN 32 --runMode genomeGenerate --genomeDir /data/genomeDir --genomeFastaFiles hg38.fa --sjdbGTFfile gencode.v23.annotation.gtf

  • Kallisto

  • sudo docker run -v $(pwd):/data quay.io/ucsc_cgl/kallisto index -i hg38.gencodeV23.transcripts.idx transcriptome_hg38_gencodev23.fasta

  • Kallisto index that was used during the recompute is available here.

  • RSEM

  • sudo docker run -v $(pwd):/data --entrypoint=rsem-prepare-reference jvivian/rsem -p 4 --gtf gencode.v23.annotation.gtf hg38.fa hg38
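Since all three indexes are meant to be built from the same reference, one way to keep the file names consistent is to assemble the three commands from shared variables. The sketch below only builds and prints the command lines; it does not run docker. The images and flags are copied from the commands listed above, while the function name `build_index_commands` and the `workdir` parameter are illustrative assumptions.

```python
import os
import shlex


def build_index_commands(genome_fasta, gtf, transcriptome_fasta, workdir=None):
    """Return the STAR, Kallisto and RSEM index commands as argument lists."""
    workdir = workdir or os.getcwd()  # stands in for $(pwd) in the shell version
    docker = ["sudo", "docker", "run", "-v", f"{workdir}:/data"]
    return {
        "star": docker + [
            "quay.io/ucsc_cgl/star",
            "--runThreadN", "32",
            "--runMode", "genomeGenerate",
            "--genomeDir", "/data/genomeDir",
            "--genomeFastaFiles", genome_fasta,
            "--sjdbGTFfile", gtf,
        ],
        "kallisto": docker + [
            "quay.io/ucsc_cgl/kallisto",
            "index", "-i", "hg38.gencodeV23.transcripts.idx",
            transcriptome_fasta,
        ],
        "rsem": docker + [
            "--entrypoint=rsem-prepare-reference",
            "jvivian/rsem",
            "-p", "4",
            "--gtf", gtf,
            genome_fasta,
            "hg38",
        ],
    }


commands = build_index_commands(
    genome_fasta="hg38.fa",
    gtf="gencode.v23.annotation.gtf",
    transcriptome_fasta="transcriptome_hg38_gencodev23.fasta",
)
for tool, argv in commands.items():
    print(f"{tool}: {shlex.join(argv)}")
```

Returning argument lists rather than strings means the commands could later be passed to `subprocess.run` without going through a shell, if you choose to execute them.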

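As you can see, the three elements above (reference genome, annotation files, and software parameters) follow the same basic pattern I used when writing tutorials on the 生信菜鸟团 blog five years ago.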

Raw data

Nature Publication Supplementary Note 7 – Data Availability

Submitter sample ID to Xena sample ID mapping

TCGA mapping

GTEx mapping

TARGET mapping

Datasets finally released for download

  • GTEX (11 datasets)

  • TARGET Pan-Cancer (PANCAN) (12 datasets)

  • TCGA and TARGET Pan-Cancer (PANCAN) (4 datasets)

  • TCGA Pan-Cancer (PANCAN) (10 datasets)

  • TCGA TARGET GTEx (13 datasets)

Among these, the TCGA TARGET GTEx cohort (all three databases combined) contains 13 datasets:

cohort: TCGA TARGET GTEx

The expression matrices have impressively large sample sizes:

  • RSEM expected_count

    (n=19,109)

    UCSC Toil RNAseq Recompute

  • RSEM expected_count (DESeq2 standardized)

    (n=19,039)

    UCSC Toil RNAseq Recompute

    RSEM expected_count output normalized using DESeq2

  • RSEM fpkm

    (n=19,131)

    UCSC Toil RNAseq Recompute

  • RSEM norm_count

    (n=19,120)

    UCSC Toil RNAseq Recompute

    TCGA TARGET GTEx gene expression by UCSC TOIL RNA-seq recompute

  • RSEM tpm

    (n=19,131)

    UCSC Toil RNAseq Recompute

phenotype

  • TCGA GTEX main categories

    (n=17,221)

    UCSC Toil RNAseq Recompute

  • TCGA survival data

    (n=10,496)

    UCSC Toil RNAseq Recompute

  • TCGA TARGET GTEX selected phenotypes

    (n=19,131)

    UCSC Toil RNAseq Recompute

somatic mutation (SNP and INDEL)

  • TCGA somatic mutations (Pan-cancer Atlas MC3 public version)

    (n=8,463)

    UCSC Toil RNAseq Recompute

transcript expression RNAseq

  • RSEM expected_count

    (n=19,109)

    UCSC Toil RNAseq Recompute

    TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute

  • RSEM fpkm

    (n=19,129)

    UCSC Toil RNAseq Recompute

    TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute

  • RSEM isoform percentage

    (n=19,131)

    UCSC Toil RNAseq Recompute

    TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute

  • RSEM tpm

    (n=19,131)

    UCSC Toil RNAseq Recompute

    TCGA TARGET GTEx transcript expression by RSEM using UCSC TOIL RNA-seq recompute
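The sample counts listed above differ between datasets (for example, n=19,109 for gene-level RSEM expected_count versus n=19,131 for RSEM tpm, and n=17,221 for the main-categories phenotype table), so it is worth restricting an analysis to the samples that every table shares. Below is a minimal sketch using only the standard library. The counts are copied from the listing above; the sample IDs are made-up placeholders, and `shared_samples` is a hypothetical helper name.

```python
# Gene-level sample counts as listed in the article.
GENE_LEVEL_N = {
    "RSEM expected_count": 19_109,
    "RSEM expected_count (DESeq2 standardized)": 19_039,
    "RSEM fpkm": 19_131,
    "RSEM norm_count": 19_120,
    "RSEM tpm": 19_131,
}

largest = max(GENE_LEVEL_N.values())
for name, n in GENE_LEVEL_N.items():
    print(f"{name}: {largest - n} fewer samples than the largest matrix")


def shared_samples(*sample_lists):
    """Return the IDs present in every list, in the order of the first list."""
    if not sample_lists:
        return []
    common = set(sample_lists[0]).intersection(*map(set, sample_lists[1:]))
    return [sample for sample in sample_lists[0] if sample in common]


# Placeholder IDs for illustration only; real IDs come from the downloaded files.
expression_ids = ["SAMPLE_A", "SAMPLE_B", "SAMPLE_C"]
phenotype_ids = ["SAMPLE_C", "SAMPLE_A", "SAMPLE_D"]
print(shared_samples(expression_ids, phenotype_ids))
```

Keeping the order of the first list makes it easy to subset the columns of an expression matrix without reshuffling them.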
