Error when building a 10X single-cell transcriptome index: Property has invalid whitespace character in GTF

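To build the 10X single-cell transcriptome index files for the axolotl (see the earlier post "Building genome indexes for other species for 10X cellranger"), the GTF file first has to be filtered. Running the command below failed with the error quoted after it.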

cellranger-7.1.0/bin/cellranger mkgtf \
  AmexT_v47-AmexG_v6.0-DD.gtf AmexT_v47-AmexG_v6.0-DD.filtered.gtf \
  --attribute=gene_biotype:protein_coding \
  --attribute=gene_biotype:lincRNA \
  --attribute=gene_biotype:antisense \
  --attribute=gene_biotype:IG_LV_gene \
  --attribute=gene_biotype:IG_V_gene \
  --attribute=gene_biotype:IG_V_pseudogene \
  --attribute=gene_biotype:IG_D_gene \
  --attribute=gene_biotype:IG_J_gene \
  --attribute=gene_biotype:IG_J_pseudogene \
  --attribute=gene_biotype:IG_C_gene \
  --attribute=gene_biotype:IG_C_pseudogene \
  --attribute=gene_biotype:TR_V_gene \
  --attribute=gene_biotype:TR_V_pseudogene \
  --attribute=gene_biotype:TR_D_gene \
  --attribute=gene_biotype:TR_J_gene \
  --attribute=gene_biotype:TR_J_pseudogene \
  --attribute=gene_biotype:TR_C_gene


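Property 'transcript_id' has invalid whitespace character in GTF line 3: chr10p ambMex60DD      exon    313039  314183  1000    +       .       gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "1"
Please fix your GTF and start again.

This error means the GTF file itself is the problem: the transcript_id value contains spaces. The 10X website recommends Ensembl GTF files, but the client insisted on using the GTF file he supplied, so the file has to be processed until it meets 10X's requirements.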
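Before editing the file it helps to know how many records are affected. The following is a minimal sketch, not part of the original workflow, that scans GTF lines and reports every gene_id or transcript_id value containing whitespace. The function name find_whitespace_ids and the attribute regex are assumptions made for illustration; the sample record is the one from the error message.

```python
import re

# Matches `key "value"` pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def find_whitespace_ids(lines, keys=("gene_id", "transcript_id")):
    """Yield (line_number, key, value) for ID attributes that contain whitespace."""
    for lineno, line in enumerate(lines, start=1):
        # Skip comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        for key, value in ATTR_RE.findall(fields[8]):
            if key in keys and re.search(r"\s", value):
                yield lineno, key, value


prefix = "chr10p\tambMex60DD\texon\t313039\t314183\t1000\t+\t.\t"
sample = [
    prefix + 'gene_id "AMEX60DD000001"; '
    'transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "1"\n',
    prefix + 'gene_id "AMEX60DD000001"; '
    'transcript_id "AMEX60DD301000001.1"; exon_number "1"\n',
]
problems = list(find_whitespace_ids(sample))
for lineno, key, value in problems:
    print(f"line {lineno}: {key} = {value!r}")
```

To run it on a real file, pass an open file handle instead of the sample list, for example find_whitespace_ids(open("AmexT_v47-AmexG_v6.0-DD.gtf")).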

The 10X website lists the following requirements for GTF files:

  1. Recommended to retain only gene_id, transcript_ids, and gene_name attributes.
  2. Verify for any redundancy and order genes in the annotation file.
  3. Replace or remove the gene_ids that have empty values.
  4. Duplicate transcript_ids for multiple gene_id must be converted as unique (e.g. unknown_transcript_1 fields).
  5. Finally, make sure that all annotation records for a single gene are found together in order, one after the other (this step will need custom scripts).
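Items 3, 4 and 5 of this list can be checked mechanically. The sketch below is an illustration under assumptions, not a 10X tool: check_gtf and the helper rec are hypothetical names, and the sample records are made up. It reports lines with an empty gene_id, transcript_ids that appear under more than one gene_id, and genes whose records are not contiguous.

```python
import re
from collections import defaultdict

# Matches `key "value"` pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')


def check_gtf(lines):
    """Return (empty_gene_lines, shared_transcripts, scattered_genes)."""
    empty_gene_lines = []                  # item 3: empty or missing gene_id
    genes_by_transcript = defaultdict(set) # item 4: transcript_id -> gene_ids
    seen_genes, scattered = set(), set()   # item 5: non-contiguous genes
    previous_gene = None
    for lineno, line in enumerate(lines, start=1):
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        gene_id = attrs.get("gene_id", "").strip()
        if not gene_id:
            empty_gene_lines.append(lineno)
            continue
        if gene_id != previous_gene:
            # A gene that reappears after another gene is not contiguous.
            if gene_id in seen_genes:
                scattered.add(gene_id)
            seen_genes.add(gene_id)
            previous_gene = gene_id
        transcript_id = attrs.get("transcript_id")
        if transcript_id:
            genes_by_transcript[transcript_id].add(gene_id)
    shared = {t: sorted(g) for t, g in genes_by_transcript.items() if len(g) > 1}
    return empty_gene_lines, shared, sorted(scattered)


def rec(attrs):
    """Build a made-up exon record with the given attribute column."""
    return "chr1\tsrc\texon\t1\t10\t.\t+\t.\t" + attrs + "\n"


sample = [
    rec('gene_id "G1"; transcript_id "T1";'),
    rec('gene_id "G2"; transcript_id "T1";'),
    rec('gene_id ""; transcript_id "T3";'),
    rec('gene_id "G1"; transcript_id "T4";'),
]
result = check_gtf(sample)
print(result)
```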

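For this GTF file the following processing works: in column 9 (the attribute column, index 8 when counting from zero), keep only gene_id, transcript_id and gene_name, strip the superfluous text from their values, and drop the records whose gene_id is empty.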

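The record below shows what the superfluous information looks like.

The transcript_id value starts with LOC114595135 [nr]|ZNF268 [hs]|, but only AMEX60DD201000002.1 is needed; from gene_name only LOC114595135 is needed.

homolog, ORF_type and CDS all have to be removed, and the same goes for peptide.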

gene_id "AMEX60DD000002"
transcript_id "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"
gene_name "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"
homolog "XP_028581273.1"
ORF_type "Putative short"
CDS "ATGGCATTTTTGCTCGACAAAACCCAATCTGGAGGCAGTTGCCTCCAAGAACAATTGTATCCGTGTTCTGAATGTGACAAACAATTCAGTCATGAAAGACATCTAACTCAACATGAAAGAACACACATTGGAGAAAAACCTTATCAGTGTCCTGAATGTCAAAAGAAATTCATTCGGAAAAGCCAACTCAGTATACATGAGAGAATCCATACTGGAGAAAAACCGTATCAATGCTCAGAATGTCAGAAAAGATTCAGTCAAAAAAGCAATCTGACTGATCATCAGAAAAAGCACACTGGAGAAAAACCTTATCAGTGTCCTGAATGTCAAAAGAGATTTAGTCGGAAAGAGAATCTAAGGCGACATGAGCAACAACACACTGGAGAAAAACCTCATCAGTGTCCTGAATGTCAAAAGAGATTCATTTGTAAAAGCCAACTGACAATACATGAGAGGATCCACACTGGAGAAAAACCTTATCAATGCTCAGAATGTCAGAAAAGATTCAGTCAAAAAAGCAATCTGACTGATCATCAGAAAAAACACACTGGAGAAAAACCTTATCAGTGTGCGGAATGTCAAAAGAGATTCAGTCGAAAAGAGAATCTAAAGCAACATGAGCAACAACACACTGGAGACAAACCTCACCGGTGCTCAGAATGTCCAAAAAGATTCATTTGGAAAAGCCAACTGACTATACATGAGAGAAACCACACTGGAGAAAAACCTTATCAATGCTCAGAATGTCAGAAAGGATTTGGTCAAAAAGGCTGTATGACTAAACATAAAAGAAAACATACTGGAGAAAAACCTTATCAATGCTCAGAATGTCAGAAAAGATTCAGTCACAAAAGTGATCTAATTAGACATAAAAGAACACACACTGGAGAAAAACCTTATCAGTGCTCAGAATGTCAAAAGAGATTCAGTCACAAAGGCCATCTGACACAACATGGGAGAATCCACACTGGAGAAAAACCTTATCAGTGTTCAGAATGTCAGAAAAGATTCAGTCACAAAAGCAGTCTGACTGAGCATGAGAGAATCCACACTGGAGAACAACCTTATCAGTGTTCAGAATGTCAGAAAGGATTCAGTCATAAAGGCCATCTGACTGATCATCAGAGAAAACACACTGGAGAAAAACCTTATAAATGCTTGGAATGTCAAAAAGAATTCTGTCACAAAATCAGTCTGACTGAGCATGAGAGAAAACACACTGGAGAAAAACCTTATCCGTGCTCAGAATGTCAGAAAAGATTCAGTCAAAAAAGCAATCTGACTGATCATCAGAGAAAACACACTGGAGAAAAACCTTATAAATGCTCAGAATGTCCACAAGAATTCAGTCACAAAATCAGTCTGACTGTGCATGAGAGAAAACACACTGGAGAAAAACCTTATCAGTGTCCTGAATGTCAGAAGAGATTCAGTCGGCAAGATAATCTAAGGCAACATGAGAAACAACACACTGGAGAAAAACCTCACCAGTGCTCAGAATGTCCAAAAAGATTCATTTGGAAAAGCCAACTGACTATACATGAGAGAAACCACACTGGAGAAAAACCTTATCAGTGCTCAGAATGTCAGAAAGGATTCAGTCAAAAAGGCAGTATGACTAAACATAAAAGAAAACATGCTGGAGAAAAACCTTATCAATGCTCAGAATGTCAGAAAAGATTCAGTCAGAAAAGTGATCTAATTAAACATAAAAGAACACACACTGGAGAAAAACCTTATCAGTGCTCAGAATGTCAAAAAAGATTCATTCAGAAAATCAATCTGACTATACATGAGAGAATCCACACTGGAGAACAACCTTATCAGTGCTCAGAATGTCAGAAAAGATTCAGTCACAAAAGCAGTCTGACTGAGCATGAGAGAAAACACACTGGAGAAAACATGAGTGCTCTGAACGTCGGAACATTATCAGTACTGAAGAGGATCGGACTAAATATTGGGCAAACACCCTGTAGAAAAACCATCAAAGTGCTT
TGTTAG"; peptide "MAFLLDKTQSGGSCLQEQLYPCSECDKQFSHERHLTQHERTHIGEKPYQCPECQKKFIRKSQLSIHERIHTGEKPYQCSECQKRFSQKSNLTDHQKKHTGEKPYQCPECQKRFSRKENLRRHEQQHTGEKPHQCPECQKRFICKSQLTIHERIHTGEKPYQCSECQKRFSQKSNLTDHQKKHTGEKPYQCAECQKRFSRKENLKQHEQQHTGDKPHRCSECPKRFIWKSQLTIHERNHTGEKPYQCSECQKGFGQKGCMTKHKRKHTGEKPYQCSECQKRFSHKSDLIRHKRTHTGEKPYQCSECQKRFSHKGHLTQHGRIHTGEKPYQCSECQKRFSHKSSLTEHERIHTGEQPYQCSECQKGFSHKGHLTDHQRKHTGEKPYKCLECQKEFCHKISLTEHERKHTGEKPYPCSECQKRFSQKSNLTDHQRKHTGEKPYKCSECPQEFSHKISLTVHERKHTGEKPYQCPECQKRFSRQDNLRQHEKQHTGEKPHQCSECPKRFIWKSQLTIHERNHTGEKPYQCSECQKGFSQKGSMTKHKRKHAGEKPYQCSECQKRFSQKSDLIKHKRTHTGEKPYQCSECQKRFIQKINLTIHERIHTGEQPYQCSECQKRFSHKSSLTEHERKHTGENMSALNVGTLSVLKRIGLNIGQTPCRKTIKVLC*"
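The clean-up just described can be tried on a single attribute string using only the standard library. This is a sketch under stated assumptions rather than the script used for the real file: clean_attributes is a hypothetical helper, the pattern AMEX60DD followed by digits, a dot and a version number is inferred from the IDs shown in this post, and the fallback of deleting whitespace when no such ID is found is an illustrative choice.

```python
import re

# Matches `key "value"` pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+)\s+"([^"]*)"')
# Assumption: transcript IDs look like AMEX60DD201000002.1 in this annotation.
TRANSCRIPT_RE = re.compile(r"AMEX60DD\d+\.\d+")


def clean_attributes(attr_column):
    """Keep only gene_id, transcript_id and gene_name; return None if gene_id is empty."""
    attrs = dict(ATTR_RE.findall(attr_column))
    gene_id = attrs.get("gene_id", "").strip()
    if not gene_id:
        return None  # the caller should drop this record
    kept = [f'gene_id "{gene_id}"']
    if "transcript_id" in attrs:
        raw = attrs["transcript_id"]
        match = TRANSCRIPT_RE.search(raw)
        # Assumed fallback: if no AMEX60DD ID is found, just delete whitespace.
        transcript_id = match.group() if match else re.sub(r"\s+", "", raw)
        kept.append(f'transcript_id "{transcript_id}"')
    if "gene_name" in attrs:
        # Keep the first label, i.e. the text before any space, "[" or "|".
        name = re.split(r"[\s\[|]", attrs["gene_name"].strip(), maxsplit=1)[0]
        if name:
            kept.append(f'gene_name "{name}"')
    # The trailing semicolon follows the usual GTF convention.
    return "; ".join(kept) + ";"


raw = (
    'gene_id "AMEX60DD000002"; '
    'transcript_id "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"; '
    'gene_name "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"; '
    'homolog "XP_028581273.1"; ORF_type "Putative short"'
)
cleaned = clean_attributes(raw)
print(cleaned)
```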

The Python script is shown below. bioquest is a collection of data-processing functions I wrote; it is available at https://jihulab.com/NshimaBio/bioquest

import re
import bioquest as bq

in_file = "AmexT_v47-AmexG_v6.0-DD.gtf"
out_file = "AmexT_v47-AmexG_v6.0-DD.10X.gtf"

with open(in_file, 'r') as ifile, open(out_file, 'w') as ofile:
    for line in ifile:
        # Pass comment/header lines through unchanged.
        if line.startswith('#'):
            ofile.write(line)
            continue
        table = line.rstrip('\n').split('\t')
        # Column 9 (index 8) holds the attributes, separated by ';'.
        feature = table[8].split(';')
        new_feature = list()
        # Keep only the three attributes 10X recommends.
        feature = bq.st.greps(string=feature, pattern="(gene_id)|(transcript_id)|(gene_name)")
        gene_id = bq.st.greps(string=feature, pattern="gene_id")
        if gene_id:
            new_feature.append(gene_id[0].strip())
        else:
            # Skip records without a gene_id (the original `next` did nothing).
            continue
        transcript_id = bq.st.greps(string=feature, pattern="transcript_id")
        if transcript_id:
            # Keep only the AMEX60DD... identifier, dropping the "[nr]|...|" prefix.
            match = re.search(r'AMEX60DD.*\.\d', transcript_id[0])
            if match:
                new_feature.append(f'transcript_id "{match.group()}"')
        gene_name = bq.st.greps(string=feature, pattern="gene_name")
        if gene_name:
            # Strip whitespace, quotes, the key itself and everything from "[" or "|" on.
            gene_name = bq.st.remove(string=gene_name[0], pattern=r'(\s)|(\[.*\].*)|(gene_name)|(")|(\|.*)')
            new_feature.append(f'gene_name "{gene_name}"')
        table[8] = "; ".join(new_feature)
        ofile.write("\t".join(table) + "\n")
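The script above does not reorder records, so the last 10X requirement (all records of one gene found together) still has to be handled separately if the file violates it. A minimal sketch of one way to do that, assuming the cleaned file fits in memory: group_by_gene is a hypothetical helper that keeps genes in order of first appearance and keeps each gene's records in their original relative order. The sample lines are made up.

```python
import re

# Extracts the gene_id value from a GTF record.
GENE_ID_RE = re.compile(r'gene_id\s+"([^"]*)"')


def group_by_gene(lines):
    """Return lines with each gene's records adjacent, genes in first-seen order."""
    header = []   # comment lines, kept at the top
    groups = {}   # gene_id -> list of records (dicts keep insertion order)
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        match = GENE_ID_RE.search(line)
        if match is None:
            continue  # assumption: records without gene_id were already dropped
        groups.setdefault(match.group(1), []).append(line)
    return header + [rec for records in groups.values() for rec in records]


sample = [
    "#header\n",
    'chr1\tsrc\texon\t1\t10\t.\t+\t.\tgene_id "G1"; transcript_id "T1"\n',
    'chr1\tsrc\texon\t20\t30\t.\t+\t.\tgene_id "G2"; transcript_id "T2"\n',
    'chr1\tsrc\texon\t40\t50\t.\t+\t.\tgene_id "G1"; transcript_id "T1"\n',
]
grouped = group_by_gene(sample)
print("".join(grouped), end="")
```

This only makes records of the same gene adjacent; it does not sort by coordinate.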

Reference

https://kb.10xgenomics.com/hc/en-us/articles/4707448154381-Common-mkref-formatting-errors-when-building-custom-reference-from-NCBI-UCSC-or-RefSeq-genomes
