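To build a 10X single-cell transcriptome index for the axolotl (see the earlier post "Building genome indexes for other species for 10X cellranger"), the GTF file first has to be filtered. Running the command below fails with an error, shown after the command.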
cellranger-7.1.0/bin/cellranger mkgtf \
AmexT_v47-AmexG_v6.0-DD.gtf AmexT_v47-AmexG_v6.0-DD.filtered.gtf \
--attribute=gene_biotype:protein_coding \
--attribute=gene_biotype:lincRNA \
--attribute=gene_biotype:antisense \
--attribute=gene_biotype:IG_LV_gene \
--attribute=gene_biotype:IG_V_gene \
--attribute=gene_biotype:IG_V_pseudogene \
--attribute=gene_biotype:IG_D_gene \
--attribute=gene_biotype:IG_J_gene \
--attribute=gene_biotype:IG_J_pseudogene \
--attribute=gene_biotype:IG_C_gene \
--attribute=gene_biotype:IG_C_pseudogene \
--attribute=gene_biotype:TR_V_gene \
--attribute=gene_biotype:TR_V_pseudogene \
--attribute=gene_biotype:TR_D_gene \
--attribute=gene_biotype:TR_J_gene \
--attribute=gene_biotype:TR_J_pseudogene \
--attribute=gene_biotype:TR_C_gene
Property 'transcript_id' has invalid whitespace character in GTF line 3: chr10p ambMex60DD exon 313039 314183 1000 + . gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "1";
Please fix your GTF and start again.
This error means the GTF file itself is the problem. The 10X website recommends Ensembl GTF files, but the client insisted on using the GTF file he provided, so the file has to be processed until it meets the 10X requirements.
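The 10X website's requirements for the GTF file concern the gene_id, transcript_id, and gene_name attributes; in particular, transcript_ids for multiple gene_id must be converted as unique (eg: unknown_transcript_1 fields).
For this GTF file, the following processing works: in the attribute column (column 9 of the GTF, index 8 in the script), keep only gene_id, transcript_id, and gene_name, strip the redundant information from their values, and drop every line that has no gene_id.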
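The uniqueness requirement can be checked before building the reference. The following is a minimal sketch, not part of the original workflow: the function name shared_transcript_ids and the sample attribute strings are made up for illustration, and it assumes attributes are written as key "value" pairs.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]*)"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def shared_transcript_ids(attribute_strings):
    """Map each transcript_id seen under more than one gene_id to those gene_ids."""
    genes_by_transcript = defaultdict(set)
    for attrs in attribute_strings:
        gene = GENE_ID.search(attrs)
        transcript = TRANSCRIPT_ID.search(attrs)
        if gene and transcript:
            genes_by_transcript[transcript.group(1)].add(gene.group(1))
    # Only transcript_ids attached to two or more genes are a problem.
    return {t: sorted(g) for t, g in genes_by_transcript.items() if len(g) > 1}


# Hypothetical attribute strings, for illustration only.
attrs = [
    'gene_id "G1"; transcript_id "T1";',
    'gene_id "G2"; transcript_id "T1";',
    'gene_id "G2"; transcript_id "T2";',
]
conflicts = shared_transcript_ids(attrs)
print(conflicts)  # {'T1': ['G1', 'G2']}
```

For a real file, the attribute column of each data line can be passed in instead of the sample list.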
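The record below is an example of the redundant information.
In transcript_id, the prefix LOC114595135 [nr]|ZNF268 [hs]| is not needed; only AMEX60DD201000002.1 is. In gene_name, only LOC114595135 is needed.
The homolog, ORF_type, CDS, and peptide attributes all need to be removed.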
gene_id "AMEX60DD000002";
transcript_id "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1";
gene_name "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1";
homolog "XP_028581273.1";
ORF_type "Putative short";
CDS "ATGGCATTTTTGCTCGACAAAACCCAATCTGGAGGCAGTTGCCTCCAAGAACAATTGTATCCGTGTTCTGAATGTGACAAACAATTCAGTCATGAAAGACATCTAACTCAACATGAAAGAACACACATTGGAGAAAAACCTTATCAGTGTCCTGAATGTCAAAAGAAATTCATTCGGAAAAGCCAACTCAGTATACATGAGAGAATCCATACTGGAGAAAAACCGTATCAATGCTCAGAATGTCAGAAAAGATTCAGTCAAAAAAGCAATCTGACTGATCATCAGAAAAAGCACACTGGAGAAAAACCTTATCAGTGTCCTGAATGTCAAAAGAGATTTAGTCGGAAAGAGAATCTAAGGCGACATGAGCAACAACACACTGGAGAAAAACCTCATCAGTGTCCTGAATGTCAAAAGAGATTCATTTGTAAAAGCCAACTGACAATACATGAGAGGATCCACACTGGAGAAAAACCTTATCAATGCTCAGAATGTCAGAAAAGATTCAGTCAAAAAAGCAATCTGACTGATCATCAGAAAAAACACACTGGAGAAAAACCTTATCAGTGTGCGGAATGTCAAAAGAGATTCAGTCGAAAAGAGAATCTAAAGCAACATGAGCAACAACACACTGGAGACAAACCTCACCGGTGCTCAGAATGTCCAAAAAGATTCATTTGGAAAAGCCAACTGACTATACATGAGAGAAACCACACTGGAGAAAAACCTTATCAATGCTCAGAATGTCAGAAAGGATTTGGTCAAAAAGGCTGTATGACTAAACATAAAAGAAAACATACTGGAGAAAAACCTTATCAATGCTCAGAATGTCAGAAAAGATTCAGTCACAAAAGTGATCTAATTAGACATAAAAGAACACACACTGGAGAAAAACCTTATCAGTGCTCAGAATGTCAAAAGAGATTCAGTCACAAAGGCCATCTGACACAACATGGGAGAATCCACACTGGAGAAAAACCTTATCAGTGTTCAGAATGTCAGAAAAGATTCAGTCACAAAAGCAGTCTGACTGAGCATGAGAGAATCCACACTGGAGAACAACCTTATCAGTGTTCAGAATGTCAGAAAGGATTCAGTCATAAAGGCCATCTGACTGATCATCAGAGAAAACACACTGGAGAAAAACCTTATAAATGCTTGGAATGTCAAAAAGAATTCTGTCACAAAATCAGTCTGACTGAGCATGAGAGAAAACACACTGGAGAAAAACCTTATCCGTGCTCAGAATGTCAGAAAAGATTCAGTCAAAAAAGCAATCTGACTGATCATCAGAGAAAACACACTGGAGAAAAACCTTATAAATGCTCAGAATGTCCACAAGAATTCAGTCACAAAATCAGTCTGACTGTGCATGAGAGAAAACACACTGGAGAAAAACCTTATCAGTGTCCTGAATGTCAGAAGAGATTCAGTCGGCAAGATAATCTAAGGCAACATGAGAAACAACACACTGGAGAAAAACCTCACCAGTGCTCAGAATGTCCAAAAAGATTCATTTGGAAAAGCCAACTGACTATACATGAGAGAAACCACACTGGAGAAAAACCTTATCAGTGCTCAGAATGTCAGAAAGGATTCAGTCAAAAAGGCAGTATGACTAAACATAAAAGAAAACATGCTGGAGAAAAACCTTATCAATGCTCAGAATGTCAGAAAAGATTCAGTCAGAAAAGTGATCTAATTAAACATAAAAGAACACACACTGGAGAAAAACCTTATCAGTGCTCAGAATGTCAAAAAAGATTCATTCAGAAAATCAATCTGACTATACATGAGAGAATCCACACTGGAGAACAACCTTATCAGTGCTCAGAATGTCAGAAAAGATTCAGTCACAAAAGCAGTCTGACTGAGCATGAGAGAAAACACACTGGAGAAAACATGAGTGCTCTGAACGTCGGAACATTATCAGTACTGAAGAGGATCGGACTAAATATTGGGCAAACACCCTGTAGAAAAACCATCAAAGTGCTT
TGTTAG"; peptide "MAFLLDKTQSGGSCLQEQLYPCSECDKQFSHERHLTQHERTHIGEKPYQCPECQKKFIRKSQLSIHERIHTGEKPYQCSECQKRFSQKSNLTDHQKKHTGEKPYQCPECQKRFSRKENLRRHEQQHTGEKPHQCPECQKRFICKSQLTIHERIHTGEKPYQCSECQKRFSQKSNLTDHQKKHTGEKPYQCAECQKRFSRKENLKQHEQQHTGDKPHRCSECPKRFIWKSQLTIHERNHTGEKPYQCSECQKGFGQKGCMTKHKRKHTGEKPYQCSECQKRFSHKSDLIRHKRTHTGEKPYQCSECQKRFSHKGHLTQHGRIHTGEKPYQCSECQKRFSHKSSLTEHERIHTGEQPYQCSECQKGFSHKGHLTDHQRKHTGEKPYKCLECQKEFCHKISLTEHERKHTGEKPYPCSECQKRFSQKSNLTDHQRKHTGEKPYKCSECPQEFSHKISLTVHERKHTGEKPYQCPECQKRFSRQDNLRQHEKQHTGEKPHQCSECPKRFIWKSQLTIHERNHTGEKPYQCSECQKGFSQKGSMTKHKRKHAGEKPYQCSECQKRFSQKSDLIKHKRTHTGEKPYQCSECQKRFIQKINLTIHERIHTGEQPYQCSECQKRFSHKSSLTEHERKHTGENMSALNVGTLSVLKRIGLNIGQTPCRKTIKVLC*";
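The cleanup described above can be sketched with the standard library alone, independent of bioquest. This is an illustration under stated assumptions rather than the author's script: clean_attributes is a hypothetical helper, attributes are assumed to be key "value" pairs, and the CDS and peptide values in the sample are abbreviated placeholders.

```python
import re


def clean_attributes(attributes):
    """Keep gene_id, a bare transcript_id and a bare gene_name.

    Returns None when the record has no gene_id.
    """
    fields = dict(re.findall(r'(\w+) "([^"]*)"', attributes))
    if not fields.get("gene_id"):
        return None
    kept = [f'gene_id "{fields["gene_id"]}"']
    # Keep only the AMEX60DD... identifier from transcript_id.
    match = re.search(r"AMEX60DD.*\.\d+", fields.get("transcript_id", ""))
    if match:
        kept.append(f'transcript_id "{match.group()}"')
    if "gene_name" in fields:
        # Drop whitespace, the [..] annotations and everything after a '|'.
        name = re.sub(r"\s|\[.*\].*|\|.*", "", fields["gene_name"])
        if name:
            kept.append(f'gene_name "{name}"')
    return "; ".join(kept) + ";"


example = (
    'gene_id "AMEX60DD000002"; '
    'transcript_id "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"; '
    'gene_name "LOC114595135 [nr]|ZNF268 [hs]|AMEX60DD201000002.1"; '
    'homolog "XP_028581273.1"; ORF_type "Putative short"; '
    'CDS "ATGGCATTT"; peptide "MAF";'  # sequences abbreviated for the example
)
cleaned = clean_attributes(example)
print(cleaned)
# gene_id "AMEX60DD000002"; transcript_id "AMEX60DD201000002.1"; gene_name "LOC114595135";
```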
The Python script is below. bioquest is a collection of data-processing functions I wrote; it can be found at https://jihulab.com/NshimaBio/bioquest
import re

import bioquest as bq

in_file = "AmexT_v47-AmexG_v6.0-DD.gtf"
out_file = "AmexT_v47-AmexG_v6.0-DD.10X.gtf"

with open(in_file, 'r') as ifile, open(out_file, 'w') as ofile:
    for line in ifile:
        # Pass header/comment lines through unchanged.
        if line.startswith('#'):
            ofile.write(line)
            continue
        table = line.rstrip('\n').split('\t')
        # Column 9 (index 8) holds the attributes, separated by ';'.
        feature = table[8].split(';')
        new_feature = list()
        # Keep only the three attributes of interest.
        feature = bq.st.greps(string=feature, pattern="(gene_id)|(transcript_id)|(gene_name)")
        gene_id = bq.st.greps(string=feature, pattern="gene_id")
        if gene_id:
            new_feature.append(gene_id[0].strip())
        else:
            # Skip lines without a gene_id.
            continue
        transcript_id = bq.st.greps(string=feature, pattern="transcript_id")
        if transcript_id:
            # Keep only the bare ID, e.g. AMEX60DD301000001.1.
            match = re.search(r'AMEX60DD.*\.\d+', transcript_id[0])
            if match:
                new_feature.append(f'transcript_id "{match.group()}"')
        gene_name = bq.st.greps(string=feature, pattern="gene_name")
        if gene_name:
            # Strip whitespace, quotes, the key itself, and the [..] and | annotations.
            gene_name = bq.st.remove(string=gene_name[0], pattern=r'(\s)|(\[.*\].*)|(gene_name)|(")|(\|.*)')
            new_feature.append(f'gene_name "{gene_name}"')
        # GTF attribute lists end with a semicolon.
        table[8] = "; ".join(new_feature) + ";"
        ofile.write("\t".join(table) + "\n")
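Before rerunning cellranger, it can help to check the rewritten file for the problem that triggered the original error. The following is a rough sketch: check_gtf_line is a hypothetical helper, the sample lines are built for illustration, and the checks cover only the column count, the presence of gene_id, and whitespace inside attribute values, not everything cellranger validates.

```python
import re


def check_gtf_line(line):
    """Return a list of problems found in one GTF data line (empty if none)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        return [f"expected 9 columns, found {len(columns)}"]
    problems = []
    fields = dict(re.findall(r'(\w+) "([^"]*)"', columns[8]))
    if not fields.get("gene_id"):
        problems.append("missing gene_id")
    for key, value in fields.items():
        if re.search(r"\s", value):
            problems.append(f"whitespace in {key}")
    return problems


# Sample lines for illustration: the first eight columns come from the error message above.
prefix = ["chr10p", "ambMex60DD", "exon", "313039", "314183", "1000", "+", "."]
good_line = "\t".join(prefix + ['gene_id "AMEX60DD000001"; transcript_id "AMEX60DD301000001.1";'])
bad_line = "\t".join(prefix + ['gene_id "AMEX60DD000001"; transcript_id "ZFP37 [nr]|ZNF568 [hs]|AMEX60DD301000001.1"; exon_number "1";'])

good_result = check_gtf_line(good_line)
bad_result = check_gtf_line(bad_line)
print(good_result)
print(bad_result)
```

To check the real output, loop over the lines of AmexT_v47-AmexG_v6.0-DD.10X.gtf, skip lines starting with '#', and report every line for which check_gtf_line returns a non-empty list.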
https://kb.10xgenomics.com/hc/en-us/articles/4707448154381-Common-mkref-formatting-errors-when-building-custom-reference-from-NCBI-UCSC-or-RefSeq-genomes