idx chr size_bp size_approx
1 chr1 248956422 249M
2 chr2 242193529 242M
3 chr3 198295559 198M
4 chr4 190214555 190M
5 chr5 181538259 182M
6 chr6 170805979 171M
7 chr7 159345973 159M
8 chrX 156040895 156M
9 chr8 145138636 145M
10 chr9 138394717 138M
11 chr11 135086622 135M
12 chr10 133797422 134M
13 chr12 133275309 133M
14 chr13 114364328 114M
15 chr14 107043718 107M
16 chr15 101991189 102M
17 chr16 90338345 90M
18 chr17 83257441 83M
19 chr18 80373285 80M
20 chr20 64444167 64M
21 chr19 58617616 59M
22 chrY 57227415 57M
23 chr22 50818468 51M
24 chr21 46709983 47M
25 SUM 3088269832 3088M
# The mitochondrial genome (chrM) is not included above; it is short, only 16569 bp (about 16 kbp).
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
gunzip hg38.fa.gz
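The per-chromosome lengths listed at the top can in principle be recomputed from the FASTA file itself. The following is a minimal sketch, assuming a plain (uncompressed) FASTA; `fasta_lengths` is a hypothetical helper name, and the toy input only stands in for the real genome so that the snippet runs quickly.

```shell
# Print "<name> <length>" for every record in a FASTA file.
# fasta_lengths is a hypothetical helper name, not part of any tool.
fasta_lengths() {
    awk '
        /^>/ {                                # a header line starts a new record
            if (name != "") print name, len   # report the previous record
            name = substr($1, 2); len = 0     # drop ">" and any description text
            next
        }
        { len += length($0) }                 # sequence line: add its length
        END { if (name != "") print name, len }
    ' "$1"
}

# Toy input so the sketch runs without the ~3 GB genome.
demo_dir=$(mktemp -d)
printf '>chrA\nACGTACGT\nACG\n>chrB_demo_random\nNNNNAC\n' > "$demo_dir/demo.fa"
fasta_lengths "$demo_dir/demo.fa"

# On the real file (one slow pass over hg38.fa), keeping only names without "_":
# fasta_lengths hg38.fa | grep -v '_' | sort -k2,2nr
```

On hg38.fa the commented command should list chr1 to chr22, chrX, chrY and chrM sorted by length, since every other sequence name in the UCSC file contains an underscore.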
# Extract the sequence IDs (FASTA header lines)
grep '^>' hg38.fa > chr.id
wc -l chr.id
#455 chr.id
head chr.id
####
>chr1
>chr10
>chr11
>chr11_KI270721v1_random
>chr12
>chr13
>chr14
>chr14_GL000009v2_random
>chr14_GL000225v1_random
>chr14_KI270722v1_random
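Before going further, it helps to understand how raw sequencing reads are assembled into chromosome sequences. Assembly passes through two intermediate stages, contigs and scaffolds, as shown in the figure below. Contigs are sequences stitched together from overlapping reads (a few kbp long) and contain no N bases. Scaffolds join contigs further, mainly using read-pair relationships, and therefore contain N bases that fill the gaps between contigs (a few hundred kbp long). Finally, scaffolds are joined into chromosome sequences.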
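Because the gaps between contigs are written as N, counting N bases per sequence gives a rough view of how much of each sequence is gap rather than assembled sequence. This is a small sketch under that assumption; `n_content` is a hypothetical helper name and the two toy records are invented for illustration.

```shell
# Print "<name> <total_bases> <N_bases>" for every FASTA record.
# n_content is a hypothetical helper name used only for this sketch.
n_content() {
    awk '
        function emit() { if (name != "") print name, total, n }
        /^>/ { emit(); name = substr($1, 2); total = 0; n = 0; next }
        {
            total += length($0)       # count every base on the line first
            n += gsub(/[Nn]/, "")     # gsub returns how many N it replaced
        }
        END { emit() }
    ' "$1"
}

# Toy records: a scaffold-like sequence with two N gaps and a gap-free contig.
demo_dir=$(mktemp -d)
printf '>scaffold_demo\nACGTNNNNAC\nGTNNACGT\n>contig_demo\nACGTACGT\n' > "$demo_dir/demo.fa"
n_content "$demo_dir/demo.fa"

# On the real genome (slow): n_content hg38.fa
```

For the toy input this reports 6 N bases out of 18 for the scaffold-like record and 0 out of 8 for the contig-like one.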
grep 'random' chr.id > chr.random
wc -l chr.random
#42 chr.random
head chr.random
###
>chr11_KI270721v1_random
>chr14_GL000009v2_random
>chr14_GL000225v1_random
>chr14_KI270722v1_random
>chr14_GL000194v1_random
>chr14_KI270723v1_random
>chr14_KI270724v1_random
>chr14_KI270725v1_random
>chr14_KI270726v1_random
grep 'chrUn' chr.id > chr.chrUn
wc -l chr.chrUn
#127 chr.chrUn
head chr.chrUn
###
>chrUn_KI270302v1
>chrUn_KI270304v1
>chrUn_KI270303v1
>chrUn_KI270305v1
>chrUn_KI270322v1
>chrUn_KI270320v1
>chrUn_KI270310v1
>chrUn_KI270316v1
>chrUn_KI270315v1
>chrUn_KI270312v1
grep 'alt' chr.id > chr.alt
wc -l chr.alt
#261 chr.alt
head chr.alt
###
>chr1_KI270762v1_alt
>chr1_KI270766v1_alt
>chr1_KI270760v1_alt
>chr1_KI270765v1_alt
>chr1_GL383518v1_alt
>chr1_GL383519v1_alt
>chr1_GL383520v2_alt
>chr1_KI270764v1_alt
>chr1_KI270763v1_alt
>chr1_KI270759v1_alt
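Note: the sequence names shown above follow UCSC's hg38 naming, which differs slightly from the names used in GRCh38 releases, but the same categories of sequence appear in both.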
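The three grep passes above can also be folded into a single pass that tallies every category at once. In UCSC naming, `_random` marks sequences assigned to a chromosome but not placed on it, `chrUn_` marks sequences not assigned to any chromosome, and `_alt` marks alternate loci. The sketch below assumes those suffix and prefix conventions; `classify_ids` is a hypothetical helper name, and the `fix` category is included only because the later gene filter mentions it.

```shell
# Count sequence names per category, based on UCSC-style suffixes and prefixes.
# classify_ids is a hypothetical helper name.
classify_ids() {
    awk '
        { sub(/^>/, "") }                       # accept names with or without ">"
        /_alt$/    { count["alt"]++;    next }
        /_fix$/    { count["fix"]++;    next }
        /_random$/ { count["random"]++; next }
        /^chrUn_/  { count["chrUn"]++;  next }
                   { count["primary"]++ }       # everything else: chr1..22, X, Y, M
        END {
            n = split("primary random chrUn alt fix", order, " ")
            for (i = 1; i <= n; i++)
                print order[i], count[order[i]] + 0
        }
    ' "$1"
}

# Toy ID list built from names that appear in the listings above.
demo_dir=$(mktemp -d)
printf '%s\n' '>chr1' '>chr11_KI270721v1_random' '>chr14_GL000009v2_random' \
    '>chrUn_KI270302v1' '>chr1_KI270762v1_alt' '>chrM' > "$demo_dir/chr.id"
classify_ids "$demo_dir/chr.id"

# On the real list: classify_ids chr.id
```

With the counts reported above, 42 + 127 + 261 = 430, which leaves 455 - 430 = 25 primary sequences (chr1 to chr22, chrX, chrY and chrM), so running this on the real chr.id should show `primary 25`.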
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.refGene.gtf.gz
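gunzip hg38.refGene.gtf.gz
# Keep unique (chromosome, gene_id) pairs, dropping alt, random and fix sequences
awk '{print $1, $10}' hg38.refGene.gtf | sort -k 2 | uniq | grep -v alt | grep -v random | grep -v fix | sort -k 1 > chr.gene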
cut -d' ' -f 1 chr.gene | uniq -c
###
1113 chr10
1676 chr11
1392 chr12
632 chr13
946 chr14
1010 chr15
1146 chr16
1574 chr17
434 chr18
1791 chr19
2832 chr1
780 chr20
414 chr21
644 chr22
1817 chr2
1563 chr3
1088 chr4
1313 chr5
1453 chr6
1341 chr7
1029 chr8
1114 chr9
1 chrM
1157 chrX
143 chrY
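The same per-chromosome gene count can be sketched in a single awk pass that reads the `gene_id` attribute from column 9 instead of relying on whitespace splitting. This is an alternative formulation, not the pipeline used above: `genes_per_chr` is a hypothetical helper name, the gene names in the toy file are invented, and the alt/random/fix filter is applied to the chromosome column only rather than to the whole line.

```shell
# Count distinct gene_id values per chromosome in a GTF file.
# genes_per_chr is a hypothetical helper name.
genes_per_chr() {
    awk -F '\t' '
        /^#/ { next }                              # skip comment lines
        $1 ~ /_(alt|random|fix)$/ { next }         # skip alt, random and fix sequences
        match($9, /gene_id "[^"]+"/) {
            gene = substr($9, RSTART + 9, RLENGTH - 10)   # text between the quotes
            if (!seen[$1 SUBSEP gene]++) count[$1]++      # count each pair once
        }
        END { for (chr in count) print chr, count[chr] }
    ' "$1" | LC_ALL=C sort -k1,1
}

# Toy GTF with made-up gene names: GENE_A appears twice on chr1.
demo_dir=$(mktemp -d)
{
    printf 'chr1\trefGene\texon\t1\t9\t.\t+\t.\tgene_id "GENE_A"; transcript_id "T1";\n'
    printf 'chr1\trefGene\texon\t20\t29\t.\t+\t.\tgene_id "GENE_A"; transcript_id "T2";\n'
    printf 'chr1\trefGene\texon\t40\t49\t.\t-\t.\tgene_id "GENE_B"; transcript_id "T3";\n'
    printf 'chr2\trefGene\texon\t1\t9\t.\t+\t.\tgene_id "GENE_C"; transcript_id "T4";\n'
    printf 'chr1_KI270762v1_alt\trefGene\texon\t1\t9\t.\t+\t.\tgene_id "GENE_D"; transcript_id "T5";\n'
} > "$demo_dir/demo.gtf"
genes_per_chr "$demo_dir/demo.gtf"

# On the real annotation: genes_per_chr hg38.refGene.gtf
```

As in the pipeline above, a gene that is annotated on two chromosomes is counted once on each of them.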
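The genome builds in common use today are GRCh38/GRCh37 and hg38/hg19. The former can be downloaded from NCBI or Ensembl, the latter from the UCSC website. As shown in the figure below, GRCh38 can be regarded as equivalent to hg38, and GRCh37 as equivalent to hg19.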
wget -c ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/annotation/GRCh38_latest/refseq_identifiers/GRCh38_latest_genomic.fna.gz
wget -c http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz
wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
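To see how the three sources name their sequences, the header lines can be read straight from the compressed downloads without unpacking them to disk. A minimal sketch follows; `peek_headers` is a hypothetical helper name and the toy file only stands in for the real downloads. Ensembl names the primary chromosomes without the `chr` prefix (1, 2, ..., X, Y, MT) and RefSeq uses accessions such as NC_000001.11, so the first few headers of each file are usually enough to tell the conventions apart.

```shell
# Print the first N header lines (default 5) of a gzip-compressed FASTA file.
# peek_headers is a hypothetical helper name.
peek_headers() {
    gzip -dc "$1" | grep '^>' | head -n "${2:-5}"
}

# Toy compressed FASTA so the sketch runs without the large downloads.
demo_dir=$(mktemp -d)
printf '>chr1\nACGT\n>chr2\nGGCC\n>chrM\nAT\n' | gzip -c > "$demo_dir/demo.fa.gz"
peek_headers "$demo_dir/demo.fa.gz" 2

# On the real downloads:
# peek_headers GRCh38_latest_genomic.fna.gz
# peek_headers Homo_sapiens.GRCh38.dna.toplevel.fa.gz
# peek_headers hg38.fa.gz
```

To count all sequences in a compressed file instead, `gzip -dc FILE | grep -c '^>'` does the same job as the `wc -l chr.id` step used earlier.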