GO annotation with InterProScan

Introduction to InterProScan

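InterPro is a protein function prediction resource that integrates multiple member databases. It can be used to annotate which functional domains and sites a protein contains and which family it belongs to.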

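If you have a new nucleotide or protein sequence and want to know its function, you can use InterProScan to run an integrated search of that sequence against the InterPro member databases.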

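Software size: 8.9 GB
Linux only
The web version analyzes one protein sequence at a time, and the sequence must be shorter than 40,000 amino acids.
Sequences must be in FASTA format
You can choose which InterPro member databases to run
Several output formats are available [1]: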

TSV: A simple tab-delimited file format
XML: The “IMPACT” XML format
JSON: Full output of results in JSON format
GFF3: The GFF 3.0 format
HTML: An HTML representation of the protein matches
SVG: A Scalable Vector Graphics representation of the protein matches
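
The input constraints above (FASTA format, and for the web version a length below 40,000 residues) can be checked before submitting anything. Here is a minimal sketch using awk. The function name fasta_summary and its optional second argument (the length limit) are my own choices for illustration and are not part of InterProScan.

```shell
# Print one line per FASTA record: id, sequence length, and "ok" or "too_long".
# $1 = FASTA file, $2 = length limit (assumed default: 40000, the web limit)
fasta_summary() {
    awk -v max="${2:-40000}" '
        function report() {
            printf "%s\t%d\t%s\n", id, len, (len < max ? "ok" : "too_long")
        }
        /^>/ {                       # header line starts a new record
            if (id != "") report()
            id = substr($1, 2)       # id = first word after ">"
            len = 0
            next
        }
        {                            # sequence line: count residues only
            gsub(/[ \t\r]/, "")
            len += length($0)
        }
        END { if (id != "") report() }
    ' "$1"
}
```

For example, fasta_summary proteins.fa lists every record, and piping the result through grep too_long shows only the sequences that the web version would reject.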

Installation requirements [2]

64-bit Linux
Perl 5 (default on most Linux distributions)
Python 3 (InterProScan 5.30-69.0 onwards)
Oracle’s Java JDK/JRE version 8 (InterProScan 5.17-56.0 onwards)
Environment variables set

$JAVA_HOME should point to the location of the JVM
$JAVA_HOME/bin should be added to the $PATH
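
A quick way to check these requirements on a machine is sketched below. The helper names check_tool and check_java_home are hypothetical. The script only tests that the programs are on the PATH and that $JAVA_HOME/bin/java exists; it does not verify version numbers.

```shell
# Report whether a command can be found on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: MISSING"
    fi
}

# Report whether JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
    if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
        echo "JAVA_HOME: ok"
    else
        echo "JAVA_HOME: not set or no bin/java"
    fi
}

for tool in perl python3 java; do
    check_tool "$tool"
done
check_java_home
```

If java is found, running java -version afterwards shows whether it is version 8 as required.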

Installation and usage

Downloading and installing the core InterProScan software

# Download
cd ~/src
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.30-69.0/interproscan-5.30-69.0-64-bit.tar.gz
# The file is large, so verify that the download is complete:
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.30-69.0/interproscan-5.30-69.0-64-bit.tar.gz.md5
md5sum -c interproscan-5.30-69.0-64-bit.tar.gz.md5
# This should print:
# interproscan-5.30-69.0-64-bit.tar.gz: OK
# Otherwise, download the file again
# Extract (p = preserve file permissions)
tar -pxvzf interproscan-5.30-69.0-64-bit.tar.gz
# Add interproscan to PATH
echo 'export PATH=~/src/interproscan-5.30-69.0:$PATH' >> ~/.bashrc
source ~/.bashrc
# Run it without arguments to see the available options
interproscan.sh

Analyses supported by InterProScan

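InterProScan supports the following 18 analyses [3]. Some are bundled and work out of the box; others depend on licensed third-party code and data. To run those, you need to obtain a license from the provider and configure your local InterProScan installation to use them. For how to activate them, see the official guide: Activating Licensed Analyses.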

Analysis	Usable directly in InterProScan	Description
CDD	√	Prediction of CDD domains in Proteins
COILS	√	Prediction of Coiled Coil Regions in Proteins
Gene3D	√	Structural assignment for whole genes and genomes using the CATH domain structure database
HAMAP	√	High-quality Automated and Manual Annotation of Microbial Proteomes
MOBIDB	√	Prediction of disordered domains Regions in Proteins
PANTHER	√ (requires separate installation)	The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis.
Pfam	√	A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
PIRSF	√	The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
PRINTS	√	A fingerprint is a group of conserved motifs used to characterise a protein family
ProDom	√	ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
PROSITE (both the Profiles and Patterns analyses)	√	PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
SFLD	√	SFLDs are protein families based on Hidden Markov Models or HMMs
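SMART (licensed components)	√	SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs. The default installation is the unlicensed version, whose analysis uses simplified post-processing that includes an E-value filter, so results may differ from the fully licensed version.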
SUPERFAMILY	√	SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
TIGRFAMs	√	TIGRFAMs are protein families based on Hidden Markov Models or HMMs
Phobius (licensed software)		A combined transmembrane topology and signal peptide predictor
SignalP		SignalP server predicts the presence and location of signal peptide cleavage sites in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes. The method incorporates a prediction of cleavage sites and a signal peptide/non-signal peptide prediction based on a combination of several artificial neural networks.
TMHMM		Prediction of transmembrane helices in proteins

Of these, PANTHER must be installed separately.

Installing the PANTHER models

About PANTHER

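The PANTHER (Protein ANalysis THrough Evolutionary Relationships) classification system is designed to classify proteins (and their genes). Proteins are classified by [4]:

Family and subfamily: a family is a group of evolutionarily related proteins; a subfamily is a group of related proteins that also share the same function
Molecular function: the function of the protein by itself, or together with directly interacting proteins, at the biochemical level, e.g. a protein kinase
Biological process: the function of the protein in the context of a larger network of interacting proteins that accomplishes a process at the level of the cell or tissue, e.g. mitosis
Pathway: similar to a biological process, but a pathway also explicitly specifies the relationships between the interacting molecules.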

Run the following command:

interproscan.sh

In the output, PANTHER is listed under Deactivated analyses:

Available analyses:
TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
SFLD (3) : SFLDs are protein families based on Hidden Markov Models or HMMs
SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
Gene3D (4.2.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
Hamap (2018_03) : High-quality Automated and Manual Annotation of Microbial Proteomes
Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
ProSiteProfiles (2018_02) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
CDD (3.16) : Prediction of CDD domains in Proteins
PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family
ProSitePatterns (2018_02) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
MobiDBLite (1.5) : Prediction of disordered domains Regions in Proteins
PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
Deactivated analyses:
Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
PANTHER (12.0) : Analysis Panther is deactivated, because the resources expected at the following paths do not exist: data/panther/12.0/panther.hmm, data/panther/12.0/names.tab
SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp

So the PANTHER data has to be installed separately.

The PANTHER data must be downloaded into [InterProScan5 home]/data/. The download is 10.5 GB.

cd ~/src/interproscan-5.30-69.0/data/
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-12.0.tar.gz
# Verify that the download is complete
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-12.0.tar.gz.md5
md5sum -c panther-data-12.0.tar.gz.md5
# This should print:
# panther-data-12.0.tar.gz: OK
# Otherwise, download the file again

Install the PANTHER data [5]

tar -pxvzf panther-data-12.0.tar.gz
# That's it, the installation is done
# The archive is large, so delete it after extracting to free up disk space
rm panther-data-12.0.tar.gz

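If you want to install PANTHER somewhere else, then after extracting the archive you need to edit the interproscan.properties file in your InterProScan directory and point it at the PANTHER location: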

panther.models.dir=PATH_TO/panther/12.0/
panther.hmm.path=PATH_TO/panther/12.0/panther.hmm
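
Before re-running interproscan.sh, it can help to confirm that the two resources named in the deactivation message (panther.hmm and names.tab) are where InterProScan expects them. A small sketch follows. check_panther is a hypothetical helper, and the directory you pass in is whatever location you chose above.

```shell
# Check that the PANTHER resources named in the deactivation message exist.
# $1 = PANTHER data directory, e.g. data/panther/12.0 under the InterProScan home
check_panther() {
    dir="$1"
    status=0
    for f in panther.hmm names.tab; do
        if [ -e "$dir/$f" ]; then
            echo "ok: $dir/$f"
        else
            echo "missing: $dir/$f"
            status=1
        fi
    done
    return "$status"
}
```

With the default layout used in this post, the call would be check_panther ~/src/interproscan-5.30-69.0/data/panther/12.0, which returns a non-zero exit status if either resource is missing.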

Now let's check again:

interproscan.sh

PANTHER now appears under Available analyses:

Available analyses:
TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
SFLD (3) : SFLDs are protein families based on Hidden Markov Models or HMMs
SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
PANTHER (12.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
Gene3D (4.2.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
Hamap (2018_03) : High-quality Automated and Manual Annotation of Microbial Proteomes
Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
ProSiteProfiles (2018_02) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
CDD (3.16) : Prediction of CDD domains in Proteins
PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family
ProSitePatterns (2018_02) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
MobiDBLite (1.5) : Prediction of disordered domains Regions in Proteins
PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.
Deactivated analyses:
SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
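
Reading through this listing by eye gets tedious, so here is a rough sketch that reports which section a given analysis falls under. It assumes the listing has one analysis per line with the name as the first word, under the two headings shown above. The function name analysis_status is made up for this example.

```shell
# Read the interproscan.sh listing on stdin and print "available",
# "deactivated" or "unknown" for the analysis named in $1.
analysis_status() {
    awk -v name="$1" '
        /Available analyses:/   { section = "available";   next }
        /Deactivated analyses:/ { section = "deactivated"; next }
        section != "" && $1 == name { print section; found = 1; exit }
        END { if (!found) print "unknown" }
    '
}

# Possible usage (2>&1 because I have not checked which stream the listing uses):
#   interproscan.sh 2>&1 | analysis_status PANTHER
```

The name has to match the listing exactly, so it is PANTHER and Hamap here, not Panther or HAMAP.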

[Optional] Using the local pre-calculated match lookup service

Introduction

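The InterProScan match lookup service stores pre-calculated InterProScan results for the sequences in the InterPro database. When a known sequence is submitted to InterProScan, its result is fetched from the lookup service right away, which reduces the computational load and improves performance. Sequences that are not in the lookup service are analyzed by InterProScan as requested by the user.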

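If you want this feature:

Your server must be able to reach http://www.ebi.ac.uk over the internet.
If the server has no external network access, you can download a local copy of the InterProScan 5 lookup service. For installation details, see InterProScan 5 user's guide · Local Lookup Service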

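If you do not want this feature:

Turn it off on the command line with -dp
Or edit interproscan.properties and comment out the following line by adding # at its start (you can also simply delete the line):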
precalculated.match.lookup.service.url=http://www.ebi.ac.uk/interpro/match-lookup
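
The edit can also be scripted. The sketch below comments out that one property and keeps a backup copy next to the file. disable_lookup is a hypothetical helper name, and the path you pass is wherever your interproscan.properties lives.

```shell
# Comment out the lookup service URL in interproscan.properties.
# $1 = path to interproscan.properties; a backup is kept as "$1.bak"
disable_lookup() {
    props="$1"
    cp "$props" "$props.bak" || return 1
    # "&" re-inserts the matched property name, so only a leading "#" is added
    sed 's|^precalculated\.match\.lookup\.service\.url=|#&|' "$props.bak" > "$props"
}

# Possible usage, assuming the install location used earlier in this post:
#   disable_lookup ~/src/interproscan-5.30-69.0/interproscan.properties
```

Running it a second time changes nothing, because the pattern only matches when the property sits at the very start of a line.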

Using the software

Example run: GO annotation with the PANTHER database

interproscan.sh -appl PANTHER -f TSV,GFF3 -t n -i transcript.fa -cpu 16 -b transcript -goterms -iprlookup -pa -dp

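-appl selects which analyses to run; the more you select, the slower the run. 洲更, in the post 《全基因组基因功能注释》, says that Pfam alone is enough. Here I use PANTHER.
-f output file formats; separate several formats with commas. Supported formats are TSV, XML, JSON, GFF3, HTML and SVG. The default for protein input is TSV, XML and GFF3; for nucleotide input it is XML and GFF3.
-t type of the input sequences. The default is amino acid sequences; -t n means nucleotide sequences
-i path to the FASTA file.
-cpu number of cores to use. Set it according to your machine.
-b base name of the output files. The file extensions are added automatically. By default the path/name of the input file is used.
-goterms switches on lookup of the corresponding GO annotation
-iprlookup maps the member database matches to the InterPro entries they are integrated into, and adds the corresponding InterPro annotation to the TSV and GFF3 output
-pa switches on pathway annotation
-dp disables the precalculated match lookup service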
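
Once the run finishes, the GO terms can be pulled out of the TSV file. The sketch below assumes the TSV layout described on the output formats page [1], where column 1 is the protein accession and column 14 holds the GO terms separated by "|" (this column is only present with -goterms). Check that against your InterProScan version. The function name extract_go is my own.

```shell
# Print unique "accession<TAB>GO term" pairs from an InterProScan TSV file.
# Assumes column 1 = protein accession and column 14 = "|"-separated GO terms.
extract_go() {
    awk -F'\t' '
        NF >= 14 && $14 != "" && $14 != "-" {
            n = split($14, terms, "[|]")      # bracket form keeps "|" literal
            for (i = 1; i <= n; i++) print $1 "\t" terms[i]
        }
    ' "$1" | LC_ALL=C sort -u
}

# Possible usage with the example above (-b transcript should give transcript.tsv):
#   extract_go transcript.tsv > transcript.go.tsv
```

Rows without a GO column, or with an empty one, are skipped, and duplicates from multiple matches on the same sequence are collapsed by sort -u.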

References:

[1] Output Formats: https://github.com/ebi-pf-team/interproscan/wiki/OutputFormats
[2] InterProScan 5 user's guide · Installation Requirements: https://github.com/ebi-pf-team/interproscan/wiki/InstallationRequirements
[3] InterProScan 5 user's guide · Running InterProScan 5: https://github.com/ebi-pf-team/interproscan/wiki/HowToRun
[4] ABOUT PANTHER: http://www.pantherdb.org/about.jsp
[5] InterProScan 5 user's guide · How to download a copy: https://github.com/ebi-pf-team/interproscan/wiki/HowToDownload
