bioRxiv Bioinformatics Highlights, May 2019

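As of last month, it has been a full year since 生信人 launched its monthly bioRxiv bioinformatics highlights column. About a year ago, this column covered an article in Nature's World View section by Tom Sheldon, a science journalist based in London. Sheldon argued that because preprints are not peer reviewed, they may contain more errors than formally published papers, and that those errors can be spread and amplified through preprints. He therefore called loudly on the academic community to take steps to tighten restrictions on how preprints are released.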

A year on, last month, Nature officially vindicated preprints. On May 15, Nature published an Editorial formally declaring that Nature and its sister journals support preprints!

 

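In fact, Nature commented on preprints as early as 1997, although at that time preprints were mainly confined to physics. Today, Nature's editors believe it is time to express support for preprints, a form of publication that combines claiming priority of discovery, receiving feedback from peers, and quickly demonstrating research progress.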

By making early research findings accessible quickly and easily, preprints allow researchers to claim priority of discovery, receive community input and demonstrate evidence of progress for funders and others.

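The editorial also states that Nature is updating its position on two questions that were previously somewhat ambiguous. First, authors may choose the license for their preprint, and that choice will not affect peer review; note, however, that the license chosen may limit how the research can be shared and disseminated. Second, authors may discuss their preprint findings with the media, but should also stress that the results have not been peer reviewed. Nature's influence is beyond doubt. Still, it should not be forgotten that the long-established journal Genetics was probably the first biology journal to publicly declare its support for preprints.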

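More than five years on, the preprint community, both users and servers, has been growing rapidly and flourishing. This is evident from a survey published last month in eLife covering the more than 37,000 preprints posted to bioRxiv since its launch [1,2]. Preprints owe their present state to the efforts of countless pioneers, and equally to the voices of their critics. Their future will require a joint effort from the whole academic community.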

1. [Bioinformatics] It's finally here: Google brings deep learning to gene function annotation, claiming large gains in prediction accuracy and speed

Using Deep Learning to Annotate the Protein Universe

Understanding the relationship between amino acid sequence and protein function is a long-standing problem in molecular biology with far-reaching scientific implications. Despite six decades of progress, state-of-the-art techniques cannot annotate 1/3 of microbial protein sequences, hampering our ability to exploit sequences collected from diverse organisms. To address this, we report a deep learning model that learns the relationship between unaligned amino acid sequences and their functional classification across all 17929 families of the Pfam database. Using the Pfam seed sequences we establish a rigorous benchmark assessment and find a dilated convolutional model that reduces the error of both BLASTp and pHMMs by a factor of nine. Using 80% of the full Pfam database we train a protein family predictor that is more accurate and over 200 times faster than BLASTp, while learning sequence features it was not trained on such as structural disorder and transmembrane helices. Our model co-locates sequences from unseen families in embedding space, allowing sequences from novel families to be accurately annotated. These results suggest deep learning models will be a core component of future protein function prediction tools.
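To make the phrase "dilated convolutional model" concrete, here is a minimal NumPy sketch of a single dilated 1D convolution over a one-hot encoded protein sequence. It is not the architecture from the preprint: the 20-letter alphabet, the function names, and the layer shapes are illustrative assumptions, and a real model would stack many such layers with learned weights.

```python
import numpy as np

# Assumption: only the 20 standard residues are encoded.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(sequence):
    """Encode an unaligned protein sequence as a (length, 20) matrix."""
    x = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        x[pos, AA_INDEX[aa]] = 1.0
    return x


def dilated_conv1d(x, kernel, dilation=1):
    """Zero-padded 1D convolution whose taps are `dilation` positions apart.

    x:      (length, in_channels)
    kernel: (taps, in_channels, out_channels), taps must be odd
    Like deep learning frameworks, this is technically a cross-correlation.
    """
    taps = kernel.shape[0]
    if taps % 2 == 0:
        raise ValueError("use an odd number of taps for symmetric padding")
    pad = dilation * (taps - 1) // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))  # zeros on both ends
    out = np.zeros((x.shape[0], kernel.shape[2]))
    for t in range(taps):
        start = t * dilation
        out += padded[start:start + x.shape[0]] @ kernel[t]
    return out


def receptive_field(taps, dilations):
    """Residues seen by one output position after stacking the layers."""
    return 1 + (taps - 1) * sum(dilations)
```

Growing the dilation from layer to layer (for example 1, 2, 4, ...) widens the receptive field quickly without adding parameters, which is why this kind of layer suits long, unaligned sequences.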

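BTW: This preprint drew wide attention online as soon as it was posted, including a good deal of skepticism. Professor Lars Juhl Jensen of the University of Copenhagen, Denmark, pointed out that in selecting the test set, the Google team overlooked the evolutionary relatedness of proteins belonging to the same family.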

Sean Eddy, the author of HMMER, expressed a similar view and added that the paper's description of his software's speed is seriously inaccurate.

2. [Bioinformatics] Ra, a de novo assembler aimed at large genomes

Yet another de novo genome assembler (CC-BY-NC 4.0)

Advances in sequencing technologies have pushed the limits of genome assemblies beyond imagination. The sheer amount of long read data that is being generated enables the assembly for even the largest and most complex organism for which efficient algorithms are needed. We present a new tool, called Ra, for de novo genome assembly of long uncorrected reads. It is a fast and memory friendly assembler based on sequence classification and assembly graphs, developed with large genomes in mind. It is freely available at https://github.com/lbcbsci/ra.

3. [Bioinformatics] John Storey (Princeton University): how deep must an RNA-seq differential expression experiment be sequenced to reach adequate statistical power?

Determining sufficient sequencing depth in RNA-Seq differential expression studies (CC-BY-ND 4.0)

RNA-Seq studies require a sufficient read depth to detect biologically important genes. Sequencing below this threshold will reduce statistical power while sequencing above will provide only marginal improvements in power and incur unnecessary sequencing costs. Although existing methodologies can help assess whether there is sufficient read depth, they are unable to guide how many additional reads should be sequenced to reach this threshold. We provide a new method called superSeq that models the relationship between statistical power and read depth. We apply the superSeq framework to 393 RNA-Seq experiments (1,021 total contrasts) in the Expression Atlas and find the model accurately predicts the increase in statistical power gained by increasing the read depth. Based on our analysis, we find that most published studies (> 70%) are undersequenced, i.e., their statistical power can be improved by increasing the sequencing read depth. In addition, the extent of saturation is highly dependent on statistical methodology: only 9.5%, 29.5%, and 26.6% of contrasts are saturated when using DESeq2, edgeR, and limma, respectively. Finally, we also find that there is no clear minimum per-transcript read depth to guarantee saturation for an entire technology. Therefore, our framework not only delineates key differences among methods and their impact on determining saturation, but will also be needed even as technology improves and the read depth of experiments increases. Researchers can thus use superSeq to calculate the read depth to achieve required statistical power while avoiding unnecessary sequencing costs.
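The saturation effect described in this abstract can be illustrated with a toy calculation. The sketch below is not superSeq: it assumes a single gene with Poisson-distributed counts, a two-sided Wald test on the log fold change, and a normal approximation, and every function and parameter name is made up for illustration.

```python
from math import log, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def wald_power(mean_count, fold_change, replicates, depth_fraction, alpha=0.05):
    """Approximate power to detect `fold_change` for one gene.

    Toy assumptions (not the superSeq model):
      * counts are Poisson, with `mean_count` reads per replicate in
        group A at full depth and `mean_count * fold_change` in group B;
      * sequencing at `depth_fraction` of full depth scales both means;
      * the test is a two-sided Wald z-test on the log ratio of totals.
    """
    total_a = replicates * depth_fraction * mean_count
    total_b = total_a * fold_change
    # Delta-method standard error of log(total_b / total_a).
    se = sqrt(1.0 / total_a + 1.0 / total_b)
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    effect = abs(log(fold_change)) / se
    return _STD_NORMAL.cdf(effect - z_crit) + _STD_NORMAL.cdf(-effect - z_crit)


if __name__ == "__main__":
    # Power at a quarter, full, and four times the original depth.
    for fraction in (0.25, 1.0, 4.0):
        power = wald_power(mean_count=20, fold_change=2.0,
                           replicates=3, depth_fraction=fraction)
        print(f"depth x{fraction}: power = {power:.3f}")
```

Under these assumptions, power rises steeply at low depth and then flattens, which matches the abstract's point that sequencing beyond the threshold buys only marginal improvements.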

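4. [Evolution] Suhua Shi's group (Sun Yat-sen University): taking mangroves as an example, how much of the genome can be freely exchanged between species?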

Genes and the species concept - How much of the genomes can be exchanged? (CC-BY-NC-ND 4.0)

In the biological species concept, much of the genomes cannot be exchanged between species1,2. In the modern genic view, species are distinct as long as genes that delineate the morphological, ecological and reproductive differences remain distinct2. The rest (or the bulk) of the genomes should be freely interchangeable. The core of the species concept therefore demands finding out the full potential of introgressions between species. In a survey of two closely related mangrove species (Rhizophora mucronata and R. stylosa) on the coasts of the western Pacific and Indian oceans, we found that the genomes are well delineated in allopatry, echoing their morphological and ecological divergence. The two species are sympatric/parapatric in the Daintree River area of northeastern Australia. In sympatry, their genomes harbor 7,700 and 3,100 introgression blocks, respectively, with each block averaging about 3-4 Kb. These fine-grained and strongly-penetrant introgressions suggest that each species must have evolved many differentially-adaptive (and, hence, non-introgressable) genes that contribute to speciation. We identify 30 such genes, seven of which are about flower development, within small genomic islets with a mean size of 1.4 Kb. In sympatry, the species-specific genomic islets account for only a small fraction (< 15%) of the genomes while the rest appears…

5. [Genomics] The avocado genome offers new clues to early angiosperm evolution

The Avocado Genome Informs Deep Angiosperm Phylogeny, Highlights Introgressive Hybridization, and Reveals Pathogen-Influenced Gene Space Adaptation (CC-BY-NC-ND 4.0)

The avocado, Persea americana, is a fruit crop of immense importance to Mexican agriculture with an increasing demand worldwide. Avocado lies in the anciently-diverged magnoliid clade of angiosperms, which has a controversial phylogenetic position relative to eudicots and monocots. We sequenced the nuclear genomes of the Mexican avocado race, P. americana var. drymifolia, and the most commercially popular hybrid cultivar, Hass, and anchored the latter to chromosomes using a genetic map. Resequencing of Guatemalan and West Indian varieties revealed that ~39% of the Hass genome represents Guatemalan source regions introgressed into a Mexican race background. Some introgressed blocks are extremely large, consistent with the recent origin of the cultivar. The avocado lineage experienced two lineage-specific polyploidy events during its evolutionary history. Although gene-tree/species-tree phylogenomic results are inconclusive, syntenic ortholog distances to other species place avocado as sister to the enormous monocot and eudicot lineages combined. Duplicate genes descending from polyploidy augmented the transcription factor diversity of avocado, while tandem duplicates enhanced the secondary metabolism of the species. Phenylpropanoid biosynthesis, known to be elicited by Colletotrichum (anthracnose) pathogen infection in avocado, is one enriched function among tandems. Furthermore, transcriptome data show that tandem duplicates are significantly up- and down-regulated in response to anthracnose infection, whereas polyploid duplicates are not, supporting the general view that collections of tandem duplicates contribute evolutionarily recent 'tuning knobs' in the genome adaptive landscapes of given species.

6. [Genomics] Columbia University researchers: use polygenic risk scores with care

Variable prediction accuracy of polygenic scores within an ancestry group (CC-BY 4.0)

Fields as diverse as human genetics and sociology are increasingly using polygenic scores based on genome-wide association studies (GWAS) for phenotypic prediction. However, recent work has shown that polygenic scores have limited portability across groups of different genetic ancestries, restricting the contexts in which they can be used reliably and potentially creating serious inequities in future clinical applications. Using the UK Biobank data, we demonstrate that even within a single ancestry group, the prediction accuracy of polygenic scores depends on characteristics such as the age or sex composition of the individuals in which the GWAS and the prediction were conducted, and on the GWAS study design. Our findings highlight both the complexities of interpreting polygenic scores and underappreciated obstacles to their broad use.
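For readers new to the term, a polygenic score is usually computed as a weighted sum of allele dosages, with GWAS effect sizes as weights, and its prediction accuracy is often summarized as the squared correlation with the observed phenotype. The sketch below shows that arithmetic and a per-group accuracy breakdown of the kind the abstract discusses. The function names and data layout are illustrative assumptions, not the authors' pipeline.

```python
import numpy as np


def polygenic_score(dosages, effect_sizes):
    """Weighted sum of allele dosages.

    dosages:      (individuals, variants), copies of the effect allele (0/1/2)
    effect_sizes: (variants,), per-allele GWAS effect estimates
    """
    return np.asarray(dosages, dtype=float) @ np.asarray(effect_sizes, dtype=float)


def r_squared(predicted, observed):
    """Squared Pearson correlation between score and phenotype."""
    r = np.corrcoef(predicted, observed)[0, 1]
    return r * r


def accuracy_by_group(scores, phenotype, groups):
    """Prediction accuracy computed separately within each group label
    (for example sex or an age bin)."""
    scores = np.asarray(scores, dtype=float)
    phenotype = np.asarray(phenotype, dtype=float)
    groups = np.asarray(groups)
    return {
        label.item(): r_squared(scores[groups == label], phenotype[groups == label])
        for label in np.unique(groups)
    }
```

Comparing the values returned by `accuracy_by_group` across strata is one simple way to see whether a score predicts equally well in, say, younger and older individuals.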

7. [Bioinformatics] 3D RNA-seq: a "foolproof" RNA-seq analysis platform tailored to biologists

3D RNA-seq - a powerful and flexible tool for rapid and accurate differential expression and alternative splicing analysis of RNA-seq data for biologists

RNA-sequencing (RNA-seq) analysis of gene expression and alternative splicing should be routine and robust but is often a bottleneck for biologists because of different and complex analysis programs and reliance on skilled bioinformaticians to perform the analysis. To overcome these issues, we have developed the '3D RNA-seq' App, an R shiny App which provides an easy-to-use, flexible and powerful tool for the three-way differential analysis: Differential Expression (DE), Differential Alternative Splicing (DAS) and Differential Transcript Usage (DTU) of RNA-seq data. The full analysis is extremely rapid and can be done within hours. The program integrates Limma, a state-of-the-art, highly rated differential expression analysis tool and adopts best practice for RNA-seq analysis. It runs the analysis through a user-friendly graphical interface, can handle complex experimental designs, allows user setting of statistical parameters, visualizes the results through graphics and tables, and generates publication quality figures such as heat-maps, expression profiles and GO enrichment plots. The utility of 3D RNA-seq is illustrated by analysis of Arabidopsis and mouse RNA-seq data. The program is designed to be run by biologists with minimal bioinformatics experience (or by bioinformaticians) allowing lab scientists to take control of the analysis of their RNA-seq data.

8. [Transcriptomics] Joshua Levin (Broad Institute): a systematic comparison of single-cell RNA-seq methods

Systematic comparative analysis of single cell RNA-sequencing methods (CC-BY 4.0)

A multitude of single-cell RNA sequencing methods have been developed in recent years, with dramatic advances in scale and power, and enabling major discoveries and large scale cell mapping efforts. However, these methods have not been systematically and comprehensively benchmarked. Here, we directly compare seven methods for single cell and/or single nucleus profiling from three types of samples – cell lines, peripheral blood mononuclear cells and brain tissue – generating 36 libraries in six separate experiments in a single center. To analyze these datasets, we developed and applied scumi, a flexible computational pipeline that can be used for any scRNA-seq method. We evaluated the methods for both basic performance and for their ability to recover known biological information in the samples. Our study will help guide experiments with the methods in this study as well as serve as a benchmark for future studies and for computational algorithm development.

 

9. [Metagenomics] Pevzner (University of California, San Diego): metaFlye, a long-read metagenome assembler based on repeat graphs

metaFlye: scalable long-read metagenome assembly using repeat graphs (CC-BY-NC-ND 4.0)

Long-read sequencing technologies substantially improved assemblies of many isolate bacterial genomes as compared to fragmented assemblies produced with short-read technologies. However, assembling complex metagenomic datasets remains a challenge even for the state-of-the-art long-read assemblers. To address this gap, we present the metaFlye assembler and demonstrate that it generates highly contiguous and accurate metagenome assemblies. In contrast to short-read metagenomics assemblers that typically fail to reconstruct full-length 16S RNA genes, metaFlye captures many 16S RNA genes within long contigs, thus providing new opportunities for analyzing the microbial “dark matter of life”. We also demonstrate that long-read metagenome assemblers significantly improve full-length plasmid and virus reconstruction as compared to short-read assemblers and reveal many novel plasmids and viruses.

10. [Bioinformatics] Milan Malinsky (University of Basel, Switzerland): Dsuite, a new tool for fast D-statistics from VCF files

Dsuite - fast D-statistics and related admixture evidence from VCF files (CC-BY 4.0)

Summary: The D-statistic, also known as the ABBA-BABA statistic, and related statistics are commonly used to assess evidence of gene flow between populations or closely related species. While the calculations are not computationally intensive, currently available implementations require custom file formats and are impractical to evaluate all gene flow hypotheses across datasets that include many populations or species. Dsuite is a fast C++ implementation, allowing genome scale calculations of the D-statistic across all combinations of tens or even hundreds of populations or species directly from a variant call format (VCF) file. Furthermore, the program can estimate the admixture fraction and provide evidence of whether introgression is confined to specific loci. Thus Dsuite facilitates assessment of gene flow across large genomic datasets. Availability and implementation: Source code and documentation are available at: https://github.com/millanek/Dsuite
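As a reminder of what is being computed, the textbook allele-frequency form of the D-statistic for a trio (P1, P2, P3) plus an outgroup can be sketched in a few lines. This is not Dsuite's code: it assumes the outgroup is fixed for the ancestral allele at every site, and the input layout and function name are made up for illustration.

```python
def d_statistic(sites):
    """ABBA-BABA D-statistic from per-site derived-allele frequencies.

    sites: iterable of (p1, p2, p3) tuples, the derived-allele frequency
           in populations P1, P2 and P3 at each biallelic site.
    Assumption: the outgroup carries only the ancestral allele.
    """
    abba = 0.0
    baba = 0.0
    for p1, p2, p3 in sites:
        abba += (1.0 - p1) * p2 * p3   # pattern A B B A
        baba += p1 * (1.0 - p2) * p3   # pattern B A B A
    total = abba + baba
    if total == 0.0:
        raise ValueError("no informative sites")
    return (abba - baba) / total
```

A value near 0 is what incomplete lineage sorting alone would predict, while an excess of either pattern suggests gene flow between P3 and P2 (positive D) or between P3 and P1 (negative D). The number of trios grows quickly with the number of populations: 100 populations already give `math.comb(100, 3)` = 161,700 trios, which is why a fast implementation that reads VCF directly is useful.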

References

1. Abdill R., Blekhman R. Meta-Research: Tracking the popularity and outcomes of all bioRxiv preprints. eLife, 2019; 8:e45133

2. 生信人: Five years of bioRxiv articles in review: bioinformatics leads the field
