Meet our team!

Serghei Mangul, Ph.D.
PI, Mangul Lab
mangul@usc.edu
Mangul Lab papers authored by Serghei
Mandric, Igor; Rotman, Jeremy; Yang, Harry Taegyun; Strauli, Nicolas; Montoya, Dennis; Lay, Will Van Der; Ronas, Jiem R; Statz, Benjamin; Yao, Douglas; Petrova, Velislava; Zelikovsky, Alex; Spreafico, Roberto; Shifman, Sagiv; Zaitlen, Noah; Rossetti, Maura; Ansel, Mark K; Eskin, Eleazar; Mangul, Serghei. Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing. Journal Article. Nature Communications, 11 (3126), 2020.
@article{mangul2016profiling, title = {Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing}, author = {Igor Mandric and Jeremy Rotman and Harry Taegyun Yang and Nicolas Strauli and Dennis Montoya and Will Van Der Lay and Jiem R Ronas and Benjamin Statz and Douglas Yao and Velislava Petrova and Alex Zelikovsky and Roberto Spreafico and Sagiv Shifman and Noah Zaitlen and Maura Rossetti and K. Mark Ansel and Eleazar Eskin and Serghei Mangul}, url = {https://doi.org/10.1038/s41467-020-16857-7}, doi = {10.1038/s41467-020-16857-7}, year = {2020}, date = {2020-06-19}, journal = {Nature Communications}, volume = {11}, number = {3126}, publisher = {Nature Publications}, abstract = {Profiling immunoglobulin (Ig) receptor repertoires with specialized assays can be cost-ineffective and time-consuming. Here we report ImReP, a computational method for rapid and accurate profiling of the Ig repertoire, including the complementary-determining region 3 (CDR3), using regular RNA sequencing data such as those from 8,555 samples across 53 tissue types from 544 individuals in the Genotype-Tissue Expression (GTEx v6) project. Using ImReP and GTEx v6 data, we generate a collection of 3.6 million Ig sequences, termed the atlas of immunoglobulin repertoires (TAIR), across a broad range of tissue types that often do not have reported Ig repertoires information. Moreover, the flow of Ig clonotypes and inter-tissue repertoire similarities across immune-related tissues are also evaluated. In summary, TAIR is one of the largest collections of CDR3 sequences and tissue types, and should serve as an important resource for studying immunological diseases.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Oh, David Y; Kwek, Serena S; Raju, Siddharth S; Li, Tony; McCarthy, Elizabeth; Chow, Eric; Aran, Dvir; Ilano, Arielle; Pai, Chien-Chun Steven; Rancan, Chiara; Allaire, Kathryn; Burra, Arun; Sun, Yang; Spitzer, Matthew H; Mangul, Serghei; Porten, Sima; Meng, Maxwell V; Friedlander, Terence W; Ye, Chun Jimmie; Fong, Lawrence. Intratumoral CD4+ T Cells Mediate Anti-tumor Cytotoxicity in Human Bladder Cancer. Journal Article. Cell, 2020.
@article{Y.Oh2020, title = {Intratumoral CD4+ T Cells Mediate Anti-tumor Cytotoxicity in Human Bladder Cancer}, author = {David Y. Oh and Serena S. Kwek and Siddharth S. Raju and Tony Li and Elizabeth McCarthy and Eric Chow and Dvir Aran and Arielle Ilano and Chien-Chun Steven Pai and Chiara Rancan and Kathryn Allaire and Arun Burra and Yang Sun and Matthew H. Spitzer and Serghei Mangul and Sima Porten and Maxwell V. Meng and Terence W. Friedlander and Chun Jimmie Ye and Lawrence Fong}, url = {https://doi.org/10.1016/j.cell.2020.05.017}, doi = {10.1016/j.cell.2020.05.017}, year = {2020}, date = {2020-06-03}, journal = {Cell}, abstract = {Responses to anti-PD-1 immunotherapy occur but are infrequent in bladder cancer. The specific T cells that mediate tumor rejection are unknown. T cells from human bladder tumors and non-malignant tissue were assessed with single-cell RNA and paired T cell receptor (TCR) sequencing of 30,604 T cells from 7 patients. We find that the states and repertoires of CD8+ T cells are not distinct in tumors compared with non-malignant tissues. In contrast, single-cell analysis of CD4+ T cells demonstrates several tumor-specific states, including multiple distinct states of regulatory T cells. Surprisingly, we also find multiple cytotoxic CD4+ T cell states that are clonally expanded. These CD4+ T cells can kill autologous tumors in an MHC class II-dependent fashion and are suppressed by regulatory T cells. Further, a gene signature of cytotoxic CD4+ T cells in tumors predicts a clinical response in 244 metastatic bladder cancer patients treated with anti-PD-L1.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Brito, Jaqueline J; Li, Jun; Moore, Jason H; Greene, Casey S; Nogoy, Nicole A; Garmire, Lana X; Mangul, Serghei. Recommendations to enhance rigor and reproducibility in biomedical research. Journal Article. GigaScience, 9 (6), pp. giaa056, 2020.
@article{Brito2020, title = {Recommendations to enhance rigor and reproducibility in biomedical research}, author = {Jaqueline J Brito and Jun Li and Jason H Moore and Casey S Greene and Nicole A Nogoy and Lana X Garmire and Serghei Mangul}, url = {https://doi.org/10.1093/gigascience/giaa056}, doi = {10.1093/gigascience/giaa056}, year = {2020}, date = {2020-06-01}, journal = {GigaScience}, volume = {9}, number = {6}, pages = {giaa056}, abstract = {Biomedical research depends increasingly on computational tools, but mechanisms ensuring open data, open software, and reproducibility are variably enforced by academic institutions, funders, and publishers. Publications may present software for which source code or documentation are or become unavailable; this compromises the role of peer review in evaluating technical strength and scientific contribution. Incomplete ancillary information for an academic software package may bias or limit subsequent work. We provide 8 recommendations to improve reproducibility, transparency, and rigor in computational biology, precisely the values that should be emphasized in life science curricula. Our recommendations for improving software availability, usability, and archival stability aim to foster a sustainable data science ecosystem in life science research.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei. A comprehensive benchmarking of WGS-based structural variant callers. Journal Article. bioRxiv, 2020.
@article{Sarwal2020, title = {A comprehensive benchmarking of WGS-based structural variant callers}, author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul}, url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1}, doi = {10.1101/2020.04.16.045120}, year = {2020}, date = {2020-04-18}, journal = {bioRxiv}, abstract = {Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Loeffler, Caitlin; Karlsberg, Aaron; Martin, Lana S; Eskin, Eleazar; Koslicki, David; Mangul, Serghei. Improving the usability and comprehensiveness of microbial databases. Journal Article. BMC Biology, 18 (37), 2020.
@article{microbial2020, title = {Improving the usability and comprehensiveness of microbial databases}, author = {Caitlin Loeffler and Aaron Karlsberg and Lana S Martin and Eleazar Eskin and David Koslicki and Serghei Mangul}, url = {https://doi.org/10.1186/s12915-020-0756-z}, doi = {10.1186/s12915-020-0756-z}, year = {2020}, date = {2020-04-07}, journal = {BMC Biology}, volume = {18}, number = {37}, abstract = {Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehensiveness, and calling for the establishment of recommendations for reference genome database assembly. Here, we analyze existing fungal and bacterial databases and discuss guidelines for the development of a master reference database that promises to improve the quality and quantity of omics research.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei. Benchmarking of computational error-correction methods for next-generation sequencing data. Journal Article. Genome Biology, 21 (71), 2020.
@article{mitchell2019benchmarking, title = {Benchmarking of computational error-correction methods for next-generation sequencing data}, author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul}, url = {https://doi.org/10.1186/s13059-020-01988-3}, doi = {10.1186/s13059-020-01988-3}, year = {2020}, date = {2020-03-17}, journal = {Genome Biology}, volume = {21}, number = {71}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Alser, Mohammed; Rotman, Jeremy; Taraszka, Kodi; Shi, Huwenbo; Baykal, Pelin Icer; Yang, Harry Taegyun; Xue, Victor; Knyazev, Sergey; Singer, Benjamin D; Balliu, Brunilda; Koslicki, David; Skums, Pavel; Zelikovsky, Alex; Alkan, Can; Mutlu, Onur; Mangul, Serghei. Technology dictates algorithms: Recent developments in read alignment. Journal Article. arXiv, 2020.
@article{Alser2020, title = {Technology dictates algorithms: Recent developments in read alignment}, author = {Mohammed Alser and Jeremy Rotman and Kodi Taraszka and Huwenbo Shi and Pelin Icer Baykal and Harry Taegyun Yang and Victor Xue and Sergey Knyazev and Benjamin D Singer and Brunilda Balliu and David Koslicki and Pavel Skums and Alex Zelikovsky and Can Alkan and Onur Mutlu and Serghei Mangul}, url = {https://arxiv.org/abs/2003.00110}, doi = {2003.00110}, year = {2020}, date = {2020-02-28}, journal = {arXiv}, abstract = {Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Our review provides a survey of algorithmic foundations and methodologies across alignment methods for both short and long reads. We provide rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read aligners. We separately discuss how longer read lengths produce unique advantages and limitations to read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Johnson, Ruth; Mangul, Serghei. Refining the conference experience for junior scientists in the wake of climate change. Journal Article. arXiv, 2020.
@article{Johnson2020, title = {Refining the conference experience for junior scientists in the wake of climate change}, author = {Ruth Johnson and Serghei Mangul}, url = {https://arxiv.org/abs/2002.12268}, year = {2020}, date = {2020-02-18}, journal = {arXiv}, abstract = {With the ever-increasing carbon footprint associated with conferences, scientists can learn to refine their conference experiences when they do need to travel. We offer insight on how to optimize the conference experience through attending speaker sessions, giving presentations, and networking.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Brito, Jaqueline J; Mosqueiro, Thiago; Rotman, Jeremy; Xue, Victor; Chapski, Douglas J; la Hoz, Juan De; Matias, Paulo; Martin, Lana S; Zelikovsky, Alex; Pellegrini, Matteo; Mangul, Serghei. Telescope: an interactive tool for managing large scale analysis from mobile devices. Journal Article. GigaScience, 9 (1), pp. giz163, 2020.
@article{Brito2019, title = {Telescope: an interactive tool for managing large scale analysis from mobile devices}, author = {Jaqueline J Brito and Thiago Mosqueiro and Jeremy Rotman and Victor Xue and Douglas J Chapski and Juan De la Hoz and Paulo Matias and Lana S Martin and Alex Zelikovsky and Matteo Pellegrini and Serghei Mangul}, url = {https://doi.org/10.1093/gigascience/giz163}, doi = {10.1093/gigascience/giz163}, year = {2020}, date = {2020-01-23}, journal = {GigaScience}, volume = {9}, number = {1}, pages = {giz163}, abstract = {Background: In today's world of big data, computational analysis has become a key driver of biomedical research. High-performance computational facilities are capable of processing considerable volumes of data, yet often lack an easy-to-use interface to guide the user in supervising and adjusting bioinformatics analysis via a tablet or smartphone. Results: To address this gap we proposed Telescope, a novel tool that interfaces with high-performance computational clusters to deliver an intuitive user interface for controlling and monitoring bioinformatics analyses in real-time. By leveraging last generation technology now ubiquitous to most researchers (such as smartphones), Telescope delivers a friendly user experience and manages connectivity and encryption under the hood. Conclusions: Telescope helps to mitigate the digital divide between wet and computational laboratories in contemporary biology. By delivering convenience and ease of use through a user experience not relying on expertise with computational clusters, Telescope can help researchers close the feedback loop between bioinformatics and experimental work with minimal impact on the performance of computational tools. Telescope is freely available at https://github.com/Mangul-Lab-USC/telescope.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
LaPierre, Nathan; Alser, Mohammed; Eskin, Eleazar; Koslicki, David; Mangul, Serghei. Metalign: Efficient alignment-based metagenomic profiling via containment min hash. Journal Article. bioRxiv, 2020.
@article{LaPierre2020, title = {Metalign: Efficient alignment-based metagenomic profiling via containment min hash}, author = {Nathan LaPierre and Mohammed Alser and Eleazar Eskin and David Koslicki and Serghei Mangul}, url = {https://doi.org/10.1101/2020.01.17.910521}, doi = {10.1101/2020.01.17.910521}, year = {2020}, date = {2020-01-18}, journal = {bioRxiv}, abstract = {Whole-genome shotgun sequencing enables the analysis of microbial communities in unprecedented detail, with major implications in medicine and ecology. Predicting the presence and relative abundances of microbes in a sample, known as “metagenomic profiling”, is a critical first step in microbiome analysis. Existing profiling methods have been shown to suffer from poor false positive or false negative rates, while alignment-based approaches are often considered accurate but computationally infeasible. Here we present a novel method, Metalign, that addresses these concerns by performing efficient alignment-based metagenomic profiling. We use a containment min hash approach to reduce the reference database size dramatically before alignment and a method to estimate organism relative abundances in the sample by resolving reads aligned to multiple genomes. We show that Metalign achieves significantly improved results over existing methods on simulated datasets from a large benchmarking study, CAMI, and performs well on in vitro mock community data and environmental data from the Tara Oceans project. Metalign is freely available at https://github.com/nlapier2/Metalign, along with the results and plots used in this paper, and a docker image is also available at https://hub.docker.com/repository/docker/nlapier2/metalign.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Loeffler, Caitlin; Gibson, Keylie M; Martin, Lana S; Chang, Yutong; Rotman, Jeremy; Toma, Ian V; Mason, Christopher E; Eskin, Eleazar; Zackular, Joseph P; Crandall, Keith A; Koslicki, David; Mangul, Serghei. Metagenomics for clinical diagnostics: technologies and informatics. Journal Article. arXiv, 2019.
@article{Loeffler2019b, title = {Metagenomics for clinical diagnostics: technologies and informatics}, author = {Caitlin Loeffler and Keylie M Gibson and Lana S Martin and Yutong Chang and Jeremy Rotman and Ian V Toma and Christopher E Mason and Eleazar Eskin and Joseph P Zackular and Keith A Crandall and David Koslicki and Serghei Mangul}, url = {https://arxiv.org/abs/1911.11304}, doi = {1911.11304}, year = {2019}, date = {2019-11-25}, journal = {arXiv}, abstract = {The human-associated microbiome is closely tied to human health and is of substantial clinical interest. Metagenomics-based tools are emerging for clinical diagnostics, tracking the spread of diseases, and surveillance of potential pathogens. In some cases, these tools are overcoming limitations of traditional clinical approaches. Metagenomics has limitations barring the tools from clinical validation. Once these hurdles are overcome, clinical metagenomics will inform doctors of the best, targeted treatment for their patients and provide early detection of disease. Here we present an overview of metagenomics methods with a discussion of computational challenges and limitations.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Bhat, Suraj P; Gangalum, Rajendra K; Kim, Dongjae; Mangul, Serghei; Kashyap, Raj K; Zhou, Xinkai; Elashoff, David. Transcriptional profiling of single fiber cells in a transgenic paradigm of an inherited childhood cataract reveals absence of molecular heterogeneity. Journal Article. Journal of Biological Chemistry, 294, pp. 13530-13544, 2019.
@article{bhat2019transcriptional,
  title = {Transcriptional profiling of single fiber cells in a transgenic paradigm of an inherited childhood cataract reveals absence of molecular heterogeneity},
  author = {Suraj P Bhat and Rajendra K Gangalum and Dongjae Kim and Serghei Mangul and Raj K Kashyap and Xinkai Zhou and David Elashoff},
  url = {https://doi.org/10.1074/jbc.RA119.008853},
  doi = {10.1074/jbc.RA119.008853},
  year = {2019},
  date = {2019-09-13},
  journal = {Journal of Biological Chemistry},
  volume = {294},
  pages = {13530-13544},
  publisher = {ASBMB},
  abstract = {Our recent single-cell transcriptomic analysis has demonstrated that heterogeneous transcriptional activity attends molecular transition from the nascent to terminally differentiated fiber cells in the developing mouse lens. To understand the role of transcriptional heterogeneity in terminal differentiation and the functional phenotype (transparency) of this tissue, here we present a single-cell analysis of the developing lens, in a transgenic paradigm of an inherited pathology, known as the lamellar cataract. Cataracts hinder transmission of light into the eye. Lamellar cataract is the most prevalent bilateral childhood cataract. In this disease of early infancy, initially, the opacities remain confined to a few fiber cells, thus presenting an opportunity to investigate early molecular events that lead to cataractogenesis. We used a previously established paradigm that faithfully recapitulates this disease in transgenic mice. About 500 single fiber cells, manually isolated from a 2-day-old transgenic lens, were interrogated individually for the expression of all known 17 crystallins and 78 other relevant genes using a Biomark HD (Fluidigm). We find that fiber cells from spatially and developmentally discrete regions of the transgenic (cataract) lens show remarkable absence of the heterogeneity of gene expression. Importantly, the molecular variability of cortical fiber cells, the hallmark of the WT lens, is absent in the transgenic cataract, suggesting absence of specific cell-type(s). Interestingly, we find a repetitive pattern of gene activity in progressive states of differentiation in the transgenic lens. This molecular dysfunction portends pathology much before the physical manifestations of the disease.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul, Serghei. Interpreting and integrating big data in the life sciences. Journal Article. Emerging Topics in Life Sciences, 3 (4), pp. 335-341, 2019.
@article{mangul2019interpreting,
  title = {Interpreting and integrating big data in the life sciences},
  author = {Serghei Mangul},
  url = {https://doi.org/10.1042/ETLS20180175},
  doi = {10.1042/ETLS20180175},
  year = {2019},
  date = {2019-06-26},
  journal = {Emerging Topics in Life Sciences},
  volume = {3},
  number = {4},
  pages = {335-341},
  publisher = {Portland Press Journals portal},
  abstract = {Recent advances in omics technologies have led to the broad applicability of computational techniques across various domains of life science and medical research. These technologies provide an unprecedented opportunity to collect omics data from hundreds of thousands of individuals and to study gene–disease associations without the aid of prior assumptions about the trait biology. Despite the many advantages of modern omics technologies, interpreting the big data they produce requires advanced computational algorithms. I outline key challenges that biomedical researchers face when interpreting and integrating big omics data. I discuss the reproducibility aspect of big data analysis in the life sciences and review current practices in reproducible research. Finally, I explain the skills that biomedical researchers need to acquire to independently analyze big omics data.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul, Serghei; Mosqueiro, Thiago; Abdill, Richard J; Duong, Dat; Mitchell, Keith; Sarwal, Varuni; Hill, Brian L; Brito, Jaqueline J; Littman, Russell Jared; Statz, Benjamin. Challenges and recommendations to improve the installability and archival stability of omics computational tools. Journal Article. PLoS Biology, 17 (6), pp. e3000333, 2019.
@article{mangul2019challenges,
  title = {Challenges and recommendations to improve the installability and archival stability of omics computational tools},
  author = {Serghei Mangul and Thiago Mosqueiro and Richard J Abdill and Dat Duong and Keith Mitchell and Varuni Sarwal and Brian L Hill and Jaqueline J Brito and Russell Jared Littman and Benjamin Statz},
  url = {https://doi.org/10.1371/journal.pbio.3000333},
  doi = {10.1371/journal.pbio.3000333},
  year = {2019},
  date = {2019-06-20},
  journal = {PLoS Biology},
  volume = {17},
  number = {6},
  pages = {e3000333},
  publisher = {Public Library of Science},
  abstract = {Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. Among the 98 software tools selected for our installability test, 51% were deemed “easy to install,” and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
LaPierre, Nathan; Mangul, Serghei; Alser, Mohammed; Mandric, Igor; Wu, Nicholas C; Koslicki, David; Eskin, Eleazar. MiCoP: microbial community profiling method for detecting viral and fungal organisms in metagenomic samples. Journal Article. BMC Genomics, 20 (5), pp. 423, 2019.
@article{lapierre2019micop,
  title = {MiCoP: microbial community profiling method for detecting viral and fungal organisms in metagenomic samples},
  author = {Nathan LaPierre and Serghei Mangul and Mohammed Alser and Igor Mandric and Nicholas C Wu and David Koslicki and Eleazar Eskin},
  url = {https://doi.org/10.1186/s12864-019-5699-9},
  doi = {10.1186/s12864-019-5699-9},
  year = {2019},
  date = {2019-06-06},
  journal = {BMC Genomics},
  volume = {20},
  number = {5},
  pages = {423},
  publisher = {BioMed Central},
  abstract = {Background: High throughput sequencing has spurred the development of metagenomics, which involves the direct analysis of microbial communities in various environments such as soil, ocean water, and the human body. Many existing methods based on marker genes or k-mers have limited sensitivity or are too computationally demanding for many users. Additionally, most work in metagenomics has focused on bacteria and archaea, neglecting to study other key microbes such as viruses and eukaryotes. Results: Here we present a method, MiCoP (Microbiome Community Profiling), that uses fast-mapping of reads to build a comprehensive reference database of full genomes from viruses and eukaryotes to achieve maximum read usage and enable the analysis of the virome and eukaryome in each sample. We demonstrate that mapping of metagenomic reads is feasible for the smaller viral and eukaryotic reference databases. We show that our method is accurate on simulated and mock community data and identifies many more viral and fungal species than previously reported results on real data from the Human Microbiome Project. Conclusions: MiCoP is a mapping-based method that proves more effective than existing methods at abundance profiling of viruses and eukaryotes in metagenomic samples. MiCoP can be used to detect the full diversity of these communities. The code, data, and documentation are publicly available on GitHub at: https://github.com/smangul1/MiCoP.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul, Serghei; Martin, Lana S; Hill, Brian L; Lam, Angela Ka-Mei; Distler, Margaret G; Zelikovsky, Alex; Eskin, Eleazar; Flint, Jonathan. Systematic benchmarking of omics computational tools. Journal Article. Nature Communications, 10 (1393), pp. 1-11, 2019.
@article{mangul2019systematic,
  title = {Systematic benchmarking of omics computational tools},
  author = {Serghei Mangul and Lana S Martin and Brian L Hill and Angela Ka-Mei Lam and Margaret G Distler and Alex Zelikovsky and Eleazar Eskin and Jonathan Flint},
  url = {https://doi.org/10.1038/s41467-019-09406-4},
  doi = {10.1038/s41467-019-09406-4},
  year = {2019},
  date = {2019-03-27},
  journal = {Nature Communications},
  volume = {10},
  number = {1393},
  pages = {1-11},
  publisher = {Nature Publishing Group},
  abstract = {Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul, Serghei; Martin, Lana S; Langmead, Ben; Sanchez-Galan, Javier E; Toma, Ian V; Hormozdiari, Fereydoun; Pevzner, Pavel; Eskin, Eleazar. How bioinformatics and open data can boost basic science in countries and universities with limited resources. Journal Article. Nature Biotechnology, 37 (3), pp. 324, 2019.
@article{mangul2019bioinformatics,
  title = {How bioinformatics and open data can boost basic science in countries and universities with limited resources},
  author = {Serghei Mangul and Lana S Martin and Ben Langmead and Javier E Sanchez-Galan and Ian V Toma and Fereydoun Hormozdiari and Pavel Pevzner and Eleazar Eskin},
  url = {https://doi.org/10.1038/s41587-019-0053-y},
  doi = {10.1038/s41587-019-0053-y},
  year = {2019},
  date = {2019-03-04},
  journal = {Nature Biotechnology},
  volume = {37},
  number = {3},
  pages = {324},
  publisher = {Nature Publishing Group},
  abstract = {Providing training and access to standard computing hardware and cloud-based resources can enable scientists in lower-resource institutions and countries to reanalyze published ‘-omics’ data and produce career-enhancing STEM research.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul, Serghei; Yang, Harry Taegyun; Eskin, Eleazar; Zaitlen, Noah. Hidden Treasures in Contemporary RNA Sequencing. Book Chapter. Hidden Treasures in Contemporary RNA Sequencing. SpringerBriefs in Computer Science, pp. 1–93, Springer, 2019.
@inbook{mangul2019hidden,
  title = {Hidden Treasures in Contemporary RNA Sequencing},
  author = {Serghei Mangul and Harry Taegyun Yang and Eleazar Eskin and Noah Zaitlen},
  url = {https://doi.org/10.1007/978-3-030-13973-5_1},
  doi = {10.1007/978-3-030-13973-5_1},
  year = {2019},
  date = {2019-03-02},
  booktitle = {Hidden Treasures in Contemporary RNA Sequencing. SpringerBriefs in Computer Science},
  pages = {1--93},
  publisher = {Springer},
  abstract = {High throughput RNA sequencing technologies have provided unprecedented opportunity to explore the individual transcriptome. Unmapped reads, the reads failing to map to the human reference, are a large and often overlooked output of standard RNA-Seq analyses; the hidden treasure in the contemporary RNA-Seq analysis is within the unmapped reads, illuminating previously unexplored biological insights. Here we develop Read Origin Protocol (ROP) to discover the source of all reads originating from complex RNA molecules, recombinant T and B cell receptors, and microbial communities. We applied ROP to 10,641 samples across 2,630 individuals from 54 diverse adult human tissues. Our approach can account for 99.9% of 1 trillion reads of various read lengths. Using in-house RNA-Seq data, we show that immune profiles of asthmatic individuals are significantly different from the profiles of control individuals, with decreased average per sample T and B cell receptor diversity. We also show that microbiomes can be detected in human blood via RNA-Sequencing and may elucidate important clinical changes in patients with schizophrenia. Furthermore, we demonstrate that receptor-derived reads among other hidden reads can be used to characterize the overall Ig repertoire across diverse human tissues using RNA-Sequencing. Our results demonstrate the potential of ROP to exploit the hidden treasure in contemporary RNA-Sequencing in order to better understand the functional mechanisms underlying connections between the immune system, microbiome, human gene expression, and disease etiology.},
  keywords = {},
  pubstate = {published},
  tppubtype = {inbook}
}
Mangul, Serghei; Martin, Lana S; Eskin, Eleazar; Blekhman, Ran. Improving the usability and archival stability of bioinformatics software. Journal Article. Genome Biology, 20 (47), pp. 1-3, 2019.
@article{mangul2019improving,
  title = {Improving the usability and archival stability of bioinformatics software},
  author = {Serghei Mangul and Lana S Martin and Eleazar Eskin and Ran Blekhman},
  url = {https://doi.org/10.1186/s13059-019-1649-8},
  doi = {10.1186/s13059-019-1649-8},
  year = {2019},
  date = {2019-02-27},
  journal = {Genome Biology},
  volume = {20},
  number = {47},
  pages = {1-3},
  publisher = {BioMed Central},
  abstract = {Implementation of bioinformatics software involves numerous unique challenges; a rigorous standardized approach is needed to examine software tools prior to their publication.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Bhat, Suraj P; Gangalum, Rajendra K; Kim, Dongjae; Mangul, Serghei; Kashyap, Raj K; Zhou, Xinkai; Elashoff, David. Absence of Single Cell Transcriptional Heterogeneity in the transgenic paradigm of the inherited Lamellar cataract. Journal Article. Investigative Ophthalmology & Visual Science, 60 (9), pp. 1381-1381, 2019.
@article{bhat2019absence,
  title = {Absence of Single Cell Transcriptional Heterogeneity in the transgenic paradigm of the inherited Lamellar cataract},
  author = {Suraj P Bhat and Rajendra K Gangalum and Dongjae Kim and Serghei Mangul and Raj K Kashyap and Xinkai Zhou and David Elashoff},
  year = {2019},
  date = {2019-01-01},
  journal = {Investigative Ophthalmology \& Visual Science},
  volume = {60},
  number = {9},
  pages = {1381-1381},
  publisher = {The Association for Research in Vision and Ophthalmology},
  abstract = {We have recently shown that highly variable transcriptional activity, in single cortical fiber cells, mediates the transition from nascent to terminally differentiated fiber cells, in the developing ocular lens. We have now investigated the status of this heterogeneity in cataractogenesis, with a motivation to probe the earliest molecular events that presage the visible appearance of the cataract pathology. We have used the postnatal inherited opacities in a transgenic model of the most prevalent childhood cataract, the Lamellar cataract, as a paradigm.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Gangalum, Rajendra K; Kim, Dongjae; Kashyap, Raj K; Mangul, Serghei; Zhou, Xinkai; Elashoff, David; Bhat, Suraj P. Spatial Analysis of Single Fiber Cells of the Developing Ocular Lens Reveals Regulated Heterogeneity of Gene Expression. Journal Article. iScience, 10, pp. 66–79, 2018.
@article{gangalum2018spatial,
  title = {Spatial Analysis of Single Fiber Cells of the Developing Ocular Lens Reveals Regulated Heterogeneity of Gene Expression},
  author = {Rajendra K Gangalum and Dongjae Kim and Raj K Kashyap and Serghei Mangul and Xinkai Zhou and David Elashoff and Suraj P Bhat},
  url = {https://doi.org/10.1016/j.isci.2018.11.024},
  doi = {10.1016/j.isci.2018.11.024},
  year = {2018},
  date = {2018-12-21},
  journal = {iScience},
  volume = {10},
  pages = {66--79},
  publisher = {Elsevier},
  abstract = {The developing eye lens presents an exceptional paradigm for spatial transcriptomics. It is composed of highly organized long, slender transparent fiber cells, which differentiate from the edges of the anterior epithelium of the lens (equator), attended by high expression of crystallins, which generates transparency. Every fiber cell, therefore, is an optical unit whose refractive properties derive from its gene activity. Here, we probe this tangible relationship between the gene activity and the phenotype by studying the expression of all known 17 crystallins and 77 other non-crystallin genes in single fiber cells isolated from three states/regions of differentiation, allowing us to follow molecular progression at the single-cell level. The data demonstrate highly variable gene activity in cortical fibers, interposed between the nascent and the terminally differentiated fiber cell transcription. These data suggest that the so-called stochastic, highly heterogeneous gene activity is a regulated intermediate in the realization of a functional phenotype.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mitchell, Keith; Dao, Chris; Freise, Amanda; Mangul, Serghei; Parker, Jordan Moberg. PUMA: A tool for processing 16S rRNA taxonomy data for analysis and visualization. Journal Article. bioRxiv, pp. 482380, 2018.
@article{mitchell2018puma,
  title = {PUMA: A tool for processing 16S rRNA taxonomy data for analysis and visualization},
  author = {Keith Mitchell and Chris Dao and Amanda Freise and Serghei Mangul and Jordan Moberg Parker},
  url = {https://doi.org/10.1101/482380},
  doi = {10.1101/482380},
  year = {2018},
  date = {2018-11-29},
  journal = {bioRxiv},
  pages = {482380},
  publisher = {Cold Spring Harbor Laboratory},
  abstract = {Microbial community profiling and functional inference via 16S rRNA analysis is quickly expanding across various areas of microbiology due to improvements in technology. There are numerous platforms for producing 16S rRNA taxonomic data which often vary in file and sequence formatting, creating a common barrier in microbiome studies. Additionally, many of the methods for analyzing and visualizing this sequencing data each require their own specific formatting. As a result, efficient and reproducible comparative analysis of taxonomic data and corresponding metadata in multiple programs remains a challenge in the investigation of microbial communities. PUMA, the Program for Unifying Microbiome Analysis, alleviates this problem in microbiome studies by allowing users to take advantage of numerous 16S rRNA taxonomic identification platforms and analysis tools in an efficient manner. PUMA accepts sequencing results from several taxonomic identification platforms and then automates configuration of data and file types for analysis and visualization via many popular tools. The protocol accomplishes this by producing a variety of properly configured, annotated, and altered files for both analysis and visualization of taxonomic community profiles and inferred functional profiles. PUMA provides an easy and flexible interface that accommodates a variety of users and produces all files needed for all-inclusive analysis of targeted amplicon sequencing studies. PUMA is an unprecedented open-source solution for unifying multiple microbiome analysis software tools and uses an adaptable implementation with the potential to improve and consolidate the state of microbiome research.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Chiang, Charleston WK; Mangul, Serghei; Robles, Christopher; Sankararaman, Sriram. A comprehensive map of genetic variation in the world’s largest ethnic group—Han Chinese. Journal Article. Molecular Biology and Evolution, 35 (11), pp. 2736–2750, 2018.
@article{chiang2018comprehensive,
  title = {A comprehensive map of genetic variation in the world’s largest ethnic group—Han Chinese},
  author = {Charleston WK Chiang and Serghei Mangul and Christopher Robles and Sriram Sankararaman},
  url = {https://doi.org/10.1093/molbev/msy170},
  doi = {10.1093/molbev/msy170},
  year = {2018},
  date = {2018-08-30},
  journal = {Molecular Biology and Evolution},
  volume = {35},
  number = {11},
  pages = {2736--2750},
  publisher = {Oxford University Press},
  abstract = {As are most non-European populations, the Han Chinese are relatively understudied in population and medical genetics studies. From low-coverage whole-genome sequencing of 11,670 Han Chinese women we present a catalog of 25,057,223 variants, including 548,401 novel variants that are seen at least 10 times in our data set. Individuals from this data set came from 24 out of 33 administrative divisions across China (including 19 provinces, 4 municipalities, and 1 autonomous region), thus allowing us to study population structure, genetic ancestry, and local adaptation in Han Chinese. We identified previously unrecognized population structure along the East–West axis of China, demonstrated a general pattern of isolation-by-distance among Han Chinese, and reported unique regional signals of admixture, such as European influences among the Northwestern provinces of China. Furthermore, we identified a number of highly differentiated, putatively adaptive, loci (e.g., MTHFR, ADH7, and FADS, among others) that may be driven by immune response, climate, and diet in the Han Chinese. Finally, we have made available allele frequency estimates stratified by administrative divisions across China in the Geography of Genetic Variant browser for the broader community. By leveraging the largest currently available genetic data set for Han Chinese, we have gained insights into the history and population structure of the world’s largest ethnic group.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Loohuis, Loes Olde M; Mangul, Serghei; Ori, Anil PS; Jospin, Guillaume; Koslicki, David; Yang, Harry Taegyun; Wu, Timothy; Boks, Marco P; Lomen-Hoerth, Catherine; Wiedau-Pazos, Martina Transcriptome analysis in whole blood reveals increased microbial diversity in schizophrenia Journal Article Translational Psychiatry, 8 (1), pp. 96, 2018. @article{loohuis2018transcriptome, title = {Transcriptome analysis in whole blood reveals increased microbial diversity in schizophrenia}, author = {Loes Olde M Loohuis and Serghei Mangul and Anil PS Ori and Guillaume Jospin and David Koslicki and Harry Taegyun Yang and Timothy Wu and Marco P Boks and Catherine Lomen-Hoerth and Martina Wiedau-Pazos}, url = {https://doi.org/10.1038/s41398-018-0107-9}, doi = {10.1038/s41398-018-0107-9}, year = {2018}, date = {2018-05-10}, journal = {Translational Psychiatry}, volume = {8}, number = {1}, pages = {96}, publisher = {Nature Publishing Group}, abstract = {The role of the human microbiome in health and disease is increasingly appreciated. We studied the composition of microbial communities present in blood across 192 individuals, including healthy controls and patients with three disorders affecting the brain: schizophrenia, amyotrophic lateral sclerosis, and bipolar disorder. By using high-quality unmapped RNA sequencing reads as candidate microbial reads, we performed profiling of microbial transcripts detected in whole blood. We were able to detect a wide range of bacterial and archaeal phyla in blood. Interestingly, we observed an increased microbial diversity in schizophrenia patients compared to the three other groups. We replicated this finding in an independent schizophrenia case–control cohort. 
This increased diversity is inversely correlated with estimated cell abundance of a subpopulation of CD8+ memory T cells in healthy controls, supporting a link between microbial products found in blood, immunity and schizophrenia.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Martin, Lana S; Eskin, Eleazar Involving undergraduates in genomics research to narrow the education–research gap Journal Article Nature Biotechnology, 36 (4), pp. 369, 2018. @article{mangul2018involving, title = {Involving undergraduates in genomics research to narrow the education--research gap}, author = {Serghei Mangul and Lana S Martin and Eleazar Eskin}, url = {https://doi.org/10.1038/nbt.4113}, doi = {10.1038/nbt.4113}, year = {2018}, date = {2018-04-05}, journal = {Nature Biotechnology}, volume = {36}, number = {4}, pages = {369}, publisher = {Nature Publishing Group}, abstract = {Engaging undergraduates in computational tasks can improve genomic research laboratory productivity, benefiting both students and senior laboratory members.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Yang, Harry Taegyun; Strauli, Nicolas; Gruhl, Franziska; Porath, Hagit T; Hsieh, Kevin; Chen, Linus; Daley, Timothy; Christenson, Stephanie; Wesolowska-Andersen, Agata ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues Journal Article Genome Biology, 19 (1), pp. 36, 2018. @article{mangul2018rop, title = {ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues}, author = {Serghei Mangul and Harry Taegyun Yang and Nicolas Strauli and Franziska Gruhl and Hagit T Porath and Kevin Hsieh and Linus Chen and Timothy Daley and Stephanie Christenson and Agata Wesolowska-Andersen}, url = {https://doi.org/10.1186/s13059-018-1403-7}, doi = {10.1186/s13059-018-1403-7}, year = {2018}, date = {2018-02-02}, journal = {Genome Biology}, volume = {19}, number = {1}, pages = {36}, publisher = {BioMed Central}, abstract = {High-throughput RNA-sequencing (RNA-seq) technologies provide an unprecedented opportunity to explore the individual transcriptome. Unmapped reads are a large and often overlooked output of standard RNA-seq analyses. Here, we present Read Origin Protocol (ROP), a tool for discovering the source of all reads originating from complex RNA molecules. We apply ROP to samples across 2630 individuals from 54 diverse human tissues. Our approach can account for 99.9% of 1 trillion reads of various read length. Additionally, we use ROP to investigate the functional mechanisms underlying connections between the immune system, microbiome, and disease. ROP is freely available at https://github.com/smangul1/rop/wiki.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Martin, Lana S; Hoffmann, Alexander; Pellegrini, Matteo; Eskin, Eleazar Addressing the digital divide in contemporary biology: lessons from teaching UNIX Journal Article Trends in Biotechnology, 35 (10), pp. 901–903, 2017. @article{mangul2017addressing, title = {Addressing the digital divide in contemporary biology: lessons from teaching UNIX}, author = {Serghei Mangul and Lana S Martin and Alexander Hoffmann and Matteo Pellegrini and Eleazar Eskin}, url = {https://doi.org/10.1016/j.tibtech.2017.06.007}, doi = {10.1016/j.tibtech.2017.06.007}, year = {2017}, date = {2017-07-15}, journal = {Trends in Biotechnology}, volume = {35}, number = {10}, pages = {901--903}, publisher = {Elsevier}, abstract = {Life and medical science researchers increasingly rely on applications that lack a graphical interface. Scientists who are not trained in computer science face an enormous challenge analyzing high-throughput data. We present a training model for use of command-line tools when the learner has little to no prior knowledge of UNIX.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Artyomenko, Alexander; Wu, Nicholas C; Mangul, Serghei; Eskin, Eleazar; Sun, Ren; Zelikovsky, Alex Long single-molecule reads can resolve the complexity of the influenza virus composed of rare, closely related mutant variants Journal Article Journal of Computational Biology, 24 (6), pp. 558–570, 2017. @article{artyomenko2017long, title = {Long single-molecule reads can resolve the complexity of the influenza virus composed of rare, closely related mutant variants}, author = {Alexander Artyomenko and Nicholas C Wu and Serghei Mangul and Eleazar Eskin and Ren Sun and Alex Zelikovsky}, url = {https://doi.org/10.1089/cmb.2016.0146}, doi = {10.1089/cmb.2016.0146}, year = {2017}, date = {2017-07-01}, journal = {Journal of Computational Biology}, volume = {24}, number = {6}, pages = {558--570}, publisher = {Mary Ann Liebert, Inc. 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA}, abstract = {As a result of a high rate of mutations and recombination events, an RNA-virus exists as a heterogeneous “swarm” of mutant variants. The long read length offered by single-molecule sequencing technologies allows each mutant variant to be sequenced in a single pass. However, high error rate limits the ability to reconstruct heterogeneous viral population composed of rare, related mutant variants. In this article, we present two single-nucleotide variants (2SNV), a method able to tolerate the high error rate of the single-molecule protocol and reconstruct mutant variants. 2SNV uses linkage between single-nucleotide variations to efficiently distinguish them from read errors. To benchmark the sensitivity of 2SNV, we performed a single-molecule sequencing experiment on a sample containing a titrated level of known viral mutant variants. Our method is able to accurately reconstruct clone with frequency of 0.2% and distinguish clones that differed in only two nucleotides distantly located on the genome. 
2SNV outperforms existing methods for full-length viral mutant reconstruction.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Yang, Harry Taegyun; Hormozdiari, Farhad; Dainis, Alexandra Marie; Tseng, Elizabeth; Ashley, Euan A; Zelikovsky, Alex; Eskin, Eleazar HapIso: an accurate method for the haplotype-specific isoforms reconstruction from long single-molecule reads Journal Article IEEE Transactions on Nanobioscience, 16 (2), pp. 108–115, 2017. @article{mangul2017hapiso, title = {HapIso: an accurate method for the haplotype-specific isoforms reconstruction from long single-molecule reads}, author = {Serghei Mangul and Harry Taegyun Yang and Farhad Hormozdiari and Alexandra Marie Dainis and Elizabeth Tseng and Euan A Ashley and Alex Zelikovsky and Eleazar Eskin}, url = {https://doi.org/10.1109/TNB.2017.2675981}, doi = {10.1109/TNB.2017.2675981}, year = {2017}, date = {2017-03-17}, journal = {IEEE Transactions on Nanobioscience}, volume = {16}, number = {2}, pages = {108--115}, publisher = {IEEE}, abstract = {Sequencing of RNA provides the possibility to study an individual's transcriptome landscape and determine allelic expression ratios. Single-molecule protocols generate multi-kilobase reads longer than most transcripts, allowing sequencing of complete haplotype isoforms. This allows partitioning the reads into two parental haplotypes. While the read length of the single-molecule protocols is long, the relatively high error rate limits the ability to accurately detect the genetic variants and assemble them into the haplotype-specific isoforms. In this paper, we present Haplotype-specific Isoform reconstruction (HapIso), a method able to tolerate the relatively high error rate of the single-molecule platform and partition the isoform reads into the parental alleles. Phasing the reads according to the allele of origin allows our method to efficiently distinguish between the read errors and the true biological mutations. 
HapIso uses a k-means clustering algorithm aiming to group the reads into two meaningful clusters maximizing the similarity of the reads within the cluster and minimizing the similarity of the reads from different clusters. Each cluster corresponds to a parental haplotype. We used the family pedigree information to evaluate our approach. Experimental validation suggests that HapIso is able to tolerate the relatively high error rate and accurately partition the reads into the parental alleles of the isoform transcripts. We also applied HapIso to novel clinical single-molecule RNA-Seq data to estimate allele-specific expression of genes of interest. Our method was able to correct reads and determine Glu1883Lys point mutation of clinical significance validated by GeneDx HCM panel. Furthermore, our method is the first method able to reconstruct the haplotype-specific isoforms from long single-molecule reads}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Driesche, Sarah Van; Martin, Lana S; Martin, Kelsey C; Eskin, Eleazar UMI-Reducer: Collapsing duplicate sequencing reads via Unique Molecular Identifiers Journal Article bioRxiv, 2017. @article{mangul2017umi, title = {UMI-Reducer: Collapsing duplicate sequencing reads via Unique Molecular Identifiers}, author = {Serghei Mangul and Sarah Van Driesche and Lana S Martin and Kelsey C Martin and Eleazar Eskin}, url = {https://doi.org/10.1101/103267}, doi = {10.1101/103267}, year = {2017}, date = {2017-01-25}, journal = {bioRxiv}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Every sequencing library contains duplicate reads. While many duplicates arise during polymerase chain reaction (PCR), some duplicates derive from multiple identical fragments of mRNA present in the original lysate (termed “biological duplicates”). Unique Molecular Identifiers (UMIs) are random oligonucleotide sequences that allow differentiation between technical and biological duplicates. Here we report the development of UMI-Reducer, a new computational tool for processing and differentiating PCR duplicates from biological duplicates. UMI-Reducer uses UMIs and the mapping position of the read to identify and collapse reads that are technical duplicates. Remaining true biological reads are further used for bias-free estimate of mRNA abundance in the original lysate. This strategy is of particular use for libraries made from low amounts of starting material, which typically require additional cycles of PCR and therefore are most prone to PCR duplicate bias.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
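The collapsing idea in the UMI-Reducer abstract (reads that share both a mapping position and a UMI are treated as technical duplicates) can be illustrated with a minimal sketch. This is not the UMI-Reducer implementation: the `Read` record, its field names, and the `collapse_technical_duplicates` function are hypothetical, and exact matching of position, strand, and UMI is an assumed simplification.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    # Hypothetical read record, for illustration only.
    name: str
    chrom: str
    pos: int      # leftmost mapping position
    strand: str   # "+" or "-"
    umi: str      # UMI sequence attached to the read


def collapse_technical_duplicates(reads):
    """Keep one read per (chrom, pos, strand, UMI) group.

    Reads sharing both mapping position and UMI are treated as PCR
    (technical) duplicates. Reads at the same position with different
    UMIs are kept as distinct molecules (biological duplicates).
    """
    groups = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand, read.umi)].append(read)
    # The first read seen in each group is kept as its representative.
    return [members[0] for members in groups.values()]


reads = [
    Read("r1", "chr1", 100, "+", "ACGT"),
    Read("r2", "chr1", 100, "+", "ACGT"),  # same position and UMI as r1
    Read("r3", "chr1", 100, "+", "TTAG"),  # same position, different UMI
    Read("r4", "chr2", 500, "-", "ACGT"),
]
kept = collapse_technical_duplicates(reads)
print([r.name for r in kept])  # ['r1', 'r3', 'r4']
```

Grouping on an exact key keeps the sketch short; a real tool may also need to tolerate sequencing errors within the UMI itself, which this example does not attempt.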
Thomas, Brandon; Karimzada, Mohammed; Spreafico, Roberto; Mangul, Serghei; Botten, Jason W; Rotman, Jeremy; Wesel, Kevin; Binder, Pratibha S; Gharavi, Nima; Chesnut, Robert W 104 Lack of human papilloma virus transcription in cutaneous squamous cell carcinoma stratified by histological grade and host immune status Journal Article Journal of Investigative Dermatology, 137 (5), pp. S18, 2017. @article{thomas2017104, title = {104 Lack of human papilloma virus transcription in cutaneous squamous cell carcinoma stratified by histological grade and host immune status}, author = {Brandon Thomas and Mohammed Karimzada and Roberto Spreafico and Serghei Mangul and Jason W Botten and Jeremy Rotman and Kevin Wesel and Pratibha S Binder and Nima Gharavi and Robert W Chesnut}, url = {https://doi.org/10.1016/j.jid.2017.02.118}, doi = {10.1016/j.jid.2017.02.118}, year = {2017}, date = {2017-01-01}, journal = {Journal of Investigative Dermatology}, volume = {137}, number = {5}, pages = {S18}, publisher = {Elsevier}, abstract = {Human Papilloma Virus (HPV) infection is known to contribute to mucosal (m)SCC, but its role in cutaneous (c)SCC progression remains unclear, especially in lesions determined to be at high-risk for metastasis. We hypothesized that histologically high grade cSCCs in immunosuppressed patients would display increased transcriptional activity of HPV when compared to low histologic grade lesions in otherwise healthy patients. To assess the role of viruses in cSCC pathogenesis we utilized high throughput RNA sequencing across risk-stratified lesions. A total of 22 skin excisions (11 classified as high grade in immunocompromised patients, 8 classified as low grade in otherwise healthy patients, and 3 as normal skin) were used for detection of any non-human RNA. Reads were aligned to known viral transcriptomes using our recently developed Microbiome Coverage Profiler. 
While approximately two-thirds of all samples tested positive for HPV gDNA, no skin sample had detectable expression of HPV RNA. Instead, many were found to have expression of Human Endogenous Retroviruses, Simian Virus 40, and Staphylococcus Prophages, while analysis of published datasets of sequenced HeLa cells demonstrated numerous RNA reads for HPV. These results suggest that either HPV does not participate in cSCC development, or facilitates cSCC initiation without effecting tumor progression. The ability to monitor viral and prophage gene expression in skin biopsies will provide insights into the interplay of host-pathogen interactions, and the framework described herein can be used to analyze skin biopsies to facilitate understanding in cases where pathogens are thought to contribute to disease pathogenesis.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Kang, Eun Yong; Martin, Lisa J; Mangul, Serghei; Isvilanonda, Warin; Zou, Jennifer; Ben-David, Eyal; Han, Buhm; Lusis, Aldons J; Shifman, Sagiv; Eskin, Eleazar Discovering single nucleotide polymorphisms regulating human gene expression using allele specific expression from RNA-seq data Journal Article Genetics, 204 (3), pp. 1057–1064, 2016. @article{kang2016discovering, title = {Discovering single nucleotide polymorphisms regulating human gene expression using allele specific expression from RNA-seq data}, author = {Eun Yong Kang and Lisa J Martin and Serghei Mangul and Warin Isvilanonda and Jennifer Zou and Eyal Ben-David and Buhm Han and Aldons J Lusis and Sagiv Shifman and Eleazar Eskin}, url = {https://doi.org/10.1534/genetics.115.177246}, doi = {10.1534/genetics.115.177246}, year = {2016}, date = {2016-11-01}, journal = {Genetics}, volume = {204}, number = {3}, pages = {1057--1064}, publisher = {Genetics Soc America}, abstract = {The study of the genetics of gene expression is of considerable importance to understanding the nature of common, complex diseases. The most widely applied approach to identifying relationships between genetic variation and gene expression is the expression quantitative trait loci (eQTL) approach. Here, we increased the computational power of eQTL with an alternative and complementary approach based on analyzing allele specific expression (ASE). We designed a novel analytical method to identify cis-acting regulatory variants based on genome sequencing and measurements of ASE from RNA-sequencing (RNA-seq) data. We evaluated the power and resolution of our method using simulated data. We then applied the method to map regulatory variants affecting gene expression in lymphoblastoid cell lines (LCLs) from 77 unrelated northern and western European individuals (CEU), which were part of the HapMap project. A total of 2309 SNPs were identified as being associated with ASE patterns. 
The SNPs associated with ASE were enriched within promoter regions and were significantly more likely to signal strong evidence for a regulatory role. Finally, among the candidate regulatory SNPs, we identified 108 SNPs that were previously associated with human immune diseases. With further improvements in quantifying ASE from RNA-seq, the application of our method to other datasets is expected to accelerate our understanding of the biological basis of common diseases.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Koslicki, David Reference-free comparison of microbial communities via de Bruijn graphs Inproceedings Bioinformatics, Computational Biology and Biomedicine 2016: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 68-77, Association for Computing Machinery 2016. @inproceedings{mangul2016reference, title = {Reference-free comparison of microbial communities via de Bruijn graphs}, author = {Serghei Mangul and David Koslicki}, url = {https://doi.org/10.1145/2975167.2975174}, doi = {10.1145/2975167.2975174}, year = {2016}, date = {2016-10-20}, booktitle = {Bioinformatics, Computational Biology and Biomedicine 2016: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics}, pages = {68-77}, organization = {Association for Computing Machinery}, abstract = {Microbial communities inhabiting the human body exhibit significant variability across different individuals and tissues, and are suggested to play an important role in health and disease. High-throughput sequencing offers unprecedented possibilities to profile microbial community composition, but limitations of existing taxonomic classification methods (including incompleteness of existing microbial reference databases) limits the ability to accurately compare microbial communities across different samples. In this paper, we present a method able to overcome these limitations by circumventing the classification step and directly using the sequencing data to compare microbial communities. The proposed method provides a powerful reference-free way to assess differences in microbial abundances across samples. This method, called EMDeBruijn, condenses the sequencing data into a de Bruijn graph. 
The Earth Mover's Distance (EMD) is then used to measure similarities and differences of the microbial communities associated with the individual graphs. We apply this method to RNA-Seq data sets from a coronary artery calcification (CAC) study and shown that EMDeBruijn is able to differentiate between case and control CAC samples while utilizing all the candidate microbial reads. We compare these results to current reference-based methods, which are shown to have a limited capacity to discriminate between case and control samples. We conclude that this reference-free approach is a viable choice in comparative metatranscriptomic studies.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
Mangul, Serghei; Wu, Nicholas C; Nenastyeva, Ekaterina; Mancuso, Nicholas; Zelikovsky, Alex; Sun, Ren; Eskin, Eleazar Applications of High‐Fidelity Sequencing Protocol to RNA Viruses Book Chapter Mӑndoiu, Ion I; Zelikovsky, Alex (Ed.): Computational Methods for Next Generation Sequencing Data Analysis, pp. 85-104, Wiley Online Library, 2016. Abstract | Links | BibTeX | Altmetric @inbook{canzar2016computational, title = {Applications of High‐Fidelity Sequencing Protocol to RNA Viruses}, author = {Serghei Mangul and Nicholas C Wu and Ekaterina Nenastyeva and Nicholas Mancuso and Alex Zelikovsky and Ren Sun and Eleazar Eskin}, editor = {Ion I Mӑndoiu and Alex Zelikovsky}, url = {https://doi.org/10.1002/9781119272182.ch4}, doi = {10.1002/9781119272182.ch4}, year = {2016}, date = {2016-08-26}, booktitle = {Computational Methods for Next Generation Sequencing Data Analysis}, journal = {Computational Methods for Next Generation Sequencing Data Analysis}, pages = {85-104}, publisher = {Wiley Online Library}, abstract = {This chapter describes the used high‐fidelity sequencing protocol, and introduces the approach for viral genome assembly (VGA) based on high‐fidelity sequencing data. It presents the results of performance of VGA and some other viral assemblers on simulated data, describes the performance of VGA on real HIV data. The chapter compares different aligners to investigate the effect of their alignment on mapping statistics. Post‐sequencing error correction techniques are available for reads obtained by regular protocol offering the possibility to partially correct sequencing errors trading off for real biological mutations. HCV virus exhibits more complex genomic architecture with lower population diversity and longer conserved regions than HIV. QuasiRecomb is designed to handle paired‐end read data and manages to produce full‐length viral genomes. 
The chapter discusses the application of the high‐fidelity protocol that is the evaluation of error correction methods for next‐generation sequencing (NGS) reads.}, keywords = {}, pubstate = {published}, tppubtype = {inbook} }
Glebova, Olga; Temate-Tiagueu, Yvette; Caciula, Adrian; Seesi, Sahar Al; Artyomenko, Alexander; Mangul, Serghei; Lindsay, James; Mӑndoiu, Ion I; Zelikovsky, Alex Transcriptome Quantification and Differential Expression from NGS Data Book Chapter Mӑndoiu, Ion I; Zelikovsky, Alex (Ed.): Computational Methods for Next Generation Sequencing Data Analysis, pp. 301-327, Wiley Online Library, 2016. Abstract | Links | BibTeX | Altmetric @inbook{glebova2016transcriptome, title = {Transcriptome Quantification and Differential Expression from NGS Data}, author = {Olga Glebova and Yvette Temate-Tiagueu and Adrian Caciula and Sahar Al Seesi and Alexander Artyomenko and Serghei Mangul and James Lindsay and Ion I Mӑndoiu and Alex Zelikovsky}, editor = {Ion I Mӑndoiu and Alex Zelikovsky}, url = {https://doi.org/10.1002/9781119272182.ch14}, doi = {10.1002/9781119272182.ch14}, year = {2016}, date = {2016-08-24}, booktitle = {Computational Methods for Next Generation Sequencing Data Analysis}, journal = {Computational Methods for Next Generation Sequencing Data Analysis}, pages = {301-327}, publisher = {Wiley Online Library}, abstract = {Transcriptome quantification analysis is crucial to determine similar transcripts or unraveling gene functions and transcription regulation mechanisms. This chapter presents a novel simulated regression‐based method for isoform frequency estimation from RNA‐Seq reads. It presents SimReg, a novel regression‐based algorithm for transcriptome quantification. Simulated data experiments demonstrate superior frequency estimation accuracy of SimReg comparatively to that of the existing tools, which tend to skew the estimated frequency toward supertranscripts. Gene expression is the process by which the genetic code (the nucleotide sequence) of a gene becomes a useful product. 
Important factors to consider while analyzing differentially expressed genes are normalization, accuracy of differential expression detection, and differential expression analysis when one condition has no detectable expression. RNA‐Seq is an increasingly popular approach to transcriptome profiling that uses the capabilities of next‐generation sequencing (NGS) technologies and provides better measurement of levels of transcripts and their isoforms.}, keywords = {}, pubstate = {published}, tppubtype = {inbook} }
Schokrpur, Shiruyeh; Hu, Junhui; Moughon, Diana L; Liu, Peijun; Lin, Lucia C; Hermann, Kip; Mangul, Serghei; Guan, Wei; Pellegrini, Matteo; Xu, Hua CRISPR-mediated VHL knockout generates an improved model for metastatic renal cell carcinoma Journal Article Scientific Reports, 6 (29032), pp. 29032, 2016. Abstract | Links | BibTeX | Altmetric @article{schokrpur2016crispr, title = {CRISPR-mediated VHL knockout generates an improved model for metastatic renal cell carcinoma}, author = {Shiruyeh Schokrpur and Junhui Hu and Diana L Moughon and Peijun Liu and Lucia C Lin and Kip Hermann and Serghei Mangul and Wei Guan and Matteo Pellegrini and Hua Xu}, url = {https://doi.org/10.1038/srep29032}, doi = {10.1038/srep29032}, year = {2016}, date = {2016-07-30}, journal = {Scientific Reports}, volume = {6}, number = {29032}, pages = {29032}, publisher = {Nature Publishing Group}, abstract = {Metastatic renal cell carcinoma (mRCC) is nearly incurable and accounts for most of the mortality associated with RCC. Von Hippel Lindau (VHL) is a tumour suppressor that is lost in the majority of clear cell RCC (ccRCC) cases. Its role in regulating hypoxia-inducible factors-1α (HIF-1α) and -2α (HIF-2α) is well-studied. Recent work has demonstrated that VHL knock down induces an epithelial-mesenchymal transition (EMT) phenotype. In this study we showed that a CRISPR/Cas9-mediated knock out of VHL in the RENCA model leads to morphologic and molecular changes indicative of EMT, which in turn drives increased metastasis to the lungs. RENCA cells deficient in HIF-1α failed to undergo EMT changes upon VHL knockout. RNA-seq revealed several HIF-1α-regulated genes that are upregulated in our VHL knockout cells and whose overexpression signifies an aggressive form of ccRCC in the cancer genome atlas (TCGA) database. Independent validation in a new clinical dataset confirms the upregulation of these genes in ccRCC samples compared to adjacent normal tissue. 
Our findings indicate that loss of VHL could be driving tumour cell dissemination through stabilization of HIF-1α in RCC. A better understanding of the mechanisms involved in this phenomenon can guide the search for more effective treatments to combat mRCC.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Wu, Nicholas C; Mancuso, Nicholas; Zelikovsky, Alex; Sun, Ren; Eskin, Eleazar VGA: a method for viral quasispecies assembly from ultra-deep sequencing data Inproceedings 2014 IEEE 4th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pp. 1–1, IEEE 2014. @inproceedings{mangul2014vga, title = {VGA: a method for viral quasispecies assembly from ultra-deep sequencing data}, author = {Serghei Mangul and Nicholas C Wu and Nicholas Mancuso and Alex Zelikovsky and Ren Sun and Eleazar Eskin}, url = {https://doi.org/10.1109/ICCABS.2014.6863932}, doi = {10.1109/ICCABS.2014.6863932}, year = {2014}, date = {2014-09-22}, booktitle = {2014 IEEE 4th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS)}, pages = {1--1}, organization = {IEEE}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
Mangul, Serghei; Caciula, Adrian; Seesi, Sahar Al; Brinza, Dumitru; Mӑndoiu, Ion I; Zelikovsky, Alex Transcriptome assembly and quantification from Ion Torrent RNA-Seq data Journal Article BMC Genomics, 15 (5), pp. S7, 2014. Abstract | Links | BibTeX | Altmetric @article{mangul2014transcriptome, title = {Transcriptome assembly and quantification from Ion Torrent RNA-Seq data}, author = {Serghei Mangul and Adrian Caciula and Sahar Al Seesi and Dumitru Brinza and Ion I Mӑndoiu and Alex Zelikovsky}, url = {https://doi.org/10.1186/1471-2164-15-S5-S7}, doi = {10.1186/1471-2164-15-S5-S7}, year = {2014}, date = {2014-07-14}, journal = {BMC Genomics}, volume = {15}, number = {5}, pages = {S7}, publisher = {BioMed Central}, abstract = {Background High throughput RNA sequencing (RNA-Seq) can generate whole transcriptome information at the single transcript level providing a powerful tool with multiple interrelated applications including transcriptome reconstruction and quantification. The sequences of novel transcripts can be reconstructed from deep RNA-Seq data, but this is computationally challenging due to sequencing errors, uneven coverage of expressed transcripts, and the need to distinguish between highly similar transcripts produced by alternative splicing. Another challenge in transcriptomic analysis comes from the ambiguities in mapping reads to transcripts. Results We present MaLTA, a method for simultaneous transcriptome assembly and quantification from Ion Torrent RNA-Seq data. Our approach explores transcriptome structure and incorporates a maximum likelihood model into the assembly and quantification procedure. A new version of the IsoEM algorithm suitable for Ion Torrent RNA-Seq reads is used to accurately estimate transcript expression levels. 
The MaLTA-IsoEM tool is publicly available at: http://alan.cs.gsu.edu/NGS/?q=malta Conclusions Experimental results on both synthetic and real datasets show that Ion Torrent RNA-Seq data can be successfully used for transcriptome analyses. Experimental results suggest increased transcriptome assembly and quantification accuracy of MaLTA-IsoEM solution compared to existing state-of-the-art approaches.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Wu, Nicholas C; Mancuso, Nicholas; Zelikovsky, Alex; Sun, Ren; Eskin, Eleazar Accurate viral population assembly from ultra-deep sequencing data Journal Article Bioinformatics, 30 (12), pp. i329–i337, 2014. Abstract | Links | BibTeX | Altmetric @article{mangul2014accurate, title = {Accurate viral population assembly from ultra-deep sequencing data}, author = {Serghei Mangul and Nicholas C Wu and Nicholas Mancuso and Alex Zelikovsky and Ren Sun and Eleazar Eskin}, url = {https://doi.org/10.1093/bioinformatics/btu295}, doi = {10.1093/bioinformatics/btu295}, year = {2014}, date = {2014-06-11}, journal = {Bioinformatics}, volume = {30}, number = {12}, pages = {i329--i337}, publisher = {Oxford University Press}, abstract = {Motivation: Next-generation sequencing technologies sequence viruses with ultra-deep coverage, thus promising to revolutionize our understanding of the underlying diversity of viral populations. While the sequencing coverage is high enough that even rare viral variants are sequenced, the presence of sequencing errors makes it difficult to distinguish between rare variants and sequencing errors. Results: In this article, we present a method to overcome the limitations of sequencing technologies and assemble a diverse viral population that allows for the detection of previously undiscovered rare variants. The proposed method consists of a high-fidelity sequencing protocol and an accurate viral population assembly method, referred to as Viral Genome Assembler (VGA). The proposed protocol is able to eliminate sequencing errors by using individual barcodes attached to the sequencing fragments. Highly accurate data in combination with deep coverage allow VGA to assemble rare variants. VGA uses an expectation–maximization algorithm to estimate abundances of the assembled viral variants in the population. 
Results on both synthetic and real datasets show that our method is able to accurately assemble an HIV viral population and detect rare variants previously undetectable due to sequencing errors. VGA outperforms state-of-the-art methods for genome-wide viral assembly. Furthermore, our method is the first viral assembly method that scales to millions of sequencing reads. Availability: Our tool VGA is freely available at http://genetics.cs.ucla.edu/vga/}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Caciula, Adrian; Glebova, Olga; Mӑndoiu, Ion I; Zelikovsky, Alex Improved transcriptome quantification and reconstruction from RNA-Seq reads using partial annotations Journal Article In Silico Biology, 11 (5, 6), pp. 251–261, 2011. @article{mangul2011improved, title = {Improved transcriptome quantification and reconstruction from RNA-Seq reads using partial annotations}, author = {Serghei Mangul and Adrian Caciula and Olga Glebova and Ion I Mӑndoiu and Alex Zelikovsky}, url = {https://doi.org/10.3233/ISB-2012-0459}, doi = {10.3233/ISB-2012-0459}, year = {2011}, date = {2011-11-12}, journal = {In Silico Biology}, volume = {11}, number = {5, 6}, pages = {251--261}, publisher = {IOS press}, abstract = {The paper addresses the problem of how to use RNA-Seq data for transcriptome reconstruction and quantification, as well as novel transcript discovery in partially annotated genomes. We present a novel annotation-guided general framework for transcriptome discovery, reconstruction and quantification in partially annotated genomes and compare it with existing annotation-guided and genome-guided transcriptome assembly methods. Our method, referred to as Discovery and Reconstruction of Unannotated Transcripts (DRUT), can be used to enhance existing transcriptome assemblers, such as Cufflinks [3], as well as to accurately estimate the transcript frequencies. Empirical analysis on synthetic datasets confirms that Cufflinks enhanced by DRUT has superior quality of reconstruction and frequency estimation of transcripts.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Astrovskaya, Irina; Tork, Bassam; Mangul, Serghei; Westbrooks, Kelly; Mӑndoiu, Ion I; Balfe, Peter; Zelikovsky, Alex Inferring viral quasispecies spectra from 454 pyrosequencing reads Inproceedings BMC Bioinformatics, pp. S1, BioMed Central 2011. @inproceedings{astrovskaya2011inferring, title = {Inferring viral quasispecies spectra from 454 pyrosequencing reads}, author = {Irina Astrovskaya and Bassam Tork and Serghei Mangul and Kelly Westbrooks and Ion I Mӑndoiu and Peter Balfe and Alex Zelikovsky}, url = {https://doi.org/10.1186/1471-2105-12-S6-S1}, doi = {10.1186/1471-2105-12-S6-S1}, year = {2011}, date = {2011-07-28}, booktitle = {BMC Bioinformatics}, volume = {12}, number = {6}, pages = {S1}, organization = {BioMed Central}, abstract = {Background RNA viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but unfortunately standard assembly software was originally designed for single genome assembly and cannot be used to simultaneously assemble and estimate the abundance of multiple closely related quasispecies sequences. Results In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Experimental results show that ViSpA outperforms ShoRAH on simulated error-free reads, correctly assembling 10 out of 10 quasispecies and 29 sequences out of 40 quasispecies. 
While ShoRAH has a significant advantage over ViSpA on reads simulated with sequencing errors due to its advanced error correction algorithm, ViSpA is better at assembling the simulated reads after they have been corrected by ShoRAH. ViSpA also outperforms ShoRAH on real 454 reads. Indeed, 7 most frequent sequences reconstructed by ViSpA from a real HCV dataset are viable (do not contain internal stop codons), and the most frequent sequence was within 1% of the actual open reading frame obtained by cloning and Sanger sequencing. In contrast, only one of the sequences reconstructed by ShoRAH is viable. On a real HIV dataset, ShoRAH correctly inferred only 2 quasispecies sequences with at most 4 mismatches whereas ViSpA correctly reconstructed 5 quasispecies with at most 2 mismatches, and 2 out of 5 sequences were inferred without any mismatches. ViSpA source code is available at http://alla.cs.gsu.edu/~software/VISPA/vispa.html. Conclusions ViSpA enables accurate viral quasispecies spectrum reconstruction from 454 pyrosequencing reads. We are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.}, keywords = {}, pubstate = {published}, tppubtype = {inproceedings} }
Nicolae, Marius; Mangul, Serghei; Mӑndoiu, Ion I; Zelikovsky, Alex Estimation of alternative splicing isoform frequencies from RNA-Seq data Journal Article Algorithms for Molecular Biology, 6 (1), pp. 9, 2011. Abstract | Links | BibTeX | Altmetric @article{nicolae2011estimation, title = {Estimation of alternative splicing isoform frequencies from RNA-Seq data}, author = {Marius Nicolae and Serghei Mangul and Ion I Mӑndoiu and Alex Zelikovsky}, url = {https://doi.org/10.1186/1748-7188-6-9}, doi = {10.1186/1748-7188-6-9}, year = {2011}, date = {2011-04-19}, journal = {Algorithms for Molecular Biology}, volume = {6}, number = {1}, pages = {9}, publisher = {BioMed Central}, abstract = {Background Massively parallel whole transcriptome sequencing, commonly referred as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read length delivered by current sequencing technologies, estimation of expression levels for alternative splicing gene isoforms remains challenging. Results In this paper we present a novel expectation-maximization algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data. Our algorithm, referred to as IsoEM, is based on disambiguating information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand and read pairing information when available. The open source Java implementation of IsoEM is freely available at http://dna.engr.uconn.edu/software/IsoEM/. Conclusions Empirical experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation. 
Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.}, keywords = {}, pubstate = {published}, tppubtype = {article} }

Lana Martin, Ph.D.
Project Specialist
lanamart@usc.edu • View Lana’s papers
Mangul Lab papers authored by Lana
Loeffler, Caitlin; Karlsberg, Aaron; Martin, Lana S; Eskin, Eleazar; Koslicki, David; Mangul, Serghei Improving the usability and comprehensiveness of microbial databases Journal Article BMC Biology, 18 (37), 2020. @article{microbial2020, title = {Improving the usability and comprehensiveness of microbial databases}, author = {Caitlin Loeffler and Aaron Karlsberg and Lana S Martin and Eleazar Eskin and David Koslicki and Serghei Mangul}, url = {https://doi.org/10.1186/s12915-020-0756-z}, doi = {10.1186/s12915-020-0756-z}, year = {2020}, date = {2020-04-07}, journal = {BMC Biology}, volume = {18}, number = {37}, abstract = {Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehensiveness, and calling for the establishment of recommendations for reference genome database assembly. Here, we analyze existing fungal and bacterial databases and discuss guidelines for the development of a master reference database that promises to improve the quality and quantity of omics research.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei. Benchmarking of computational error-correction methods for next-generation sequencing data. Journal Article. Genome Biology, 21 (71), 2020. https://doi.org/10.1186/s13059-020-01988-3
Abstract: Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.
Brito, Jaqueline J; Mosqueiro, Thiago; Rotman, Jeremy; Xue, Victor; Chapski, Douglas J; De la Hoz, Juan; Matias, Paulo; Martin, Lana S; Zelikovsky, Alex; Pellegrini, Matteo; Mangul, Serghei. Telescope: an interactive tool for managing large scale analysis from mobile devices. Journal Article. GigaScience, 9 (1), pp. giz163, 2020. https://doi.org/10.1093/gigascience/giz163
Abstract: Background. In today's world of big data, computational analysis has become a key driver of biomedical research. High-performance computational facilities are capable of processing considerable volumes of data, yet often lack an easy-to-use interface to guide the user in supervising and adjusting bioinformatics analysis via a tablet or smartphone. Results. To address this gap we proposed Telescope, a novel tool that interfaces with high-performance computational clusters to deliver an intuitive user interface for controlling and monitoring bioinformatics analyses in real-time. By leveraging last generation technology now ubiquitous to most researchers (such as smartphones), Telescope delivers a friendly user experience and manages connectivity and encryption under the hood. Conclusions. Telescope helps to mitigate the digital divide between wet and computational laboratories in contemporary biology. By delivering convenience and ease of use through a user experience not relying on expertise with computational clusters, Telescope can help researchers close the feedback loop between bioinformatics and experimental work with minimal impact on the performance of computational tools. Telescope is freely available at https://github.com/Mangul-Lab-USC/telescope.
Loeffler, Caitlin; Gibson, Keylie M; Martin, Lana S; Chang, Yutong; Rotman, Jeremy; Toma, Ian V; Mason, Christopher E; Eskin, Eleazar; Zackular, Joseph P; Crandall, Keith A; Koslicki, David; Mangul, Serghei. Metagenomics for clinical diagnostics: technologies and informatics. Journal Article. arXiv, 2019. https://arxiv.org/abs/1911.11304
Abstract: The human-associated microbiome is closely tied to human health and is of substantial clinical interest. Metagenomics-based tools are emerging for clinical diagnostics, tracking the spread of diseases, and surveillance of potential pathogens. In some cases, these tools are overcoming limitations of traditional clinical approaches. Metagenomics has limitations barring the tools from clinical validation. Once these hurdles are overcome, clinical metagenomics will inform doctors of the best, targeted treatment for their patients and provide early detection of disease. Here we present an overview of metagenomics methods with a discussion of computational challenges and limitations.
Mangul, Serghei; Martin, Lana S; Hill, Brian L; Lam, Angela Ka-Mei; Distler, Margaret G; Zelikovsky, Alex; Eskin, Eleazar; Flint, Jonathan. Systematic benchmarking of omics computational tools. Journal Article. Nature Communications, 10 (1393), pp. 1-11, 2019. https://doi.org/10.1038/s41467-019-09406-4
Abstract: Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results.
Mangul, Serghei; Martin, Lana S; Langmead, Ben; Sanchez-Galan, Javier E; Toma, Ian V; Hormozdiari, Fereydoun; Pevzner, Pavel; Eskin, Eleazar. How bioinformatics and open data can boost basic science in countries and universities with limited resources. Journal Article. Nature Biotechnology, 37 (3), pp. 324, 2019. https://doi.org/10.1038/s41587-019-0053-y
Abstract: Providing training and access to standard computing hardware and cloud-based resources can enable scientists in lower-resource institutions and countries to reanalyze published ‘-omics’ data and produce career-enhancing STEM research.
Mangul, Serghei; Martin, Lana S; Eskin, Eleazar; Blekhman, Ran. Improving the usability and archival stability of bioinformatics software. Journal Article. Genome Biology, 20 (47), pp. 1-3, 2019. https://doi.org/10.1186/s13059-019-1649-8
Abstract: Implementation of bioinformatics software involves numerous unique challenges; a rigorous standardized approach is needed to examine software tools prior to their publication.
Mangul, Serghei; Martin, Lana S; Eskin, Eleazar. Involving undergraduates in genomics research to narrow the education–research gap. Journal Article. Nature Biotechnology, 36 (4), pp. 369, 2018. https://doi.org/10.1038/nbt.4113
Abstract: Engaging undergraduates in computational tasks can improve genomic research laboratory productivity, benefiting both students and senior laboratory members.
Mangul, Serghei; Martin, Lana S; Hoffmann, Alexander; Pellegrini, Matteo; Eskin, Eleazar. Addressing the digital divide in contemporary biology: lessons from teaching UNIX. Journal Article. Trends in Biotechnology, 35 (10), pp. 901–903, 2017. https://doi.org/10.1016/j.tibtech.2017.06.007
Abstract: Life and medical science researchers increasingly rely on applications that lack a graphical interface. Scientists who are not trained in computer science face an enormous challenge analyzing high-throughput data. We present a training model for use of command-line tools when the learner has little to no prior knowledge of UNIX.
Mangul, Serghei; Van Driesche, Sarah; Martin, Lana S; Martin, Kelsey C; Eskin, Eleazar. UMI-Reducer: Collapsing duplicate sequencing reads via Unique Molecular Identifiers. Journal Article. bioRxiv, 2017. https://doi.org/10.1101/103267
Abstract: Every sequencing library contains duplicate reads. While many duplicates arise during polymerase chain reaction (PCR), some duplicates derive from multiple identical fragments of mRNA present in the original lysate (termed “biological duplicates”). Unique Molecular Identifiers (UMIs) are random oligonucleotide sequences that allow differentiation between technical and biological duplicates. Here we report the development of UMI-Reducer, a new computational tool for processing and differentiating PCR duplicates from biological duplicates. UMI-Reducer uses UMIs and the mapping position of the read to identify and collapse reads that are technical duplicates. The remaining true biological reads are then used for a bias-free estimate of mRNA abundance in the original lysate. This strategy is of particular use for libraries made from low amounts of starting material, which typically require additional cycles of PCR and therefore are most prone to PCR duplicate bias.

Jacqueline (Jaque) Brito, Ph.D.
Postdoctoral Scholar
britoj@usc.edu • View Jaque’s papers
Mangul Lab papers authored by Jaque
Brito, Jaqueline J; Li, Jun; Moore, Jason H; Greene, Casey S; Nogoy, Nicole A; Garmire, Lana X; Mangul, Serghei. Recommendations to enhance rigor and reproducibility in biomedical research. Journal Article. GigaScience, 9 (6), pp. giaa056, 2020. https://doi.org/10.1093/gigascience/giaa056
Abstract: Biomedical research depends increasingly on computational tools, but mechanisms ensuring open data, open software, and reproducibility are variably enforced by academic institutions, funders, and publishers. Publications may present software for which source code or documentation are or become unavailable; this compromises the role of peer review in evaluating technical strength and scientific contribution. Incomplete ancillary information for an academic software package may bias or limit subsequent work. We provide 8 recommendations to improve reproducibility, transparency, and rigor in computational biology, precisely the values that should be emphasized in life science curricula. Our recommendations for improving software availability, usability, and archival stability aim to foster a sustainable data science ecosystem in life science research.
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei. Benchmarking of computational error-correction methods for next-generation sequencing data. Journal Article. Genome Biology, 21 (71), 2020. https://doi.org/10.1186/s13059-020-01988-3
Abstract: Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.
Brito, Jaqueline J; Mosqueiro, Thiago; Rotman, Jeremy; Xue, Victor; Chapski, Douglas J; De la Hoz, Juan; Matias, Paulo; Martin, Lana S; Zelikovsky, Alex; Pellegrini, Matteo; Mangul, Serghei. Telescope: an interactive tool for managing large scale analysis from mobile devices. Journal Article. GigaScience, 9 (1), pp. giz163, 2020. https://doi.org/10.1093/gigascience/giz163
Abstract: Background. In today's world of big data, computational analysis has become a key driver of biomedical research. High-performance computational facilities are capable of processing considerable volumes of data, yet often lack an easy-to-use interface to guide the user in supervising and adjusting bioinformatics analysis via a tablet or smartphone. Results. To address this gap we proposed Telescope, a novel tool that interfaces with high-performance computational clusters to deliver an intuitive user interface for controlling and monitoring bioinformatics analyses in real-time. By leveraging last generation technology now ubiquitous to most researchers (such as smartphones), Telescope delivers a friendly user experience and manages connectivity and encryption under the hood. Conclusions. Telescope helps to mitigate the digital divide between wet and computational laboratories in contemporary biology. By delivering convenience and ease of use through a user experience not relying on expertise with computational clusters, Telescope can help researchers close the feedback loop between bioinformatics and experimental work with minimal impact on the performance of computational tools. Telescope is freely available at https://github.com/Mangul-Lab-USC/telescope.
Mangul, Serghei; Mosqueiro, Thiago; Abdill, Richard J; Duong, Dat; Mitchell, Keith; Sarwal, Varuni; Hill, Brian L; Brito, Jaqueline J; Littman, Russell Jared; Statz, Benjamin. Challenges and recommendations to improve the installability and archival stability of omics computational tools. Journal Article. PLoS Biology, 17 (6), pp. e3000333, 2019. https://doi.org/10.1371/journal.pbio.3000333
Abstract: Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. Among the 98 software tools selected for our installability test, 51% were deemed “easy to install,” and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.

Angela Lu, M.S.
Ph.D. Student
alu52904@usc.edu • View Angela’s papers
Mangul Lab papers authored by Angela
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei A comprehensive benchmarking of WGS-based structural variant callers Journal Article bioRxiv, 2020. Abstract | Links | BibTeX | Altmetric @article{Sarwal2020, title = {A comprehensive benchmarking of WGS-based structural variant callers}, author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul}, url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1}, doi = {10.1101/2020.04.16.045120}, year = {2020}, date = {2020-04-18}, journal = {bioRxiv}, abstract = {Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. 
We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |

Yutong (Liz) Chang, B.S.
Master’s Student
yutongch@usc.edu • View Liz’s papers
Mangul Lab papers authored by Liz
Loeffler, Caitlin; Gibson, Keylie M; Martin, Lana S; Chang, Yutong; Rotman, Jeremy; Toma, Ian V; Mason, Christopher E; Eskin, Eleazar; Zackular, Joseph P; Crandall, Keith A; Koslicki, David; Mangul, Serghei Metagenomics for clinical diagnostics: technologies and informatics Journal Article arXiv, 2019. Abstract | Links | BibTeX | Altmetric @article{Loeffler2019b, title = {Metagenomics for clinical diagnostics: technologies and informatics}, author = {Caitlin Loeffler and Keylie M Gibson and Lana S Martin and Yutong Chang and Jeremy Rotman and Ian V Toma and Christopher E Mason and Eleazar Eskin and Joseph P Zackular and Keith A Crandall and David Koslicki and Serghei Mangul}, url = {https://arxiv.org/abs/1911.11304}, doi = {1911.11304}, year = {2019}, date = {2019-11-25}, journal = {arXiv}, abstract = {The human-associated microbiome is closely tied to human health and is of substantial clinical interest. Metagenomics-based tools are emerging for clinical diagnostics, tracking the spread of diseases, and surveillance of potential pathogens. In some cases, these tools are overcoming limitations of traditional clinical approaches. Metagenomics has limitations barring the tools from clinical validation. Once these hurdles are overcome, clinical metagenomics will inform doctors of the best, targeted treatment for their patients and provide early detection of disease. Here we present an overview of metagenomics methods with a discussion of computational challenges and limitations.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |

Ram Ayyala
Undergraduate Researcher
ramayyala@gmail.com • View Ram’s papers
Mangul Lab papers authored by Ram
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei A comprehensive benchmarking of WGS-based structural variant callers Journal Article bioRxiv, 2020. Abstract | Links | BibTeX | Altmetric @article{Sarwal2020, title = {A comprehensive benchmarking of WGS-based structural variant callers}, author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul}, url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1}, doi = {10.1101/2020.04.16.045120}, year = {2020}, date = {2020-04-18}, journal = {bioRxiv}, abstract = {Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. 
We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |

Sei Chang
Undergraduate Researcher
seichang00@g.ucla.edu • View Sei’s papers
Mangul Lab papers authored by Sei
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei A comprehensive benchmarking of WGS-based structural variant callers Journal Article bioRxiv, 2020. Abstract | Links | BibTeX | Altmetric @article{Sarwal2020, title = {A comprehensive benchmarking of WGS-based structural variant callers}, author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul}, url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1}, doi = {10.1101/2020.04.16.045120}, year = {2020}, date = {2020-04-18}, journal = {bioRxiv}, abstract = {Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. 
We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei Benchmarking of computational error-correction methods for next-generation sequencing data Journal Article Genome Biology, 21 (71), 2020. Abstract | Links | BibTeX | Altmetric @article{mitchell2019benchmarking, title = {Benchmarking of computational error-correction methods for next-generation sequencing data}, author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul}, url = {https://doi.org/10.1186/s13059-020-01988-3}, doi = {10.1186/s13059-020-01988-3}, year = {2020}, date = {2020-03-17}, journal = {Genome Biology}, volume = {21}, number = {71}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. 
Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |

Nicholas (Niko) Darci-Maher
Undergraduate Researcher
niko.darcimaher@gmail.com • View Niko’s papers
Mangul Lab papers authored by Niko
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei A comprehensive benchmarking of WGS-based structural variant callers Journal Article bioRxiv, 2020. Abstract | Links | BibTeX | Altmetric @article{Sarwal2020, title = {A comprehensive benchmarking of WGS-based structural variant callers}, author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul}, url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1}, doi = {10.1101/2020.04.16.045120}, year = {2020}, date = {2020-04-18}, journal = {bioRxiv}, abstract = {Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. 
We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |

Daniel Yuen Wook Kim
Undergraduate Researcher
kimy17@rpi.edu • View Daniel’s papers
Mangul Lab papers authored by Daniel
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei Benchmarking of computational error-correction methods for next-generation sequencing data Journal Article Genome Biology, 21 (71), 2020. Abstract | Links | BibTeX | Altmetric @article{mitchell2019benchmarking, title = {Benchmarking of computational error-correction methods for next-generation sequencing data}, author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul}, url = {https://doi.org/10.1186/s13059-020-01988-3}, doi = {10.1186/s13059-020-01988-3}, year = {2020}, date = {2020-03-17}, journal = {Genome Biology}, volume = {21}, number = {71}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. 
Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |

Qiaozhen (Jenny) Wu
Undergraduate Researcher
Jenny.Wu.Dalian@outlook.com • View Jenny’s papers
Mangul Lab papers authored by Jenny
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei Benchmarking of computational error-correction methods for next-generation sequencing data Journal Article Genome Biology, 21 (71), 2020. Abstract | Links | BibTeX | Altmetric @article{mitchell2019benchmarking, title = {Benchmarking of computational error-correction methods for next-generation sequencing data}, author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul}, url = {https://doi.org/10.1186/s13059-020-01988-3}, doi = {10.1186/s13059-020-01988-3}, year = {2020}, date = {2020-03-17}, journal = {Genome Biology}, volume = {21}, number = {71}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. 
Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Name | Role | Member during... | Currently... | Papers... |
---|---|---|---|---|
Caitlin Loeffler | Bioinformatics Analyst | 2016 to 2020 | PhD student @ George Washington University | View Caitlin's papers |
Aaron Karlsberg | Software Engineer | 2016 to 2020 | | View Aaron's papers |
Jeremy Rotman | Software Engineer | 2016 to 2020 | Data Software Engineer, Perimetrics | View Jeremy's papers |
Rahul Chikka | Undergrad Researcher | 2019 | Undergrad @ UCLA, Computer Science | View Rahul's papers |
Kevin Hsieh | Undergrad Researcher | 2016 to 2019 | Software Engineering Intern, Google | View Kevin's papers |
Victor Xue | Undergrad Researcher | 2018 to 2019 | Software Engineer, Northrop Grumman | View Victor's papers |
Jacqueline Castellanos | Undergrad Researcher | 2018 UCLA QCB B.I.G. Summer Scholar | | View Jacqueline's papers |
Emily Wesel | High School Researcher | 2018 UCLA QCB B.I.G. Summer Scholar | Undergrad @ Stanford, Computer Science | View Emily's papers |
Varuni Sarwal | Undergrad Researcher | 2016 to 2018; 2018 UCLA QCB B.I.G. Summer Scholar | Undergrad @ IIT Delhi, Biochemical Engineering | View Varuni's papers |
Keith Mitchell | Undergrad Researcher | 2016 to 2018; 2018 UCLA QCB B.I.G. Summer Scholar | Master's student @ UC Davis, Genetics & Genomics | View Keith's papers |
Linus Chen | Undergrad Researcher | 2016 to 2018; 2017 UCLA QCB B.I.G. Summer Scholar | Undergrad @ UCLA, Bioengineering & Biomedical Engineering | View Linus's papers |
Russell Littman | Undergrad Researcher | 2016 to 2018 | PhD student @ UCLA, Bioinformatics IDP | View Russell's papers |
Angela Ka-Mei Lam | Undergrad Researcher | 2017 to 2018 | Software Developer @ Ardent Labs | View Angela's papers |
Benjamin Statz | Undergrad Researcher | 2016 to 2018; 2016 UCLA QCB B.I.G. Summer Scholar | | View Benjamin's papers |
Will Van der Lay | High School Researcher | 2016 to 2017; 2016 UCLA QCB B.I.G. Summer Scholar | | View Will's papers |
Harry Taegyun Yang | Undergrad Researcher | 2016 to 2017 | PhD student @ UCLA, Bioinformatics IDP | View Harry's papers |
Garrett Parker | Undergrad Researcher | 2017 UCLA QCB B.I.G. Summer Scholar | | View Garrett's papers |
German Shabanets | High School Researcher | 2017 | Undergrad @ Stanford, Computer Science & Linguistics | View German's papers |
Taylor Shabani | High School Researcher | 2017 | Undergrad @ Duke, Computer Science & Economics | View Taylor's papers |
Teia Noel | Master's Student | 2017 | | View Teia's papers |
Kevin Wesel | High School Researcher | 2016 UCLA QCB B.I.G. Summer Scholar | Undergrad @ MIT, Biology & Economics | View Kevin's papers |
Mangul Lab papers authored by Caitlin
Loeffler, Caitlin; Karlsberg, Aaron; Martin, Lana S; Eskin, Eleazar; Koslicki, David; Mangul, Serghei Improving the usability and comprehensiveness of microbial databases Journal Article BMC Biology, 18 (37), 2020. Abstract | Links | BibTeX | Altmetric @article{microbial2020, title = {Improving the usability and comprehensiveness of microbial databases}, author = {Caitlin Loeffler and Aaron Karlsberg and Lana S Martin and Eleazar Eskin and David Koslicki and Serghei Mangul}, url = {https://doi.org/10.1186/s12915-020-0756-z}, doi = {10.1186/s12915-020-0756-z}, year = {2020}, date = {2020-04-07}, journal = {BMC Biology}, volume = {18}, number = {37}, abstract = {Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehensiveness, and calling for the establishment of recommendations for reference genome database assembly. Here, we analyze existing fungal and bacterial databases and discuss guidelines for the development of a master reference database that promises to improve the quality and quantity of omics research.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Loeffler, Caitlin; Gibson, Keylie M; Martin, Lana S; Chang, Yutong; Rotman, Jeremy; Toma, Ian V; Mason, Christopher E; Eskin, Eleazar; Zackular, Joseph P; Crandall, Keith A; Koslicki, David; Mangul, Serghei Metagenomics for clinical diagnostics: technologies and informatics Journal Article arXiv, 2019. Abstract | Links | BibTeX | Altmetric @article{Loeffler2019b, title = {Metagenomics for clinical diagnostics: technologies and informatics}, author = {Caitlin Loeffler and Keylie M Gibson and Lana S Martin and Yutong Chang and Jeremy Rotman and Ian V Toma and Christopher E Mason and Eleazar Eskin and Joseph P Zackular and Keith A Crandall and David Koslicki and Serghei Mangul}, url = {https://arxiv.org/abs/1911.11304}, doi = {1911.11304}, year = {2019}, date = {2019-11-25}, journal = {arXiv}, abstract = {The human-associated microbiome is closely tied to human health and is of substantial clinical interest. Metagenomics-based tools are emerging for clinical diagnostics, tracking the spread of diseases, and surveillance of potential pathogens. In some cases, these tools are overcoming limitations of traditional clinical approaches. Metagenomics has limitations barring the tools from clinical validation. Once these hurdles are overcome, clinical metagenomics will inform doctors of the best, targeted treatment for their patients and provide early detection of disease. Here we present an overview of metagenomics methods with a discussion of computational challenges and limitations.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Mangul Lab papers authored by Aaron
Loeffler, Caitlin; Karlsberg, Aaron; Martin, Lana S; Eskin, Eleazar; Koslicki, David; Mangul, Serghei Improving the usability and comprehensiveness of microbial databases Journal Article BMC Biology, 18 (37), 2020. Abstract | Links | BibTeX | Altmetric @article{microbial2020, title = {Improving the usability and comprehensiveness of microbial databases}, author = {Caitlin Loeffler and Aaron Karlsberg and Lana S Martin and Eleazar Eskin and David Koslicki and Serghei Mangul}, url = {https://doi.org/10.1186/s12915-020-0756-z}, doi = {10.1186/s12915-020-0756-z}, year = {2020}, date = {2020-04-07}, journal = {BMC Biology}, volume = {18}, number = {37}, abstract = {Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehensiveness, and calling for the establishment of recommendations for reference genome database assembly. Here, we analyze existing fungal and bacterial databases and discuss guidelines for the development of a master reference database that promises to improve the quality and quantity of omics research.}, keywords = {}, pubstate = {published}, tppubtype = {article} } |
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei
Benchmarking of computational error-correction methods for next-generation sequencing data
Journal Article. Genome Biology, 21 (71), 2020.
@article{mitchell2019benchmarking,
  title = {Benchmarking of computational error-correction methods for next-generation sequencing data},
  author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul},
  url = {https://doi.org/10.1186/s13059-020-01988-3},
  doi = {10.1186/s13059-020-01988-3},
  year = {2020},
  date = {2020-03-17},
  journal = {Genome Biology},
  volume = {21},
  number = {71},
  publisher = {Cold Spring Harbor Laboratory},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.
Mangul Lab papers authored by Jeremy
Mandric, Igor; Rotman, Jeremy; Yang, Harry Taegyun; Strauli, Nicolas; Montoya, Dennis; Van Der Lay, Will; Ronas, Jiem R; Statz, Benjamin; Yao, Douglas; Petrova, Velislava; Zelikovsky, Alex; Spreafico, Roberto; Shifman, Sagiv; Zaitlen, Noah; Rossetti, Maura; Ansel, K. Mark; Eskin, Eleazar; Mangul, Serghei
Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing
Journal Article. Nature Communications, 11 (3126), 2020.
@article{mangul2016profiling,
  title = {Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing},
  author = {Igor Mandric and Jeremy Rotman and Harry Taegyun Yang and Nicolas Strauli and Dennis Montoya and Will Van Der Lay and Jiem R Ronas and Benjamin Statz and Douglas Yao and Velislava Petrova and Alex Zelikovsky and Roberto Spreafico and Sagiv Shifman and Noah Zaitlen and Maura Rossetti and K. Mark Ansel and Eleazar Eskin and Serghei Mangul},
  url = {https://doi.org/10.1038/s41467-020-16857-7},
  doi = {10.1038/s41467-020-16857-7},
  year = {2020},
  date = {2020-06-19},
  journal = {Nature Communications},
  volume = {11},
  number = {3126},
  publisher = {Nature Publications},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: Profiling immunoglobulin (Ig) receptor repertoires with specialized assays can be cost-ineffective and time-consuming. Here we report ImReP, a computational method for rapid and accurate profiling of the Ig repertoire, including the complementarity-determining region 3 (CDR3), using regular RNA sequencing data such as those from 8,555 samples across 53 tissue types from 544 individuals in the Genotype-Tissue Expression (GTEx v6) project. Using ImReP and GTEx v6 data, we generate a collection of 3.6 million Ig sequences, termed the atlas of immunoglobulin repertoires (TAIR), across a broad range of tissue types that often do not have reported Ig repertoire information. Moreover, the flow of Ig clonotypes and inter-tissue repertoire similarities across immune-related tissues are also evaluated. In summary, TAIR is one of the largest collections of CDR3 sequences and tissue types, and should serve as an important resource for studying immunological diseases.
Alser, Mohammed; Rotman, Jeremy; Taraszka, Kodi; Shi, Huwenbo; Baykal, Pelin Icer; Yang, Harry Taegyun; Xue, Victor; Knyazev, Sergey; Singer, Benjamin D; Balliu, Brunilda; Koslicki, David; Skums, Pavel; Zelikovsky, Alex; Alkan, Can; Mutlu, Onur; Mangul, Serghei
Technology dictates algorithms: Recent developments in read alignment
Journal Article. arXiv, 2020.
@article{Alser2020,
  title = {Technology dictates algorithms: Recent developments in read alignment},
  author = {Mohammed Alser and Jeremy Rotman and Kodi Taraszka and Huwenbo Shi and Pelin Icer Baykal and Harry Taegyun Yang and Victor Xue and Sergey Knyazev and Benjamin D Singer and Brunilda Balliu and David Koslicki and Pavel Skums and Alex Zelikovsky and Can Alkan and Onur Mutlu and Serghei Mangul},
  url = {https://arxiv.org/abs/2003.00110},
  doi = {2003.00110},
  year = {2020},
  date = {2020-02-28},
  journal = {arXiv},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Our review provides a survey of algorithmic foundations and methodologies across alignment methods for both short and long reads. We provide rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read aligners. We separately discuss how longer read lengths produce unique advantages and limitations to read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies.
Brito, Jaqueline J; Mosqueiro, Thiago; Rotman, Jeremy; Xue, Victor; Chapski, Douglas J; De la Hoz, Juan; Matias, Paulo; Martin, Lana S; Zelikovsky, Alex; Pellegrini, Matteo; Mangul, Serghei
Telescope: an interactive tool for managing large scale analysis from mobile devices
Journal Article. GigaScience, 9 (1), pp. giz163, 2020.
@article{Brito2019,
  title = {Telescope: an interactive tool for managing large scale analysis from mobile devices},
  author = {Jaqueline J Brito and Thiago Mosqueiro and Jeremy Rotman and Victor Xue and Douglas J Chapski and Juan De la Hoz and Paulo Matias and Lana S Martin and Alex Zelikovsky and Matteo Pellegrini and Serghei Mangul},
  url = {https://doi.org/10.1093/gigascience/giz163},
  doi = {10.1093/gigascience/giz163},
  year = {2020},
  date = {2020-01-23},
  journal = {GigaScience},
  volume = {9},
  number = {1},
  pages = {giz163},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract:
Background: In today's world of big data, computational analysis has become a key driver of biomedical research. High-performance computational facilities are capable of processing considerable volumes of data, yet often lack an easy-to-use interface to guide the user in supervising and adjusting bioinformatics analysis via a tablet or smartphone.
Results: To address this gap we proposed Telescope, a novel tool that interfaces with high-performance computational clusters to deliver an intuitive user interface for controlling and monitoring bioinformatics analyses in real-time. By leveraging last generation technology now ubiquitous to most researchers (such as smartphones), Telescope delivers a friendly user experience and manages connectivity and encryption under the hood.
Conclusions: Telescope helps to mitigate the digital divide between wet and computational laboratories in contemporary biology. By delivering convenience and ease of use through a user experience not relying on expertise with computational clusters, Telescope can help researchers close the feedback loop between bioinformatics and experimental work with minimal impact on the performance of computational tools. Telescope is freely available at https://github.com/Mangul-Lab-USC/telescope.
Loeffler, Caitlin; Gibson, Keylie M; Martin, Lana S; Chang, Yutong; Rotman, Jeremy; Toma, Ian V; Mason, Christopher E; Eskin, Eleazar; Zackular, Joseph P; Crandall, Keith A; Koslicki, David; Mangul, Serghei
Metagenomics for clinical diagnostics: technologies and informatics
Journal Article. arXiv, 2019.
@article{Loeffler2019b,
  title = {Metagenomics for clinical diagnostics: technologies and informatics},
  author = {Caitlin Loeffler and Keylie M Gibson and Lana S Martin and Yutong Chang and Jeremy Rotman and Ian V Toma and Christopher E Mason and Eleazar Eskin and Joseph P Zackular and Keith A Crandall and David Koslicki and Serghei Mangul},
  url = {https://arxiv.org/abs/1911.11304},
  doi = {1911.11304},
  year = {2019},
  date = {2019-11-25},
  journal = {arXiv},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: The human-associated microbiome is closely tied to human health and is of substantial clinical interest. Metagenomics-based tools are emerging for clinical diagnostics, tracking the spread of diseases, and surveillance of potential pathogens. In some cases, these tools are overcoming limitations of traditional clinical approaches. Metagenomics has limitations barring the tools from clinical validation. Once these hurdles are overcome, clinical metagenomics will inform doctors of the best, targeted treatment for their patients and provide early detection of disease. Here we present an overview of metagenomics methods with a discussion of computational challenges and limitations.
Thomas, Brandon; Karimzada, Mohammed; Spreafico, Roberto; Mangul, Serghei; Botten, Jason W; Rotman, Jeremy; Wesel, Kevin; Binder, Pratibha S; Gharavi, Nima; Chesnut, Robert W
104 Lack of human papilloma virus transcription in cutaneous squamous cell carcinoma stratified by histological grade and host immune status
Journal Article. Journal of Investigative Dermatology, 137 (5), pp. S18, 2017.
@article{thomas2017104,
  title = {104 Lack of human papilloma virus transcription in cutaneous squamous cell carcinoma stratified by histological grade and host immune status},
  author = {Brandon Thomas and Mohammed Karimzada and Roberto Spreafico and Serghei Mangul and Jason W Botten and Jeremy Rotman and Kevin Wesel and Pratibha S Binder and Nima Gharavi and Robert W Chesnut},
  url = {https://doi.org/10.1016/j.jid.2017.02.118},
  doi = {10.1016/j.jid.2017.02.118},
  year = {2017},
  date = {2017-01-01},
  journal = {Journal of Investigative Dermatology},
  volume = {137},
  number = {5},
  pages = {S18},
  publisher = {Elsevier},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: Human Papilloma Virus (HPV) infection is known to contribute to mucosal (m)SCC, but its role in cutaneous (c)SCC progression remains unclear, especially in lesions determined to be at high risk for metastasis. We hypothesized that histologically high grade cSCCs in immunosuppressed patients would display increased transcriptional activity of HPV when compared to low histologic grade lesions in otherwise healthy patients. To assess the role of viruses in cSCC pathogenesis we utilized high throughput RNA sequencing across risk-stratified lesions. A total of 22 skin excisions (11 classified as high grade in immunocompromised patients, 8 classified as low grade in otherwise healthy patients, and 3 as normal skin) were used for detection of any non-human RNA. Reads were aligned to known viral transcriptomes using our recently developed Microbiome Coverage Profiler. While approximately two-thirds of all samples tested positive for HPV gDNA, no skin sample had detectable expression of HPV RNA. Instead, many were found to have expression of Human Endogenous Retroviruses, Simian Virus 40, and Staphylococcus Prophages, while analysis of published datasets of sequenced HeLa cells demonstrated numerous RNA reads for HPV. These results suggest that either HPV does not participate in cSCC development, or facilitates cSCC initiation without affecting tumor progression. The ability to monitor viral and prophage gene expression in skin biopsies will provide insights into the interplay of host-pathogen interactions, and the framework described herein can be used to analyze skin biopsies to facilitate understanding in cases where pathogens are thought to contribute to disease pathogenesis.
Mangul Lab papers authored by Rahul
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei
A comprehensive benchmarking of WGS-based structural variant callers
Journal Article. bioRxiv, 2020.
@article{Sarwal2020,
  title = {A comprehensive benchmarking of WGS-based structural variant callers},
  author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul},
  url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1},
  doi = {10.1101/2020.04.16.045120},
  year = {2020},
  date = {2020-04-18},
  journal = {bioRxiv},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: Advances in whole genome sequencing promise to enable accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges, and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs, allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.
Mangul Lab papers authored by Kevin
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei
Benchmarking of computational error-correction methods for next-generation sequencing data
Journal Article. Genome Biology, 21 (71), 2020.
@article{mitchell2019benchmarking,
  title = {Benchmarking of computational error-correction methods for next-generation sequencing data},
  author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul},
  url = {https://doi.org/10.1186/s13059-020-01988-3},
  doi = {10.1186/s13059-020-01988-3},
  year = {2020},
  date = {2020-03-17},
  journal = {Genome Biology},
  volume = {21},
  number = {71},
  publisher = {Cold Spring Harbor Laboratory},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.
Mangul, Serghei; Yang, Harry Taegyun; Strauli, Nicolas; Gruhl, Franziska; Porath, Hagit T; Hsieh, Kevin; Chen, Linus; Daley, Timothy; Christenson, Stephanie; Wesolowska-Andersen, Agata
ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues
Journal Article. Genome Biology, 19 (1), pp. 36, 2018.
@article{mangul2018rop,
  title = {ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues},
  author = {Serghei Mangul and Harry Taegyun Yang and Nicolas Strauli and Franziska Gruhl and Hagit T Porath and Kevin Hsieh and Linus Chen and Timothy Daley and Stephanie Christenson and Agata Wesolowska-Andersen},
  url = {https://doi.org/10.1186/s13059-018-1403-7},
  doi = {10.1186/s13059-018-1403-7},
  year = {2018},
  date = {2018-02-02},
  journal = {Genome Biology},
  volume = {19},
  number = {1},
  pages = {36},
  publisher = {BioMed Central},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: High-throughput RNA-sequencing (RNA-seq) technologies provide an unprecedented opportunity to explore the individual transcriptome. Unmapped reads are a large and often overlooked output of standard RNA-seq analyses. Here, we present Read Origin Protocol (ROP), a tool for discovering the source of all reads originating from complex RNA molecules. We apply ROP to samples across 2630 individuals from 54 diverse human tissues. Our approach can account for 99.9% of 1 trillion reads of various read lengths. Additionally, we use ROP to investigate the functional mechanisms underlying connections between the immune system, microbiome, and disease. ROP is freely available at https://github.com/smangul1/rop/wiki.
Mangul Lab papers authored by Victor
Alser, Mohammed; Rotman, Jeremy; Taraszka, Kodi; Shi, Huwenbo; Baykal, Pelin Icer; Yang, Harry Taegyun; Xue, Victor; Knyazev, Sergey; Singer, Benjamin D; Balliu, Brunilda; Koslicki, David; Skums, Pavel; Zelikovsky, Alex; Alkan, Can; Mutlu, Onur; Mangul, Serghei
Technology dictates algorithms: Recent developments in read alignment
Journal Article. arXiv, 2020.
@article{Alser2020,
  title = {Technology dictates algorithms: Recent developments in read alignment},
  author = {Mohammed Alser and Jeremy Rotman and Kodi Taraszka and Huwenbo Shi and Pelin Icer Baykal and Harry Taegyun Yang and Victor Xue and Sergey Knyazev and Benjamin D Singer and Brunilda Balliu and David Koslicki and Pavel Skums and Alex Zelikovsky and Can Alkan and Onur Mutlu and Serghei Mangul},
  url = {https://arxiv.org/abs/2003.00110},
  doi = {2003.00110},
  year = {2020},
  date = {2020-02-28},
  journal = {arXiv},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract: Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Our review provides a survey of algorithmic foundations and methodologies across alignment methods for both short and long reads. We provide rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read aligners. We separately discuss how longer read lengths produce unique advantages and limitations to read alignment techniques. We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies.
Brito, Jaqueline J; Mosqueiro, Thiago; Rotman, Jeremy; Xue, Victor; Chapski, Douglas J; De la Hoz, Juan; Matias, Paulo; Martin, Lana S; Zelikovsky, Alex; Pellegrini, Matteo; Mangul, Serghei
Telescope: an interactive tool for managing large scale analysis from mobile devices
Journal Article. GigaScience, 9 (1), pp. giz163, 2020.
@article{Brito2019,
  title = {Telescope: an interactive tool for managing large scale analysis from mobile devices},
  author = {Jaqueline J Brito and Thiago Mosqueiro and Jeremy Rotman and Victor Xue and Douglas J Chapski and Juan De la Hoz and Paulo Matias and Lana S Martin and Alex Zelikovsky and Matteo Pellegrini and Serghei Mangul},
  url = {https://doi.org/10.1093/gigascience/giz163},
  doi = {10.1093/gigascience/giz163},
  year = {2020},
  date = {2020-01-23},
  journal = {GigaScience},
  volume = {9},
  number = {1},
  pages = {giz163},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Abstract:
Background: In today's world of big data, computational analysis has become a key driver of biomedical research. High-performance computational facilities are capable of processing considerable volumes of data, yet often lack an easy-to-use interface to guide the user in supervising and adjusting bioinformatics analysis via a tablet or smartphone.
Results: To address this gap we proposed Telescope, a novel tool that interfaces with high-performance computational clusters to deliver an intuitive user interface for controlling and monitoring bioinformatics analyses in real-time. By leveraging last generation technology now ubiquitous to most researchers (such as smartphones), Telescope delivers a friendly user experience and manages connectivity and encryption under the hood.
Conclusions: Telescope helps to mitigate the digital divide between wet and computational laboratories in contemporary biology. By delivering convenience and ease of use through a user experience not relying on expertise with computational clusters, Telescope can help researchers close the feedback loop between bioinformatics and experimental work with minimal impact on the performance of computational tools. Telescope is freely available at https://github.com/Mangul-Lab-USC/telescope.
Mangul Lab papers authored by Jacqueline
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei A comprehensive benchmarking of WGS-based structural variant callers Journal Article bioRxiv, 2020. Abstract | Links | BibTeX | Altmetric @article{Sarwal2020, title = {A comprehensive benchmarking of WGS-based structural variant callers}, author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul}, url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1}, doi = {10.1101/2020.04.16.045120}, year = {2020}, date = {2020-04-18}, journal = {bioRxiv}, abstract = {Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. 
We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul Lab papers authored by Emily
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei A comprehensive benchmarking of WGS-based structural variant callers Journal Article bioRxiv, 2020. Abstract | Links | BibTeX | Altmetric @article{Sarwal2020, title = {A comprehensive benchmarking of WGS-based structural variant callers}, author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul}, url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1}, doi = {10.1101/2020.04.16.045120}, year = {2020}, date = {2020-04-18}, journal = {bioRxiv}, abstract = {Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. 
We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul Lab papers authored by Varuni
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei A comprehensive benchmarking of WGS-based structural variant callers Journal Article bioRxiv, 2020. Abstract | Links | BibTeX | Altmetric @article{Sarwal2020, title = {A comprehensive benchmarking of WGS-based structural variant callers}, author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul}, url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1}, doi = {10.1101/2020.04.16.045120}, year = {2020}, date = {2020-04-18}, journal = {bioRxiv}, abstract = {Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. 
We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Mosqueiro, Thiago; Abdill, Richard J; Duong, Dat; Mitchell, Keith; Sarwal, Varuni; Hill, Brian L; Brito, Jaqueline J; Littman, Russell Jared; Statz, Benjamin Challenges and recommendations to improve the installability and archival stability of omics computational tools Journal Article PLoS Biology, 17 (6), pp. e3000333, 2019. Abstract | Links | BibTeX | Altmetric @article{mangul2019challenges, title = {Challenges and recommendations to improve the installability and archival stability of omics computational tools}, author = {Serghei Mangul and Thiago Mosqueiro and Richard J Abdill and Dat Duong and Keith Mitchell and Varuni Sarwal and Brian L Hill and Jaqueline J Brito and Russell Jared Littman and Benjamin Statz}, url = {https://doi.org/10.1371/journal.pbio.3000333}, doi = {10.1371/journal.pbio.3000333}, year = {2019}, date = {2019-06-20}, journal = {PLoS Biology}, volume = {17}, number = {6}, pages = {e3000333}, publisher = {Public Library of Science}, abstract = {Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. 
Among the 98 software tools selected for our installability test, 51% were deemed “easy to install,” and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul Lab papers authored by Keith
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei Benchmarking of computational error-correction methods for next-generation sequencing data Journal Article Genome Biology, 21 (71), 2020. Abstract | Links | BibTeX | Altmetric @article{mitchell2019benchmarking, title = {Benchmarking of computational error-correction methods for next-generation sequencing data}, author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul}, url = {https://doi.org/10.1186/s13059-020-01988-3}, doi = {10.1186/s13059-020-01988-3}, year = {2020}, date = {2020-03-17}, journal = {Genome Biology}, volume = {21}, number = {71}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. 
Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Mosqueiro, Thiago; Abdill, Richard J; Duong, Dat; Mitchell, Keith; Sarwal, Varuni; Hill, Brian L; Brito, Jaqueline J; Littman, Russell Jared; Statz, Benjamin Challenges and recommendations to improve the installability and archival stability of omics computational tools Journal Article PLoS Biology, 17 (6), pp. e3000333, 2019. Abstract | Links | BibTeX | Altmetric @article{mangul2019challenges, title = {Challenges and recommendations to improve the installability and archival stability of omics computational tools}, author = {Serghei Mangul and Thiago Mosqueiro and Richard J Abdill and Dat Duong and Keith Mitchell and Varuni Sarwal and Brian L Hill and Jaqueline J Brito and Russell Jared Littman and Benjamin Statz}, url = {https://doi.org/10.1371/journal.pbio.3000333}, doi = {10.1371/journal.pbio.3000333}, year = {2019}, date = {2019-06-20}, journal = {PLoS Biology}, volume = {17}, number = {6}, pages = {e3000333}, publisher = {Public Library of Science}, abstract = {Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. 
Among the 98 software tools selected for our installability test, 51% were deemed “easy to install,” and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mitchell, Keith; Dao, Chris; Freise, Amanda; Mangul, Serghei; Parker, Jordan Moberg PUMA: A tool for processing 16S rRNA taxonomy data for analysis and visualization Journal Article bioRxiv, pp. 482380, 2018. Abstract | Links | BibTeX | Altmetric @article{mitchell2018puma, title = {PUMA: A tool for processing 16S rRNA taxonomy data for analysis and visualization}, author = {Keith Mitchell and Chris Dao and Amanda Freise and Serghei Mangul and Jordan Moberg Parker}, url = {https://doi.org/10.1101/482380}, doi = {10.1101/482380}, year = {2018}, date = {2018-11-29}, journal = {bioRxiv}, pages = {482380}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Microbial community profiling and functional inference via 16S rRNA analysis is quickly expanding across various areas of microbiology due to improvements to technology. There are numerous platforms for producing 16S rRNA taxonomic data which often vary in file and sequence formatting, creating a common barrier in microbiome studies. Additionally, many of the methods for analyzing and visualizing this sequencing data each require their own specific formatting. As a result, efficient and reproducible comparative analysis of taxonomic data and corresponding metadata in multiple programs remains a challenge in the investigation of microbial communities. PUMA, the Program for Unifying Microbiome Analysis, alleviates this problem in microbiome studies by allowing users to take advantage of numerous 16S rRNA taxonomic identification platforms and analysis tools in an efficient manner. PUMA accepts sequencing results from several taxonomic identification platforms and then automates configuration of data and file types for analysis and visualization via many popular tools. The protocol accomplishes this by producing a variety of properly configured, annotated, and altered files for both analysis and visualization of taxonomic community profiles and inferred functional profiles. 
PUMA provides an easy and flexible interface to accommodate for a variety of users to produce all files needed for all-inclusive analysis of targeted amplicon sequencing studies. PUMA is an unprecedented open-source solution for unifying multiple microbiome analysis softwares and uses an adaptable implementation with the potential to improve and consolidate the state of microbiome research.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul Lab papers authored by Linus
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei Benchmarking of computational error-correction methods for next-generation sequencing data Journal Article Genome Biology, 21 (71), 2020. Abstract | Links | BibTeX | Altmetric @article{mitchell2019benchmarking, title = {Benchmarking of computational error-correction methods for next-generation sequencing data}, author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul}, url = {https://doi.org/10.1186/s13059-020-01988-3}, doi = {10.1186/s13059-020-01988-3}, year = {2020}, date = {2020-03-17}, journal = {Genome Biology}, volume = {21}, number = {71}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. 
Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Yang, Harry Taegyun; Strauli, Nicolas; Gruhl, Franziska; Porath, Hagit T; Hsieh, Kevin; Chen, Linus; Daley, Timothy; Christenson, Stephanie; Wesolowska-Andersen, Agata ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues Journal Article Genome Biology, 19 (1), pp. 36, 2018. Abstract | Links | BibTeX | Altmetric @article{mangul2018rop, title = {ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues}, author = {Serghei Mangul and Harry Taegyun Yang and Nicolas Strauli and Franziska Gruhl and Hagit T Porath and Kevin Hsieh and Linus Chen and Timothy Daley and Stephanie Christenson and Agata Wesolowska-Andersen}, url = {https://doi.org/10.1186/s13059-018-1403-7}, doi = {10.1186/s13059-018-1403-7}, year = {2018}, date = {2018-02-02}, journal = {Genome Biology}, volume = {19}, number = {1}, pages = {36}, publisher = {BioMed Central}, abstract = {High-throughput RNA-sequencing (RNA-seq) technologies provide an unprecedented opportunity to explore the individual transcriptome. Unmapped reads are a large and often overlooked output of standard RNA-seq analyses. Here, we present Read Origin Protocol (ROP), a tool for discovering the source of all reads originating from complex RNA molecules. We apply ROP to samples across 2630 individuals from 54 diverse human tissues. Our approach can account for 99.9% of 1 trillion reads of various read length. Additionally, we use ROP to investigate the functional mechanisms underlying connections between the immune system, microbiome, and disease. ROP is freely available at https://github.com/smangul1/rop/wiki.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul Lab papers authored by Russell
Sarwal, Varuni; Niehus, Sebastian; Ayyala, Ram; Chang, Sei; Lu, Angela; Darci-Maher, Nicholas; Littman, Russell Jared; Wesel, Emily; Castellanos, Jacqueline; Chikka, Rahul; Distler, Margaret G; Eskin, Eleazar; Flint, Jonathan; Mangul, Serghei A comprehensive benchmarking of WGS-based structural variant callers Journal Article bioRxiv, 2020. Abstract | Links | BibTeX | Altmetric @article{Sarwal2020, title = {A comprehensive benchmarking of WGS-based structural variant callers}, author = {Varuni Sarwal and Sebastian Niehus and Ram Ayyala and Sei Chang and Angela Lu and Nicholas Darci-Maher and Russell Jared Littman and Emily Wesel and Jacqueline Castellanos and Rahul Chikka and Margaret G Distler and Eleazar Eskin and Jonathan Flint and Serghei Mangul}, url = {https://www.biorxiv.org/content/10.1101/2020.04.16.045120v1}, doi = {10.1101/2020.04.16.045120}, year = {2020}, date = {2020-04-18}, journal = {bioRxiv}, abstract = {Advances in whole genome sequencing promise to enable the accurate and comprehensive structural variant (SV) discovery. Dissecting SVs from whole genome sequencing (WGS) data presents a substantial number of challenges and a plethora of SV-detection methods have been developed. Currently, there is a paucity of evidence which investigators can use to select appropriate SV-detection tools. In this paper, we evaluated the performance of SV-detection tools using a comprehensive PCR-confirmed gold standard set of SVs. In contrast to the previous benchmarking studies, our gold standard dataset included a complete set of SVs allowing us to report both precision and sensitivity rates of SV-detection methods. Our study investigates the ability of the methods to detect deletions, thus providing an optimistic estimate of SV detection performance, as the SV-detection methods that fail to detect deletions are likely to miss more complex SVs. 
We found that SV-detection tools varied widely in their performance, with several methods providing a good balance between sensitivity and precision. Additionally, we have determined the SV callers best suited for low and ultra-low pass sequencing data.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei Benchmarking of computational error-correction methods for next-generation sequencing data Journal Article Genome Biology, 21 (71), 2020. Abstract | Links | BibTeX | Altmetric @article{mitchell2019benchmarking, title = {Benchmarking of computational error-correction methods for next-generation sequencing data}, author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul}, url = {https://doi.org/10.1186/s13059-020-01988-3}, doi = {10.1186/s13059-020-01988-3}, year = {2020}, date = {2020-03-17}, journal = {Genome Biology}, volume = {21}, number = {71}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. 
Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Mosqueiro, Thiago; Abdill, Richard J; Duong, Dat; Mitchell, Keith; Sarwal, Varuni; Hill, Brian L; Brito, Jaqueline J; Littman, Russell Jared; Statz, Benjamin Challenges and recommendations to improve the installability and archival stability of omics computational tools Journal Article PLoS Biology, 17 (6), pp. e3000333, 2019. Abstract | Links | BibTeX | Altmetric @article{mangul2019challenges, title = {Challenges and recommendations to improve the installability and archival stability of omics computational tools}, author = {Serghei Mangul and Thiago Mosqueiro and Richard J Abdill and Dat Duong and Keith Mitchell and Varuni Sarwal and Brian L Hill and Jaqueline J Brito and Russell Jared Littman and Benjamin Statz}, url = {https://doi.org/10.1371/journal.pbio.3000333}, doi = {10.1371/journal.pbio.3000333}, year = {2019}, date = {2019-06-20}, journal = {PLoS Biology}, volume = {17}, number = {6}, pages = {e3000333}, publisher = {Public Library of Science}, abstract = {Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. 
Among the 98 software tools selected for our installability test, 51% were deemed “easy to install,” and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul Lab papers authored by Angela
Mangul, Serghei; Martin, Lana S; Hill, Brian L; Lam, Angela Ka-Mei; Distler, Margaret G; Zelikovsky, Alex; Eskin, Eleazar; Flint, Jonathan Systematic benchmarking of omics computational tools Journal Article Nature Communications, 10 (1393), pp. 1-11, 2019. Abstract | Links | BibTeX | Altmetric @article{mangul2019systematic, title = {Systematic benchmarking of omics computational tools}, author = {Serghei Mangul and Lana S Martin and Brian L Hill and Angela Ka-Mei Lam and Margaret G Distler and Alex Zelikovsky and Eleazar Eskin and Jonathan Flint}, url = {https://doi.org/10.1038/s41467-019-09406-4}, doi = {10.1038/s41467-019-09406-4}, year = {2019}, date = {2019-03-27}, journal = {Nature Communications}, volume = {10}, number = {1393}, pages = {1-11}, publisher = {Nature Publishing Group}, abstract = {Computational omics methods packaged as software have become essential to modern biological research. The increasing dependence of scientists on these powerful software tools creates a need for systematic assessment of these methods, known as benchmarking. Adopting a standardized benchmarking practice could help researchers who use omics data to better leverage recent technological innovations. Our review summarizes benchmarking practices from 25 recent studies and discusses the challenges, advantages, and limitations of benchmarking across various domains of biology. We also propose principles that can make computational biology benchmarking studies more sustainable and reproducible, ultimately increasing the transparency of biomedical data and results.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul Lab papers authored by Benjamin
Mandric, Igor; Rotman, Jeremy; Yang, Harry Taegyun; Strauli, Nicolas; Montoya, Dennis; Lay, Will Van Der; Ronas, Jiem R; Statz, Benjamin; Yao, Douglas; Petrova, Velislava; Zelikovsky, Alex; Spreafico, Roberto; Shifman, Sagiv; Zaitlen, Noah; Rossetti, Maura; Ansel, Mark K; Eskin, Eleazar; Mangul, Serghei Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing Journal Article Nature Communications, 11 (3126), 2020. Abstract | Links | BibTeX | Altmetric @article{mangul2016profiling, title = {Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing}, author = {Igor Mandric and Jeremy Rotman and Harry Taegyun Yang and Nicolas Strauli and Dennis Montoya and Will Van Der Lay and Jiem R Ronas and Benjamin Statz and Douglas Yao and Velislava Petrova and Alex Zelikovsky and Roberto Spreafico and Sagiv Shifman and Noah Zaitlen and Maura Rossetti and K. Mark Ansel and Eleazar Eskin and Serghei Mangul}, url = {https://doi.org/10.1038/s41467-020-16857-7}, doi = {10.1038/s41467-020-16857-7}, year = {2020}, date = {2020-06-19}, journal = {Nature Communications}, volume = {11}, number = {3126}, publisher = {Nature Publications}, abstract = {Profiling immunoglobulin (Ig) receptor repertoires with specialized assays can be cost-ineffective and time-consuming. Here we report ImReP, a computational method for rapid and accurate profiling of the Ig repertoire, including the complementary-determining region 3 (CDR3), using regular RNA sequencing data such as those from 8,555 samples across 53 tissues types from 544 individuals in the Genotype-Tissue Expression (GTEx v6) project. Using ImReP and GTEx v6 data, we generate a collection of 3.6 million Ig sequences, termed the atlas of immunoglobulin repertoires (TAIR), across a broad range of tissue types that often do not have reported Ig repertoires information. 
Moreover, the flow of Ig clonotypes and inter-tissue repertoire similarities across immune-related tissues are also evaluated. In summary, TAIR is one of the largest collections of CDR3 sequences and tissue types, and should serve as an important resource for studying immunological diseases.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Mosqueiro, Thiago; Abdill, Richard J; Duong, Dat; Mitchell, Keith; Sarwal, Varuni; Hill, Brian L; Brito, Jaqueline J; Littman, Russell Jared; Statz, Benjamin Challenges and recommendations to improve the installability and archival stability of omics computational tools Journal Article PLoS Biology, 17 (6), pp. e3000333, 2019. Abstract | Links | BibTeX | Altmetric @article{mangul2019challenges, title = {Challenges and recommendations to improve the installability and archival stability of omics computational tools}, author = {Serghei Mangul and Thiago Mosqueiro and Richard J Abdill and Dat Duong and Keith Mitchell and Varuni Sarwal and Brian L Hill and Jaqueline J Brito and Russell Jared Littman and Benjamin Statz}, url = {https://doi.org/10.1371/journal.pbio.3000333}, doi = {10.1371/journal.pbio.3000333}, year = {2019}, date = {2019-06-20}, journal = {PLoS Biology}, volume = {17}, number = {6}, pages = {e3000333}, publisher = {Public Library of Science}, abstract = {Developing new software tools for analysis of large-scale biological data is a key component of advancing modern biomedical research. Scientific reproduction of published findings requires running computational tools on data generated by such studies, yet little attention is presently allocated to the installability and archival stability of computational software tools. Scientific journals require data and code sharing, but none currently require authors to guarantee the continuing functionality of newly published tools. We have estimated the archival stability of computational biology software tools by performing an empirical analysis of the internet presence for 36,702 omics software resources published from 2005 to 2017. We found that almost 28% of all resources are currently not accessible through uniform resource locators (URLs) published in the paper they first appeared in. 
Among the 98 software tools selected for our installability test, 51% were deemed “easy to install,” and 28% of the tools failed to be installed at all because of problems in the implementation. Moreover, for papers introducing new software, we found that the number of citations significantly increased when authors provided an easy installation process. We propose for incorporation into journal policy several practical solutions for increasing the widespread installability and archival stability of published bioinformatics software.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul Lab papers authored by Will
Mandric, Igor; Rotman, Jeremy; Yang, Harry Taegyun; Strauli, Nicolas; Montoya, Dennis; Lay, Will Van Der; Ronas, Jiem R; Statz, Benjamin; Yao, Douglas; Petrova, Velislava; Zelikovsky, Alex; Spreafico, Roberto; Shifman, Sagiv; Zaitlen, Noah; Rossetti, Maura; Ansel, Mark K; Eskin, Eleazar; Mangul, Serghei Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing Journal Article Nature Communications, 11 (3126), 2020. Abstract | Links | BibTeX | Altmetric @article{mangul2016profiling, title = {Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing}, author = {Igor Mandric and Jeremy Rotman and Harry Taegyun Yang and Nicolas Strauli and Dennis Montoya and Will Van Der Lay and Jiem R Ronas and Benjamin Statz and Douglas Yao and Velislava Petrova and Alex Zelikovsky and Roberto Spreafico and Sagiv Shifman and Noah Zaitlen and Maura Rossetti and K. Mark Ansel and Eleazar Eskin and Serghei Mangul}, url = {https://doi.org/10.1038/s41467-020-16857-7}, doi = {10.1038/s41467-020-16857-7}, year = {2020}, date = {2020-06-19}, journal = {Nature Communications}, volume = {11}, number = {3126}, publisher = {Nature Publications}, abstract = {Profiling immunoglobulin (Ig) receptor repertoires with specialized assays can be cost-ineffective and time-consuming. Here we report ImReP, a computational method for rapid and accurate profiling of the Ig repertoire, including the complementary-determining region 3 (CDR3), using regular RNA sequencing data such as those from 8,555 samples across 53 tissues types from 544 individuals in the Genotype-Tissue Expression (GTEx v6) project. Using ImReP and GTEx v6 data, we generate a collection of 3.6 million Ig sequences, termed the atlas of immunoglobulin repertoires (TAIR), across a broad range of tissue types that often do not have reported Ig repertoires information. 
Moreover, the flow of Ig clonotypes and inter-tissue repertoire similarities across immune-related tissues are also evaluated. In summary, TAIR is one of the largest collections of CDR3 sequences and tissue types, and should serve as an important resource for studying immunological diseases.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul Lab papers authored by Harry
Mandric, Igor; Rotman, Jeremy; Yang, Harry Taegyun; Strauli, Nicolas; Montoya, Dennis; Lay, Will Van Der; Ronas, Jiem R; Statz, Benjamin; Yao, Douglas; Petrova, Velislava; Zelikovsky, Alex; Spreafico, Roberto; Shifman, Sagiv; Zaitlen, Noah; Rossetti, Maura; Ansel, Mark K; Eskin, Eleazar; Mangul, Serghei Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing Journal Article Nature Communications, 11 (3126), 2020. Abstract | Links | BibTeX | Altmetric @article{mangul2016profiling, title = {Profiling immunoglobulin repertoires across multiple human tissues using RNA sequencing}, author = {Igor Mandric and Jeremy Rotman and Harry Taegyun Yang and Nicolas Strauli and Dennis Montoya and Will Van Der Lay and Jiem R Ronas and Benjamin Statz and Douglas Yao and Velislava Petrova and Alex Zelikovsky and Roberto Spreafico and Sagiv Shifman and Noah Zaitlen and Maura Rossetti and K. Mark Ansel and Eleazar Eskin and Serghei Mangul}, url = {https://doi.org/10.1038/s41467-020-16857-7}, doi = {10.1038/s41467-020-16857-7}, year = {2020}, date = {2020-06-19}, journal = {Nature Communications}, volume = {11}, number = {3126}, publisher = {Nature Publications}, abstract = {Profiling immunoglobulin (Ig) receptor repertoires with specialized assays can be cost-ineffective and time-consuming. Here we report ImReP, a computational method for rapid and accurate profiling of the Ig repertoire, including the complementary-determining region 3 (CDR3), using regular RNA sequencing data such as those from 8,555 samples across 53 tissues types from 544 individuals in the Genotype-Tissue Expression (GTEx v6) project. Using ImReP and GTEx v6 data, we generate a collection of 3.6 million Ig sequences, termed the atlas of immunoglobulin repertoires (TAIR), across a broad range of tissue types that often do not have reported Ig repertoires information. 
Moreover, the flow of Ig clonotypes and inter-tissue repertoire similarities across immune-related tissues are also evaluated. In summary, TAIR is one of the largest collections of CDR3 sequences and tissue types, and should serve as an important resource for studying immunological diseases.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei Benchmarking of computational error-correction methods for next-generation sequencing data Journal Article Genome Biology, 21 (71), 2020. Abstract | Links | BibTeX | Altmetric @article{mitchell2019benchmarking, title = {Benchmarking of computational error-correction methods for next-generation sequencing data}, author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul}, url = {https://doi.org/10.1186/s13059-020-01988-3}, doi = {10.1186/s13059-020-01988-3}, year = {2020}, date = {2020-03-17}, journal = {Genome Biology}, volume = {21}, number = {71}, publisher = {Cold Spring Harbor Laboratory}, abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. 
Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Alser, Mohammed; Rotman, Jeremy; Taraszka, Kodi; Shi, Huwenbo; Baykal, Pelin Icer; Yang, Harry Taegyun; Xue, Victor; Knyazev, Sergey; Singer, Benjamin D; Balliu, Brunilda; Koslicki, David; Skums, Pavel; Zelikovsky, Alex; Alkan, Can; Mutlu, Onur; Mangul, Serghei Technology dictates algorithms: Recent developments in read alignment Journal Article arXiv, 2020. Abstract | Links | BibTeX | Altmetric @article{Alser2020, title = {Technology dictates algorithms: Recent developments in read alignment}, author = {Mohammed Alser and Jeremy Rotman and Kodi Taraszka and Huwenbo Shi and Pelin Icer Baykal and Harry Taegyun Yang and Victor Xue and Sergey Knyazev and Benjamin D Singer and Brunilda Balliu and David Koslicki and Pavel Skums and Alex Zelikovsky and Can Alkan and Onur Mutlu and Serghei Mangul}, url = {https://arxiv.org/abs/2003.00110}, doi = {2003.00110}, year = {2020}, date = {2020-02-28}, journal = {arXiv}, abstract = {Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Aligned reads are essential for answering important biological questions, such as detecting mutations driving various human diseases and complex traits as well as identifying species present in metagenomic samples. The read alignment problem is extremely challenging due to the large size of analyzed datasets and numerous technological limitations of sequencing platforms, and researchers have developed novel bioinformatics algorithms to tackle these difficulties. Our review provides a survey of algorithmic foundations and methodologies across alignment methods for both short and long reads. We provide rigorous experimental evaluation of 11 read aligners to demonstrate the effect of these underlying algorithms on speed and efficiency of read aligners. We separately discuss how longer read lengths produce unique advantages and limitations to read alignment techniques. 
We also discuss how general alignment algorithms have been tailored to the specific needs of various domains in biology, including whole transcriptome, adaptive immune repertoire, and human microbiome studies.}, keywords = {}, pubstate = {published}, tppubtype = {article} }
Mangul, Serghei; Yang, Harry Taegyun; Eskin, Eleazar; Zaitlen, Noah Hidden Treasures in Contemporary RNA Sequencing Book Chapter Hidden Treasures in Contemporary RNA Sequencing. SpringerBriefs in Computer Science, pp. 1–93, Springer, 2019. Abstract | Links | BibTeX | Altmetric @inbook{mangul2019hidden, title = {Hidden Treasures in Contemporary RNA Sequencing}, author = {Serghei Mangul and Harry Taegyun Yang and Eleazar Eskin and Noah Zaitlen}, url = {https://doi.org/10.1007/978-3-030-13973-5_1}, doi = {10.1007/978-3-030-13973-5_1}, year = {2019}, date = {2019-03-02}, booktitle = {Hidden Treasures in Contemporary RNA Sequencing. SpringerBriefs in Computer Science}, pages = {1--93}, publisher = {Springer}, abstract = {High throughput RNA sequencing technologies have provided unprecedented opportunity to explore the individual transcriptome. Unmapped reads, the reads failing to map to the human reference, are a large and often overlooked output of standard RNA-Seq analyses; the hidden treasure in the contemporary RNA-Seq analysis is within the unmapped reads, illuminating previously unexplored biological insights. Here we develop Read Origin Protocol (ROP) to discover the source of all reads originating from complex RNA molecules, recombinant T and B cell receptors, and microbial communities. We applied ROP to 10,641 samples across 2630 individuals from 54 diverse adult human tissues. Our approach can account for 99.9% of 1 trillion reads of various read length. Using in-house RNA-Seq data, we show that immune profiles of asthmatic individuals are significantly different from the profiles of control individuals, with decreased average per sample T and B cell receptor diversity. We also show that microbiomes can be detected in human blood via RNA-Sequencing and may elucidate important clinical changes in patients with schizophrenia. 
Furthermore, we demonstrate that receptor-derived reads among other hidden reads can be used to characterize the overall Ig repertoire across diverse human tissues using RNA-Sequencing. Our results demonstrate the potential of ROP to exploit the hidden treasure in contemporary RNA-Sequencing in order to better understand the functional mechanisms underlying connections between the immune system, microbiome, human gene expression, and disease etiology.}, keywords = {}, pubstate = {published}, tppubtype = {inbook} }
Loohuis, Loes Olde M; Mangul, Serghei; Ori, Anil PS; Jospin, Guillaume; Koslicki, David; Yang, Harry Taegyun; Wu, Timothy; Boks, Marco P; Lomen-Hoerth, Catherine; Wiedau-Pazos, Martina. Transcriptome analysis in whole blood reveals increased microbial diversity in schizophrenia. Journal Article. Translational Psychiatry, 8 (1), pp. 96, 2018.
@article{loohuis2018transcriptome,
  title = {Transcriptome analysis in whole blood reveals increased microbial diversity in schizophrenia},
  author = {Loes Olde M Loohuis and Serghei Mangul and Anil PS Ori and Guillaume Jospin and David Koslicki and Harry Taegyun Yang and Timothy Wu and Marco P Boks and Catherine Lomen-Hoerth and Martina Wiedau-Pazos},
  url = {https://doi.org/10.1038/s41398-018-0107-9},
  doi = {10.1038/s41398-018-0107-9},
  year = {2018},
  date = {2018-05-10},
  journal = {Translational Psychiatry},
  volume = {8},
  number = {1},
  pages = {96},
  publisher = {Nature Publishing Group},
  abstract = {The role of the human microbiome in health and disease is increasingly appreciated. We studied the composition of microbial communities present in blood across 192 individuals, including healthy controls and patients with three disorders affecting the brain: schizophrenia, amyotrophic lateral sclerosis, and bipolar disorder. By using high-quality unmapped RNA sequencing reads as candidate microbial reads, we performed profiling of microbial transcripts detected in whole blood. We were able to detect a wide range of bacterial and archaeal phyla in blood. Interestingly, we observed an increased microbial diversity in schizophrenia patients compared to the three other groups. We replicated this finding in an independent schizophrenia case–control cohort. This increased diversity is inversely correlated with estimated cell abundance of a subpopulation of CD8+ memory T cells in healthy controls, supporting a link between microbial products found in blood, immunity and schizophrenia.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul, Serghei; Yang, Harry Taegyun; Strauli, Nicolas; Gruhl, Franziska; Porath, Hagit T; Hsieh, Kevin; Chen, Linus; Daley, Timothy; Christenson, Stephanie; Wesolowska-Andersen, Agata. ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues. Journal Article. Genome Biology, 19 (1), pp. 36, 2018.
@article{mangul2018rop,
  title = {ROP: dumpster diving in RNA-sequencing to find the source of 1 trillion reads across diverse adult human tissues},
  author = {Serghei Mangul and Harry Taegyun Yang and Nicolas Strauli and Franziska Gruhl and Hagit T Porath and Kevin Hsieh and Linus Chen and Timothy Daley and Stephanie Christenson and Agata Wesolowska-Andersen},
  url = {https://doi.org/10.1186/s13059-018-1403-7},
  doi = {10.1186/s13059-018-1403-7},
  year = {2018},
  date = {2018-02-02},
  journal = {Genome Biology},
  volume = {19},
  number = {1},
  pages = {36},
  publisher = {BioMed Central},
  abstract = {High-throughput RNA-sequencing (RNA-seq) technologies provide an unprecedented opportunity to explore the individual transcriptome. Unmapped reads are a large and often overlooked output of standard RNA-seq analyses. Here, we present Read Origin Protocol (ROP), a tool for discovering the source of all reads originating from complex RNA molecules. We apply ROP to samples across 2630 individuals from 54 diverse human tissues. Our approach can account for 99.9% of 1 trillion reads of various read length. Additionally, we use ROP to investigate the functional mechanisms underlying connections between the immune system, microbiome, and disease. ROP is freely available at https://github.com/smangul1/rop/wiki.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul, Serghei; Yang, Harry Taegyun; Hormozdiari, Farhad; Dainis, Alexandra Marie; Tseng, Elizabeth; Ashley, Euan A; Zelikovsky, Alex; Eskin, Eleazar. HapIso: an accurate method for the haplotype-specific isoforms reconstruction from long single-molecule reads. Journal Article. IEEE Transactions on Nanobioscience, 16 (2), pp. 108–115, 2017.
@article{mangul2017hapiso,
  title = {HapIso: an accurate method for the haplotype-specific isoforms reconstruction from long single-molecule reads},
  author = {Serghei Mangul and Harry Taegyun Yang and Farhad Hormozdiari and Alexandra Marie Dainis and Elizabeth Tseng and Euan A Ashley and Alex Zelikovsky and Eleazar Eskin},
  url = {https://doi.org/10.1109/TNB.2017.2675981},
  doi = {10.1109/TNB.2017.2675981},
  year = {2017},
  date = {2017-03-17},
  journal = {IEEE Transactions on Nanobioscience},
  volume = {16},
  number = {2},
  pages = {108--115},
  publisher = {IEEE},
  abstract = {Sequencing of RNA provides the possibility to study an individual's transcriptome landscape and determine allelic expression ratios. Single-molecule protocols generate multi-kilobase reads longer than most transcripts, allowing sequencing of complete haplotype isoforms. This allows partitioning the reads into two parental haplotypes. While the read length of the single-molecule protocols is long, the relatively high error rate limits the ability to accurately detect the genetic variants and assemble them into the haplotype-specific isoforms. In this paper, we present Haplotype-specific Isoform reconstruction (HapIso), a method able to tolerate the relatively high error rate of the single-molecule platform and partition the isoform reads into the parental alleles. Phasing the reads according to the allele of origin allows our method to efficiently distinguish between the read errors and the true biological mutations. HapIso uses a k-means clustering algorithm aiming to group the reads into two meaningful clusters maximizing the similarity of the reads within the cluster and minimizing the similarity of the reads from different clusters. Each cluster corresponds to a parental haplotype. We used the family pedigree information to evaluate our approach. Experimental validation suggests that HapIso is able to tolerate the relatively high error rate and accurately partition the reads into the parental alleles of the isoform transcripts. We also applied HapIso to novel clinical single-molecule RNA-Seq data to estimate allele-specific expression of genes of interest. Our method was able to correct reads and determine a Glu1883Lys point mutation of clinical significance validated by the GeneDx HCM panel. Furthermore, our method is the first method able to reconstruct the haplotype-specific isoforms from long single-molecule reads.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul Lab papers authored by German
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei. Benchmarking of computational error-correction methods for next-generation sequencing data. Journal Article. Genome Biology, 21 (71), 2020.
@article{mitchell2019benchmarking,
  title = {Benchmarking of computational error-correction methods for next-generation sequencing data},
  author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul},
  url = {https://doi.org/10.1186/s13059-020-01988-3},
  doi = {10.1186/s13059-020-01988-3},
  year = {2020},
  date = {2020-03-17},
  journal = {Genome Biology},
  volume = {21},
  number = {71},
  publisher = {BioMed Central},
  abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul Lab papers authored by Taylor
Mitchell, Keith; Brito, Jaqueline J; Mandric, Igor; Wu, Qiaozhen; Knyazev, Sergey; Chang, Sei; Martin, Lana S; Karlsberg, Aaron; Gerasimov, Ekaterina; Littman, Russell Jared; Hill, Brian L; Wu, Nicholas C; Yang, Harry Taegyun; Hsieh, Kevin; Chen, Linus; Littman, Eli; Shabani, Taylor; Shabanets, German; Yao, Douglas; Sun, Ren; Schroeder, Jan; Eskin, Eleazar; Zelikovsky, Alex; Skums, Pavel; Pop, Mihai; Mangul, Serghei. Benchmarking of computational error-correction methods for next-generation sequencing data. Journal Article. Genome Biology, 21 (71), 2020.
@article{mitchell2019benchmarking,
  title = {Benchmarking of computational error-correction methods for next-generation sequencing data},
  author = {Keith Mitchell and Jaqueline J Brito and Igor Mandric and Qiaozhen Wu and Sergey Knyazev and Sei Chang and Lana S Martin and Aaron Karlsberg and Ekaterina Gerasimov and Russell Jared Littman and Brian L Hill and Nicholas C Wu and Harry Taegyun Yang and Kevin Hsieh and Linus Chen and Eli Littman and Taylor Shabani and German Shabanets and Douglas Yao and Ren Sun and Jan Schroeder and Eleazar Eskin and Alex Zelikovsky and Pavel Skums and Mihai Pop and Serghei Mangul},
  url = {https://doi.org/10.1186/s13059-020-01988-3},
  doi = {10.1186/s13059-020-01988-3},
  year = {2020},
  date = {2020-03-17},
  journal = {Genome Biology},
  volume = {21},
  number = {71},
  publisher = {BioMed Central},
  abstract = {Recent advancements in next-generation sequencing have rapidly improved our ability to study genomic material at an unprecedented scale. Despite substantial improvements in sequencing technologies, errors present in the data still risk confounding downstream analysis and limiting the applicability of sequencing technologies in clinical tools. Computational error correction promises to eliminate sequencing errors, but the relative accuracy of error correction algorithms remains unknown.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}
Mangul Lab papers authored by Kevin
Thomas, Brandon; Karimzada, Mohammed; Spreafico, Roberto; Mangul, Serghei; Botten, Jason W; Rotman, Jeremy; Wesel, Kevin; Binder, Pratibha S; Gharavi, Nima; Chesnut, Robert W. 104 Lack of human papilloma virus transcription in cutaneous squamous cell carcinoma stratified by histological grade and host immune status. Journal Article. Journal of Investigative Dermatology, 137 (5), pp. S18, 2017.
@article{thomas2017104,
  title = {104 Lack of human papilloma virus transcription in cutaneous squamous cell carcinoma stratified by histological grade and host immune status},
  author = {Brandon Thomas and Mohammed Karimzada and Roberto Spreafico and Serghei Mangul and Jason W Botten and Jeremy Rotman and Kevin Wesel and Pratibha S Binder and Nima Gharavi and Robert W Chesnut},
  url = {https://doi.org/10.1016/j.jid.2017.02.118},
  doi = {10.1016/j.jid.2017.02.118},
  year = {2017},
  date = {2017-01-01},
  journal = {Journal of Investigative Dermatology},
  volume = {137},
  number = {5},
  pages = {S18},
  publisher = {Elsevier},
  abstract = {Human Papilloma Virus (HPV) infection is known to contribute to mucosal (m)SCC, but its role in cutaneous (c)SCC progression remains unclear, especially in lesions determined to be at high risk for metastasis. We hypothesized that histologically high grade cSCCs in immunosuppressed patients would display increased transcriptional activity of HPV when compared to low histologic grade lesions in otherwise healthy patients. To assess the role of viruses in cSCC pathogenesis we utilized high throughput RNA sequencing across risk-stratified lesions. A total of 22 skin excisions (11 classified as high grade in immunocompromised patients, 8 classified as low grade in otherwise healthy patients, and 3 as normal skin) were used for detection of any non-human RNA. Reads were aligned to known viral transcriptomes using our recently developed Microbiome Coverage Profiler. While approximately two-thirds of all samples tested positive for HPV gDNA, no skin sample had detectable expression of HPV RNA. Instead, many were found to have expression of Human Endogenous Retroviruses, Simian Virus 40, and Staphylococcus Prophages, while analysis of published datasets of sequenced HeLa cells demonstrated numerous RNA reads for HPV. These results suggest that either HPV does not participate in cSCC development, or facilitates cSCC initiation without affecting tumor progression. The ability to monitor viral and prophage gene expression in skin biopsies will provide insights into the interplay of host-pathogen interactions, and the framework described herein can be used to analyze skin biopsies to facilitate understanding in cases where pathogens are thought to contribute to disease pathogenesis.},
  keywords = {},
  pubstate = {published},
  tppubtype = {article}
}