Introduction
DNA markers from the nuclear and plastid genomes have been widely applied in phylogenetic, evolutionary, and ecological studies over the last few decades (Palmer et al., 1985; Palmer & Thompson, 1982; Palmer & Zamir, 1982). Researchers have identified numerous single-to-low-copy nuclear genes (Li et al., 2008; Li et al., 2017; Small et al., 2004; Wu et al., 2006) for estimating the phylogeny of seed plants from transcriptomic and genomic data. Building phylogenetic frameworks from different types of next-generation sequencing (NGS) data and large-scale molecular markers is becoming commonplace and is fundamental to other applied biological studies (Wang et al., 2020; Wen et al., 2020). Molecular markers can be obtained from a variety of sequencing resources, including RNA-seq (Wang et al., 2009), Hyb-Seq (Weitemier et al., 2014), and shallow whole-genome sequencing (genome skimming) (Straub et al., 2012). For example, Allen et al. (2017) obtained single-copy orthologs from whole-genome sequencing, Zhang et al. (2019) extracted single-copy orthologs and ultraconserved elements from genome skimming data, and Liu et al. (2021) captured single-copy nuclear genes, organellar genomes, and nuclear ribosomal DNA from deep genome skimming data. Several tools support this kind of marker extraction. aTRAM (Allen et al., 2015, 2018) uses a BLAST-based iterative search-and-assemble approach to extract specific genes from NGS data (Allen et al., 2017). HybPiper (Johnson et al., 2016) and HybPhyloMaker (Fér & Schmickl, 2018) filter reads by mapping them to a reference with BWA or bowtie2 and then assemble the retained reads into contigs. We have also developed software for mining NGS data: our previous tool, Easy353 (Zhang et al., 2022), enables researchers to mine Angiosperms353 (Johnson et al., 2019) genes from transcriptome and enriched-genome data using a reference-guided de Bruijn graph.
However, the aforementioned workflows and tools still present the following challenges for retrieving phylogenetic markers: (1) read coverage is uneven, with some markers covered much more or less deeply than others; (2) the stability and accuracy of the assembly results depend on the chosen reference sequence; (3) putative paralogs in the assembly results can lead to misestimation of branch lengths; and (4) the tools require high-performance computing servers and advanced bioinformatics skills.
In phylogenetics, molecular markers are typically short, and their arrangement in the genome is often irrelevant. Single-to-low-copy orthologous genes, a small subset of genomic data, are often used for phylogenetic studies at the genus level or higher taxonomic levels. Obtaining these genes therefore does not require assembling a complete genome. With these considerations in mind, we introduce GeneMiner, a pipeline designed for extracting phylogenetic markers from short-read NGS datasets. The pipeline uses our own reference-guided de Bruijn graph construction algorithm, which deliberately avoids the need for independent assembly tools such as SPAdes or Velvet. Compared with other available tools, GeneMiner can capture gene fragments from both transcriptome and genome skimming raw sequencing data more quickly, accurately, and comprehensively. Importantly, GeneMiner can achieve all of this on personal computers. Compared with our previous tool, Easy353, GeneMiner contains several new features: (1) no restriction on the type of molecular marker, with direct support for sequences in GenBank format as reference sequences; (2) a verification method to evaluate the accuracy of recovered target genes; (3) an optimized weighted node model to accommodate distantly related reference sequences; and (4) a collection of new methods, such as re-filtering, re-assembly, and soft boundary, to improve assembly capability. Additionally, GeneMiner is cross-platform, supporting Windows, Mac, and Linux operating systems; provides a user-friendly GUI for Windows and Mac users (Figure 1-A); and has distinct computational parameters that improve accuracy over other tools in this category (Figure 1-B).
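To make the general idea of reference-guided assembly concrete, the following minimal Python sketch filters reads by shared k-mers with a reference and then walks a de Bruijn graph built only from the retained reads. It is an illustration under simplifying assumptions, not GeneMiner's implementation: the function names (`kmers`, `filter_reads`, `build_graph`, `extend`), the greedy most-frequent-edge walk, and the choice of seed are all hypothetical, and the weighted node model, re-filtering, re-assembly, and soft boundary steps mentioned above are not represented.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, in order (sequence is uppercased)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def filter_reads(reads, reference, k):
    """Keep reads that share at least one k-mer with the reference."""
    ref_kmers = set(kmers(reference, k))
    return [r for r in reads if any(km in ref_kmers for km in kmers(r, k))]


def build_graph(reads, k):
    """Build a de Bruijn graph: (k-1)-mer -> {next (k-1)-mer: edge count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for km in kmers(read, k):
            graph[km[:-1]][km[1:]] += 1
    return graph


def extend(graph, seed, k, max_len=1000):
    """Greedily extend a seed k-mer rightwards along the most frequent edge.

    Stops at a dead end, at a node already visited (to avoid cycles),
    or when the contig reaches max_len.
    """
    contig = seed
    node = seed[-(k - 1):]
    visited = {node}
    while node in graph and len(contig) < max_len:
        nxt = max(graph[node], key=graph[node].get)
        if nxt in visited:
            break
        visited.add(nxt)
        contig += nxt[-1]
        node = nxt
    return contig


if __name__ == "__main__":
    # Toy data (made up for illustration): the sample differs from the
    # reference at one position, and two reads are off-target.
    sample = "ATGGCGTACGTTAGC"
    reference = "ATGGCGTACGATAGC"
    reads = [sample[i:i + 10] for i in range(6)] + ["TTTTTTTTTT", "CCCCCCCCCC"]

    k = 5
    kept = filter_reads(reads, reference, k)
    graph = build_graph(kept, k)
    print(extend(graph, reference[:k], k))
```

In this toy case the walk is seeded with the first k-mer of the reference but follows edges supported by the sample reads, so the recovered contig reflects the sample rather than the reference. Real data would also need handling of reverse complements, sequencing errors, and branching paths, all of which are omitted here.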