2.3 Genome annotation
Repetitive elements, including transposable elements and simple sequence repeats, were predicted using LTR-FINDER (Xu and Wang, 2007) and RepeatScout (Price et al., 2005). Repeat types were classified using PASTEClassifier (Hoede et al., 2014). Repeats in the O. bidens genome were then identified by homology-based alignment against the Repbase database (Bao et al., 2015) using RepeatMasker (Tarailo-Graovac and Chen, 2009).
Gene structures were predicted by combining three approaches: de novo, homology-based, and transcriptome-based prediction. De novo prediction was performed using GenScan (Burton et al., 2013) and Augustus (Stanke et al., 2006). Genes of Danio rerio and Ctenopharyngodon idellus from the NCBI database were used for homology-based prediction with GeMoMa (Keilwagen et al., 2019). Transcript annotation results from nine O. bidens tissues were used for transcriptome-based prediction with TransDecoder (http://transdecoder.github.io) and GeneMarkS-T (Tang et al., 2015). Finally, the three evidence sets were integrated using EVM (Haas et al., 2008). MicroRNAs and rRNAs were identified using BLASTN against the Rfam database (Griffiths-Jones et al., 2005), and tRNAs were identified using tRNAscan-SE (Lowe and Eddy, 1997).
Gene functions were annotated using BLAST searches (e-value cutoff of 1e-5) against the NCBI non-redundant protein (NR), EuKaryotic Orthologous Groups (KOG), Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa and Goto, 2000), and TrEMBL (Boeckmann et al., 2003) databases to identify homologous protein-coding genes. GO terms were assigned to genes based on the NR annotation information. Gene functional classes were determined based on the KOG database, and functional pathways were analyzed using the KEGG database. Pseudogenes were identified by searching the predicted gene sets with GeneWise (Birney et al., 2004).
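The e-value filtering step described above can be illustrated with a minimal sketch. It assumes the BLAST results were written in the standard 12-column tabular format (-outfmt 6), which the text does not state, and it keeps the highest-bitscore hit per query, which is likewise an illustrative choice rather than the procedure used in this study. The function name best_hits and the toy identifiers are hypothetical.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6).
# Assumption: the searches were saved in this format; the text does not say so.
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle, max_evalue=1e-5):
    """Return {query_id: (subject_id, evalue, bitscore)} for the best hit per query.

    Hits with an e-value above max_evalue are discarded; among the remaining
    hits, the one with the highest bitscore is kept (an illustrative rule).
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        rec = dict(zip(BLAST6_COLUMNS, row))
        evalue = float(rec["evalue"])
        bitscore = float(rec["bitscore"])
        if evalue > max_evalue:
            continue
        current = best.get(rec["qseqid"])
        if current is None or bitscore > current[2]:
            best[rec["qseqid"]] = (rec["sseqid"], evalue, bitscore)
    return best


# Toy input with made-up gene and protein identifiers.
example = (
    "g1\tP1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "g1\tP2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-40\t250\n"
    "g2\tP3\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.01\t30\n"
)
hits = best_hits(io.StringIO(example))
```

In this toy input, g2 has no hit passing the 1e-5 cutoff and would therefore remain unannotated against that database.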
2.4 Genome evolution analysis
Protein sequences of Sinocyclocheilus rhinocerous, Anabarilius grahami, Labeo rohita, Danio rerio, Oryzias latipes, Takifugu rubripes, Cyprinus carpio, and Gasterosteus aculeatus were downloaded from NCBI (Table S1) and, together with those of O. bidens, were compared using an all-to-all BLAST search with an e-value of 1e-7 to obtain orthologous genes. These genes were clustered into families using OrthoMCL (Li et al., 2003) to identify species-unique gene families. Single-copy orthologous gene clusters were extracted from the OrthoMCL clustering results. Single-copy gene families were used to construct phylogenetic trees based on the Bayes model using PhyML (Guindon et al., 2010). The divergence time was estimated using MCMCTree in PAML with the JC69 model (Yang, 1997). Several calibration times were verified using the TimeTree website (http://www.timetree.org).
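Extracting single-copy and species-unique families from the clustering results can be sketched as follows. The sketch assumes a groups file in which each line reads "group_id: taxon|gene taxon|gene ...", a layout commonly produced by OrthoMCL; the actual file layout used in this study is not given. The taxon codes (obi, dre, ola), the gene identifiers, and all function names are hypothetical.

```python
from collections import Counter


def parse_groups(lines):
    """Parse lines of the assumed form 'group_id: taxon|gene taxon|gene ...'."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        group_id, members = line.split(":", 1)
        # Each member becomes a (taxon, gene) pair.
        groups[group_id.strip()] = [tuple(m.split("|", 1)) for m in members.split()]
    return groups


def single_copy_groups(groups, species):
    """Return IDs of groups in which every listed species has exactly one gene."""
    wanted = set(species)
    result = []
    for group_id, members in groups.items():
        counts = Counter(taxon for taxon, _gene in members)
        if set(counts) == wanted and all(n == 1 for n in counts.values()):
            result.append(group_id)
    return result


def species_specific_groups(groups, taxon):
    """Return IDs of groups whose members all come from a single given species."""
    return [
        group_id
        for group_id, members in groups.items()
        if {t for t, _gene in members} == {taxon}
    ]


# Toy clustering with three hypothetical taxon codes.
toy_lines = [
    "OG1: obi|g1 dre|g2 ola|g3",
    "OG2: obi|g4 obi|g5 dre|g6 ola|g7",
    "OG3: obi|g8 obi|g9",
    "OG4: obi|g10 dre|g11",
]
toy_groups = parse_groups(toy_lines)
single_copy = single_copy_groups(toy_groups, ["obi", "dre", "ola"])
obi_specific = species_specific_groups(toy_groups, "obi")
```

Here only OG1 qualifies as single-copy, because OG2 contains two obi genes and OG4 lacks one species, while OG3 is the only family restricted to obi.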
Gene family size dynamics, including expansion and contraction, were assessed using CAFE v5.0 based on the OrthoMCL results and the phylogenetic trees (De Bie et al., 2006). Expanded and contracted gene families were functionally annotated using BLAST searches against the Pfam database (Hunter et al., 2009).