2.3 Genome annotation
Repetitive elements, including transposable elements and simple sequence
repeats, were predicted using LTR-FINDER (Xu and Wang, 2007) and
RepeatScout (Price et al., 2005). Repeat types were classified using
PASTEClassifier (Hoede et al., 2014). Repeats in the O. bidens genome
were then identified by homology-based alignment against the Repbase
database (Bao et al., 2015) using RepeatMasker (Tarailo-Graovac and
Chen, 2009).
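As an illustration of how RepeatMasker results can be summarized, the following minimal Python sketch sums the annotated base pairs per repeat class from the standard whitespace-delimited .out table. The function name and the example rows are hypothetical and are not taken from the O. bidens analysis. Overlapping annotations are counted more than once in this simplified version.

```python
from collections import defaultdict


def summarize_repeatmasker_out(lines):
    """Sum annotated base pairs per repeat class/family from RepeatMasker .out lines."""
    bp_per_class = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Skip blank lines and header lines (their first field is not a numeric score).
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])  # position in query sequence
        repeat_class = fields[10]                    # repeat class/family column
        bp_per_class[repeat_class] += end - begin + 1
    return dict(bp_per_class)


# Hypothetical example rows in the .out layout (not real data).
example = """\
   SW   perc perc perc  query     position in query    matching  repeat       position in repeat
score   div. del. ins.  sequence  begin end   (left)   repeat    class/family begin end (left) ID

  463   1.3  0.6  1.7  chr1      101   200   (9800) +  L2a       LINE/L2      1    100  (0)    1
  239  29.4  1.9  1.0  chr1      301   350   (9650) C  (CA)n     Simple_repeat (0)  50   1      2
""".splitlines()

summary = summarize_repeatmasker_out(example)
print(summary)
```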
Gene structure analysis was performed using three combined methods: de
novo prediction, homology-based prediction, and transcriptome-based
prediction. De novo prediction was performed using GenScan (Burton et al.,
2013) and Augustus (Stanke et al., 2006). Genes of Danio rerio and
Ctenopharyngodon idellus, obtained from the NCBI database, were used for
homology-based prediction with GeMoMa (Keilwagen et al., 2019). Transcript
annotation results from nine O. bidens tissues were used to
perform transcriptome-based prediction using TransDecoder
(http://transdecoder.github.io) and GeneMarkS-T (Tang et al., 2015).
Finally, the three evidence sets were integrated using EVidenceModeler
(EVM) (Haas et al., 2008). MicroRNAs and rRNAs were identified by BLASTN
searches against the
Rfam database (Griffiths-Jones et al., 2005), and tRNAs were identified
using tRNAscan-SE (Lowe and Eddy, 1997).
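Before evidence sets are integrated, it can be useful to tally how many gene models each prediction source contributes. The sketch below counts features by the source column of GFF3 records. It is only an illustration: the source labels and coordinates are hypothetical, and it does not reproduce the weighting that EVM applies.

```python
from collections import Counter


def count_features_by_source(gff3_lines, feature_type="gene"):
    """Count GFF3 features of one type, grouped by the source column (column 2)."""
    counts = Counter()
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # a GFF3 record has exactly nine tab-separated columns
        source, ftype = cols[1], cols[2]
        if ftype == feature_type:
            counts[source] += 1
    return dict(counts)


# Hypothetical records: seqid, source, type, start, end, score, strand, phase, attributes.
records = [
    ("chr1", "augustus", "gene", "100", "900", ".", "+", ".", "ID=g1"),
    ("chr1", "augustus", "mRNA", "100", "900", ".", "+", ".", "ID=g1.t1;Parent=g1"),
    ("chr1", "gemoma", "gene", "120", "880", ".", "+", ".", "ID=h1"),
    ("chr2", "augustus", "gene", "50", "400", ".", "-", ".", "ID=g2"),
]
gff3_lines = ["##gff-version 3"] + ["\t".join(r) for r in records]

gene_counts = count_features_by_source(gff3_lines)
print(gene_counts)
```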
Gene functions were annotated by BLAST searches (e-value cutoff of 1e-5)
against the NCBI non-redundant protein (NR), EuKaryotic Orthologous
Groups (KOG), Gene Ontology (GO), Kyoto Encyclopedia of Genes and
Genomes (KEGG) (Kanehisa and Goto, 2000), and TrEMBL (Boeckmann et al.,
2003) databases to identify homologous protein-coding genes. GO terms
were assigned to genes based on the NR annotation
information. Gene functional classes were determined based on the KOG
database, and functional pathways were analyzed using the KEGG database.
Pseudogenes were identified by searching the identified gene sets using
GeneWise (Birney et al., 2004).
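The e-value threshold used for functional annotation can be illustrated with a short Python sketch that filters BLAST tabular output (the default 12-column layout of -outfmt 6) and keeps the highest-scoring hit per query. The function name and the example hits are hypothetical. Choosing the best hit by bit score is an assumption of this sketch, not a rule stated in the text.

```python
def best_hits(blast_tab_lines, max_evalue=1e-5):
    """Return the best hit per query (highest bit score) among hits passing the e-value cutoff."""
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        query, subject = cols[0], cols[1]
        evalue, bitscore = float(cols[10]), float(cols[11])  # columns 11 and 12 of outfmt 6
        if evalue > max_evalue:
            continue  # discard hits weaker than the cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
rows = [
    ("gene1", "protA", "85.0", "200", "30", "0", "1", "200", "1", "200", "1e-50", "350"),
    ("gene1", "protB", "60.0", "180", "72", "2", "5", "184", "3", "182", "1e-20", "150"),
    ("gene2", "protC", "35.0", "90", "58", "1", "1", "90", "10", "99", "0.01", "40"),
]
hits = best_hits("\t".join(r) for r in rows)
print(hits)
```

In this example gene2 receives no annotation because its only hit has an e-value above 1e-5.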
2.4 Genome evolution analysis
The protein sequences of different species downloaded from NCBI,
including Sinocyclocheilus rhinocerous, Anabarilius grahami, Labeo
rohita, Danio rerio, Oryzias latipes, Takifugu rubripes, Cyprinus
carpio, and Gasterosteus aculeatus (Table S1), together with those of
O. bidens, were analyzed using an all-to-all BLAST search with an
e-value of 1e-7
to obtain orthologous genes. These genes were clustered into families to
identify species-unique gene families using OrthoMCL (Li et al., 2003).
Single-copy orthologous gene clusters were extracted from the OrthoMCL
clustering results. Single-copy gene families were used to construct
phylogenetic trees based on the Bayes model using PhyML (Guindon et al.,
2010). Divergence times were estimated using MCMCTree in the PAML
package with the JC69 model (Yang, 1997). Several calibration times were verified
using the TimeTree website (http://www.timetree.org).
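The extraction of single-copy orthologous clusters can be sketched as follows, assuming an OrthoMCL-style groups file in which each line has the form "group_id: taxon|gene taxon|gene ...". The taxon codes (obid, drer, olat) and gene identifiers are hypothetical placeholders. A cluster is kept only if every listed species contributes exactly one gene.

```python
def single_copy_groups(group_lines, species):
    """Return clusters in which each species in `species` has exactly one member gene."""
    wanted = set(species)
    result = {}
    for line in group_lines:
        if ":" not in line:
            continue
        group_id, members = line.split(":", 1)
        per_species = {}
        for member in members.split():
            taxon, gene = member.split("|", 1)
            per_species.setdefault(taxon, []).append(gene)
        # Keep the cluster only if all species are present, each with a single gene.
        if set(per_species) == wanted and all(len(g) == 1 for g in per_species.values()):
            result[group_id.strip()] = {t: g[0] for t, g in per_species.items()}
    return result


# Hypothetical clusters for three taxon codes.
groups = [
    "OG1: obid|g1 drer|d1 olat|o1",
    "OG2: obid|g2 obid|g3 drer|d2 olat|o2",
    "OG3: obid|g4 drer|d3",
]
single_copy = single_copy_groups(groups, ["obid", "drer", "olat"])
print(single_copy)
```

Here only OG1 qualifies: OG2 contains two obid genes and OG3 lacks an olat member.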
Gene family size dynamics, including expansion and contraction, were
assessed using CAFE v5.0 based on the OrthoMCL results and phylogenetic
trees (De Bie et al., 2006). Functional annotations of expanded and
contracted genes were performed using BLAST to search the Pfam database
(Hunter et al., 2009).
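Gene family analysis of this kind starts from a table of gene counts per family and species. The sketch below builds such a table from OrthoMCL-style cluster lines, using the tab-delimited layout generally expected by CAFE 5 (a "Desc" column, a "Family ID" column, then one column per species). The taxon codes and cluster contents are hypothetical, and the exact input requirements should be checked against the CAFE documentation.

```python
from collections import Counter


def family_count_table(group_lines, species):
    """Build tab-delimited rows of per-species gene counts for each cluster."""
    rows = ["\t".join(["Desc", "Family ID", *species])]
    for line in group_lines:
        if ":" not in line:
            continue
        group_id, members = line.split(":", 1)
        # Count members per taxon code; absent species yield 0 via Counter.
        counts = Counter(m.split("|", 1)[0] for m in members.split())
        rows.append("\t".join(["(null)", group_id.strip(), *(str(counts[s]) for s in species)]))
    return rows


# Hypothetical clusters for three taxon codes.
groups = [
    "OG1: obid|g1 drer|d1 olat|o1",
    "OG2: obid|g2 obid|g3 drer|d2 olat|o2",
    "OG3: obid|g4 drer|d3",
]
table = family_count_table(groups, ["obid", "drer", "olat"])
print("\n".join(table))
```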