Echinobase Gene Nomenclature Guidelines
The goal of these guidelines is to develop standards for naming protein-coding genes, non-coding sequences, and genomic resources (hereafter referred to as ‘sequences’) curated by Echinobase. The nomenclature system should be consistent, intuitive, succinct, and generalizable. Sequence names should also be human- and machine-readable.
The nomenclature below currently covers only protein-coding genes and their respective proteins. Nomenclature for genes that encode non-coding RNAs is in development.
These guidelines will be used to name the sequences curated by Echinobase. They have been informed by standards employed by other Model Organism Databases (MODs). Echinobase administers the nomenclature for all species currently hosted on the site and for future echinoderm species as they are sequenced.
To facilitate broad phylogenetic analyses and simplify investigations across phyla, sequence nomenclature will primarily be informed by orthology, meeting criteria established by the Alliance of Genome Resources (AGR). However, these guidelines also delineate protocols for naming sequences that lack characterized orthologs in other species.
Nomenclature guidelines for genes, non-coding sequences, and genomic resources
For these guidelines, we rely on the following definitions:
Echinobase sequence nomenclature is primarily based on sequence orthology. To assign orthology, we use an approach based on the DRSC Integrative Ortholog Prediction Tool (DIOPT), which integrates output from several algorithms. For each sequence in Echinobase, orthology predictions will be presented for humans (with further expansion to other MODs in consideration). Note that individual orthology tools’ predictions are not definitive, but rather represent the likelihood that the pair is orthologous. Thus, several algorithms that report the same pairing provide additional confidence in the orthology assignment. Following the AGR guidelines, a gene pair must be supported across three or more orthology tools to be considered an ortholog and utilized in the nomenclature pipeline. Genes which have not yet been processed through the nomenclature pipeline will retain provisionally or partially curated gene symbols.
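The three-tool support rule described above can be sketched in a few lines. This is a minimal illustration, not the Echinobase pipeline: the input layout (one tuple per tool prediction), the function name, and the tool and gene labels in the usage note are all hypothetical.

```python
from collections import defaultdict

MIN_SUPPORTING_TOOLS = 3  # threshold stated in the guidelines


def supported_orthologs(predictions, min_tools=MIN_SUPPORTING_TOOLS):
    """Return {(echinoderm_gene, human_gene): n_tools} for supported pairs.

    `predictions` is an iterable of (tool, echinoderm_gene, human_gene)
    tuples. This layout is an assumption made for the sketch.
    """
    tools_by_pair = defaultdict(set)
    for tool, echinoderm_gene, human_gene in predictions:
        # A set ensures each tool is counted once per gene pair.
        tools_by_pair[(echinoderm_gene, human_gene)].add(tool)
    return {
        pair: len(tools)
        for pair, tools in tools_by_pair.items()
        if len(tools) >= min_tools
    }
```

For example, if hypothetical tools "toolA", "toolB", and "toolC" all pair an echinoderm gene with the same human gene, that pair is kept; a pair reported by only two tools is dropped and the gene keeps its provisional symbol.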
1. Protein-Coding Gene nomenclature
Each gene page is assigned the following:
- Gene name. A descriptive identifier that briefly conveys the structure or function of the gene. (In legacy usage, these descriptors, if present, were listed as synonyms.)
- Gene symbol. A short abbreviation for speaking and writing about the gene. (In legacy usage, such symbols were referred to as “gene names”.)
- Unique identifier. Each gene is attached to a gene page on Echinobase, which contains information relating to orthologous genes across all supported echinoderms. Each gene page has a unique identifier consisting of a three-letter database designation (ECB) and the type of data (e.g. GENEPAGE), followed by a set of digits. Individual species’ genes on each gene page also have their own unique identifiers.
Gene names and symbols are based on orthology and should be assigned according to the hierarchy below:
- If an echinoderm gene has a single human ortholog (defined as being supported by three or more orthology tools), the echinoderm gene should be assigned the human gene name and symbol. Deference is given to the human nomenclature as defined by the HUGO Gene Nomenclature Committee (HGNC).
- If an echinoderm gene has multiple human orthologs, the symbol and name of the ortholog that matches on the most orthology tools will be used, adjusted as needed to accommodate our guidelines (lower-cased, no Roman numerals, etc.). If there are multiple best matches for orthology, assign the first in alphanumeric order (numbers precede letters).
- If there is more than one echinoderm ortholog for a single human gene, and they have been identified as pseudoduplicates post-assembly (see section 1.4), assign letters to the echinoderm genes following a decimal point at the end of the human gene stem. These suffixes are independent of species (genes across different species can share a name without an implied orthology at this level).
- If there is more than one echinoderm ortholog for a single human gene, assign numbers to the echinoderm genes following a decimal point at the end of the human gene stem. These suffixes are independent of species (genes across different species can share a name without an implied orthology at this level).
- If an echinoderm gene has no orthologs (novel genes), it retains its NCBI-assigned symbol, which is typically formatted as its gene/Entrez ID preceded by “LOC”. This symbol may be updated to match gene names and symbols established in new or existing peer-reviewed primary research publications (e.g., the gene’s legacy name), provided the gene symbol/name does not conflict with the nomenclature guidelines (e.g. does not match a nonorthologous gene already in use in human systems). In these cases, new gene identities should be resolved in collaboration with the Echinobase Nomenclature Coordinator.
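The hierarchy above can be sketched as three small helpers. This is an illustrative reading of the rules, not Echinobase's implementation: the function names and the input layout (a mapping from human symbol to the number of supporting tools) are hypothetical, "alphanumeric order" is assumed to mean plain character-by-character comparison of lower-cased symbols, and lower-case suffix letters are assumed for pseudoduplicates.

```python
import string

MIN_TOOLS = 3  # support threshold stated in the guidelines


def pick_human_stem(tool_counts):
    """Choose the human symbol to use as the stem, or None if unsupported.

    `tool_counts` maps human gene symbol -> number of supporting tools.
    """
    qualified = {sym: n for sym, n in tool_counts.items() if n >= MIN_TOOLS}
    if not qualified:
        return None
    best = max(qualified.values())
    # Ties: first in alphanumeric order; digits sort before letters in ASCII.
    return min(sym.lower() for sym, n in qualified.items() if n == best)


def suffixed_symbols(stem, count, pseudoduplicates=False):
    """Return symbols for `count` echinoderm genes sharing one human stem."""
    if count == 1:
        return [stem]
    if pseudoduplicates:
        if count > len(string.ascii_lowercase):
            raise ValueError("more than 26 pseudoduplicates is not handled here")
        suffixes = list(string.ascii_lowercase[:count])  # .a, .b, ...
    else:
        suffixes = [str(i) for i in range(1, count + 1)]  # .1, .2, ...
    return [f"{stem}.{suffix}" for suffix in suffixes]


def provisional_symbol(ncbi_gene_id):
    """Symbol retained by a gene with no supported ortholog."""
    return f"LOC{ncbi_gene_id}"
```

With made-up counts, `pick_human_stem({"SOX2": 4, "SOX3": 4, "SOX14": 3})` selects the stem `sox2`, and `suffixed_symbols("sox2", 2, pseudoduplicates=True)` yields `sox2.a` and `sox2.b`. Which gene receives which suffix is a curation decision that this sketch does not address.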
Echinobase will maintain synonyms (including legacy gene symbols as well as previous symbols and names if orthology-derived identifiers update) for genes for searching purposes.
1.1 Gene Names
Gene names should have the following characteristics:
- Convey the character or function of the gene.
- Be brief and specific.
- Be lower-case and italicized (except acronyms, initialisms, and proper names, which are capitalized).
- Be unique with respect to other named genes.
- Use American spelling.
- Only contain Latin letters and Arabic numbers.
- Spell out Greek letters (β → beta) or substitute the comparable Latin letter (β → b), as denoted in the table at the end of this nomenclature.
- Change Roman numerals to Arabic equivalents: IV → 4
- Not contain punctuation, except:
  - Where necessary to separate the main part of the name from modifiers
  - If a comma is part of an established protein name
  - If the source ortholog gene’s name contains punctuation
  - Periods as needed (see sections 1.4 and 1.5)
- Not inherit potentially misleading details from orthology assignments (when legacy names are retained, this rule is not applied). This includes:
  - Protein information. Example: "59 kDa"
  - Expression information. Example: "kidney-specific"
  - Species names. Some old gene names included the species name where the gene was first identified.
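Two of the mechanical rules above (spelling out Greek letters and converting Roman numerals) can be sketched as a simple text substitution. The lookup tables and function name below are hypothetical and deliberately partial. Single-letter numerals (I, V, X) are left out because they cannot be told apart from ordinary letters without curation, and the other rules (brevity, uniqueness, misleading details) still need human judgment.

```python
import re

# Partial lookup tables, for illustration only.
GREEK_TO_NAME = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta"}
ROMAN_TO_ARABIC = {
    "II": "2", "III": "3", "IV": "4", "VI": "6",
    "VII": "7", "VIII": "8", "IX": "9",
}

# Longest numerals first, matched only as whole words.
_ROMAN_RE = re.compile(
    r"\b(" + "|".join(sorted(ROMAN_TO_ARABIC, key=len, reverse=True)) + r")\b"
)


def normalize_gene_name(name):
    """Spell out Greek letters and convert Roman numerals to Arabic."""
    for greek, spelled in GREEK_TO_NAME.items():
        name = name.replace(greek, spelled)
    return _ROMAN_RE.sub(lambda m: ROMAN_TO_ARABIC[m.group(1)], name)
```

For instance, `normalize_gene_name("collagen type IV α 1 chain")` gives `collagen type 4 alpha 1 chain`.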
1.2 Gene Symbols
Gene symbols are abbreviated versions of gene names and should have the following characteristics:
- Consist of 10 or fewer characters (optimally 3-5 characters).
- Be lower-case and italicized.
- Be identical to the human gene symbol if the gene name is derived from an orthologous human gene.
- Only contain Latin letters and Arabic numbers, with no spaces or punctuation (paralogs and pseudoduplicates are an exception, see sections 1.4 and 1.5).
- Convert Greek letters to Latin letters (e.g. β → b), as denoted in the table at the end of this nomenclature.
- Change Roman numerals to Arabic equivalents: IV → 4
- Not start with species designators (e.g. Sp, Pm, Lv).
- Be unique and not match common words or abbreviations, which cause problems with database searching (e.g. dna, egta, pcr, pbs, can, and, get ... ).
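Several of the symbol rules above are mechanical enough to check automatically. The sketch below is one possible checker, written under stated assumptions: the function name is hypothetical, the 10-character limit is applied to the whole symbol including any suffix, the species-designator test only catches legacy-style prefixes followed by a hyphen or underscore (a plain prefix test would wrongly reject symbols that merely begin with those letters), and the word list contains only the examples given in the guidelines. Uniqueness and provisional "LOC" symbols are out of scope here.

```python
import re

COMMON_WORDS = {"dna", "egta", "pcr", "pbs", "can", "and", "get"}
# Letters and digits, optionally followed by one ".suffix"
# (the paralog/pseudoduplicate exception).
ALLOWED = re.compile(r"[a-z0-9]+(\.[a-z0-9]+)?")
# Heuristic for legacy-style designators such as "Sp-"; an assumption.
LEGACY_PREFIX = re.compile(r"(Sp|Pm|Lv)[-_]")


def symbol_problems(symbol):
    """Return a list of rule violations; an empty list means none found."""
    problems = []
    if len(symbol) > 10:
        problems.append("longer than 10 characters")
    if symbol != symbol.lower():
        problems.append("not lower-case")
    if not ALLOWED.fullmatch(symbol.lower()):
        problems.append("contains spaces or disallowed punctuation")
    if LEGACY_PREFIX.match(symbol):
        problems.append("starts with a species designator")
    if symbol.lower() in COMMON_WORDS:
        problems.append("matches a common word or abbreviation")
    return problems
```

A symbol such as `sox2.1` passes all checks, while `pcr` is flagged as a common abbreviation and a legacy-style `Sp-Sox2` is flagged for case, punctuation, and its species designator.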
1.3 Echinobase Identifiers
Echinobase will assign each gene a unique identifier consisting of a three-letter database designation (ECB) and the type of name (e.g. GENE), followed by eight digits.
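A small parser can illustrate the identifier layout described above. The guidelines do not specify how the three parts are joined, so the hyphen separator in this sketch is an assumption, as are the function name and the placeholder identifiers in the usage note.

```python
import re

# Assumed layout: ECB-<TYPE>-<8 digits>. The hyphens are an assumption.
IDENTIFIER_RE = re.compile(r"ECB-(?P<kind>[A-Z]+)-(?P<digits>\d{8})")


def parse_identifier(text):
    """Return (kind, digits) for a well-formed identifier, else None."""
    match = IDENTIFIER_RE.fullmatch(text)
    if match is None:
        return None
    # Digits stay a string so that leading zeros are preserved.
    return match["kind"], match["digits"]
```

Under these assumptions, a placeholder such as `ECB-GENE-00000001` parses to `("GENE", "00000001")`, and a string with the wrong prefix or digit count returns `None`.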
1.4 Pseudoduplicates of genes
Because haplotypes vary widely in many echinoderm species, assemblies commonly contain multiple artifactual duplicates of genes with identical sequences. From one assembly version to the next, many of these are collapsed into single genes, while new false duplicates can emerge. Post-assembly, Echinobase identifies these false duplicate clusters by reciprocal BLAST with 1 kb upstream and downstream extensions, conservatively identifying false duplicates as those with 90% or more identity over 90% or more of the larger gene. This method can also extract clusters of tandem duplicates with highly similar sequences. These two classes of genes, collectively referred to here as pseudoduplicates, will share an identical gene symbol as their root, followed by a decimal point and a Latin letter identifier. The use of a Latin letter rather than an Arabic numeral identifies these genes as highly similar sequences. Future genome assemblies are expected to further consolidate false duplicate clusters.
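The 90%/90% criterion can be written as a short predicate. This is a sketch of the threshold test only, not of the reciprocal BLAST step itself: the function and parameter names are hypothetical, and whether the gene lengths include the 1 kb flanking extensions is left to the caller because the text does not say.

```python
def is_pseudoduplicate(percent_identity, aligned_length, length_a, length_b,
                       min_identity=90.0, min_coverage=90.0):
    """Apply the identity and coverage thresholds to one gene pair.

    Coverage is measured against the larger of the two genes, as stated
    in the guidelines. All lengths are in base pairs.
    """
    larger = max(length_a, length_b)
    coverage = 100.0 * aligned_length / larger
    return percent_identity >= min_identity and coverage >= min_coverage
```

For example, a 2,800 bp alignment at 95% identity between genes of 3,000 bp and 2,900 bp covers about 93% of the larger gene and passes, whereas a 2,000 bp alignment between the same genes covers about 67% and does not.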
2. Protein nomenclature
Protein designations follow the same rules as gene names and symbols with a few changes as follows:
- Protein names and symbols are not italicized.
- Protein names should have the first letter capitalized.
- The word "protein" or additional terms are not included.
Echinobase Gene Nomenclature Committee:
Veronica Hinman (chair)
Please direct all comments or questions to the Echinobase Nomenclature Coordinator, Toms Beatman