Echinobase Nomenclature Guidelines for Genes and Other Sequences
Develop standards for naming genes, transcripts, non-coding sequences, and genomic resources (hereafter referred to as ‘sequences’) curated by Echinobase. This nomenclature system should be consistent, intuitive, succinct, and generalizable. Sequence names should also be human- and machine-readable.
These guidelines will be used to name the sequences curated by Echinobase. They have been informed by standards employed by other Model Organism Databases (MODs). Echinobase administers the nomenclature for all species currently hosted on the site and will do so for future echinoderm species as they are sequenced.
To facilitate broad phylogenetic analyses and simplify investigations across phyla, sequence nomenclature will primarily be informed by homology in conjunction with needs for concision. However, these guidelines also delineate protocols for naming sequences that lack characterized orthologs in other species.
Nomenclature guidelines for genes, transcripts, non-coding sequences, and genomic resources
For these guidelines, we rely on the following definitions:
Echinobase sequence nomenclature is primarily based on sequence orthology. To assign orthology, we use an approach based on the DRSC Integrative Ortholog Prediction Tool (DIOPT), which integrates output from several algorithms. For each sequence in Echinobase, orthology predictions will be presented for multiple species with direct links to other MODs (e.g. ENSEMBL). Note that ortholog predictions are not definitive; each represents the likelihood that a pair of genes is orthologous. Agreement among several algorithms on the same pairing therefore provides additional confidence in the orthology assignment. Unless the orthology assignment is conclusive (following development of the new orthology pipeline for 5.0), gene symbols and names will be given provisional status.
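The idea that agreement among algorithms raises confidence can be illustrated with a simple tally. This is only a minimal sketch and not DIOPT's actual scoring; the function name, the algorithm names, and the gene symbols are all hypothetical placeholders.

```python
from collections import Counter


def tally_ortholog_support(predictions):
    """Count how many algorithms report each candidate ortholog.

    `predictions` maps an algorithm name to the ortholog it predicts,
    or to None when the algorithm makes no call.
    """
    votes = Counter(hit for hit in predictions.values() if hit is not None)
    # Candidates supported by the most algorithms come first.
    return votes.most_common()


# Hypothetical algorithm names and gene symbols, for illustration only.
example = {
    "algo_a": "GENEX1",
    "algo_b": "GENEX1",
    "algo_c": "GENEX2",
    "algo_d": None,
}
print(tally_ortholog_support(example))
```

A candidate reported by more algorithms ranks higher, but the count is a measure of support, not proof of orthology.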
1. Gene nomenclature
Each gene is assigned the following:
- Gene name. A descriptive identifier that briefly conveys the structure or function of the gene.
- Gene symbol. A short abbreviation for speaking and writing about the gene.
- Unique identifier. An identifier for indexing by Echinobase. Identifiers consist of a three-letter database designation (ECB), the type of name (e.g. GENE), and eight digits.
Gene names and symbols are based on orthology and should be assigned according to the hierarchy below:
- If an echinoderm gene has a single vertebrate ortholog, the echinoderm gene should be assigned the vertebrate gene name and symbol. Deference is given to the human nomenclature as defined by the HUGO Gene Nomenclature Committee (HGNC).
- If an echinoderm gene has multiple co-orthologs in vertebrates, the echinoderm gene name should consist of the root name of the vertebrate gene appended with the best matching ortholog number (or letters as appropriate). If there are multiple best matches for orthology, choose the first in alphanumeric order (numbers precede letters). When multiple orthologs are found, and some do not belong to the same gene family, resolve in collaboration with the Gene Nomenclature Coordinator.
- If there is more than one echinoderm co-ortholog for a single vertebrate gene, assign numbers to the echinoderm genes following a decimal point at the end of the vertebrate gene symbol.
- If an echinoderm gene lacks a vertebrate ortholog but has a single invertebrate ortholog, the echinoderm gene should be assigned the gene name from that species. The echinoderm gene symbol should be the same as that of the orthologous gene.
- If an echinoderm gene has structural or functional similarity to a vertebrate, invertebrate or prokaryotic gene but no indicated orthology (e.g. in silico predicted genes that show a high degree of sequence homology to well-characterized genes, or genes that share a legacy name and symbol with them), use the letter "l" (lower case 'L') to denote "-like" designations.
- For more complex evolutionary relationships, see the section below on paralogs and gene families (section 1.4).
- Echinobase will assign official gene symbols and names. Novel names for genes without clear orthology will be developed in collaboration with the HGNC.
- If orthology/likeness is uncertain (echinoderm-specific genes), gene symbols/names are established by the earliest date of publication in a peer-reviewed primary research paper (e.g., their legacy name), unless the gene symbol/name is already in use in the HGNC. In that case, resolve in collaboration with the Echinobase Nomenclature Coordinator.
These guidelines can be overridden in favor of nomenclature that is clearly preferred by the community, either on a gene-by-gene basis or for an entire gene family or other functional grouping.
Echinobase will maintain synonyms (including legacy gene symbols) for genes for searching purposes.
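Parts of the hierarchy above are mechanical enough to sketch in code. The following is a minimal illustration, not an Echinobase tool: the function names and the `GENEX` symbols are made up, and placing the "l" at the end of the symbol is an assumption about how the "-like" designation is written.

```python
import re


def natural_key(symbol):
    """Sort key: numeric chunks compare as numbers and precede letters."""
    chunks = re.findall(r"[0-9]+|[^0-9]+", symbol.lower())
    return [(0, int(c), "") if c[0] in "0123456789" else (1, 0, c)
            for c in chunks]


def pick_symbol(best_matches):
    """Several equally good vertebrate matches: take the first in
    alphanumeric order and lower-case it."""
    return min(best_matches, key=natural_key).lower()


def number_co_orthologs(vertebrate_symbol, count):
    """Several echinoderm co-orthologs of one vertebrate gene: append a
    number after a decimal point."""
    base = vertebrate_symbol.lower()
    if count == 1:
        return [base]
    return [f"{base}.{i}" for i in range(1, count + 1)]


def like_symbol(similar_symbol):
    """Similarity without orthology: mark with "l".
    Assumption: the letter is appended to the end of the symbol."""
    return similar_symbol.lower() + "l"


# Hypothetical placeholder symbols, for illustration only.
print(pick_symbol(["GENEX2", "GENEX10", "GENEX1"]))  # genex1
print(number_co_orthologs("GENEX1", 3))
print(like_symbol("GENEX1"))
```

The natural sort key matters because a plain string sort would place `GENEX10` before `GENEX2`.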
1.1 Gene names
Gene names should have the following characteristics:
- Convey the character or function of the gene.
- Be brief and specific.
- Be lower-case and italicized (acronyms, initialisms, and proper names are the exception and are capitalized).
- Be unique with respect to other named genes in the same species.
- Use American spelling.
- Only contain Latin letters and Arabic numbers.
- Spell out Greek letters: β → beta.
- Change Roman numerals to Arabic equivalents: IV → 4
- Not contain punctuation, except:
  - Where necessary to separate the main part of the name from modifiers.
  - If a comma is part of a protein name.
  - If the ortholog base's gene name contains punctuation.
  - Periods as needed to indicate orthology or likeness information in the nomenclature hierarchy.
- Not inherit potentially misleading details from orthology assignments (when legacy names are retained, this rule is not applied). This includes:
  - Protein information. Example: "59 kDa"
  - Expression information. Example: "kidney-specific"
  - Assay information.
  - Species names. Some old gene names included the species name where the gene was first identified.
- If a gene is a member of an established gene family, the nomenclature should follow the conventions of that family.
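The Greek-letter and Roman-numeral rules above lend themselves to a small normalization helper. This is a minimal sketch under stated assumptions: the function names are hypothetical, the Greek table covers only a few letters, and only upper-case tokens made of I, V and X are treated as Roman numerals.

```python
import re

# Partial table; extend as needed.
GREEK = {"α": "alpha", "β": "beta", "γ": "gamma", "δ": "delta", "κ": "kappa"}
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}
# Assumption: a stand-alone upper-case run of I, V, X is a Roman numeral.
ROMAN_TOKEN = re.compile(r"\b[IVX]+\b")


def roman_to_int(numeral):
    """Convert a Roman numeral built from I, V and X to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[ch]
        # A smaller value before a larger one is subtracted (IV = 4).
        total += -value if ROMAN_VALUES.get(nxt, 0) > value else value
    return total


def normalize_gene_name(name):
    """Spell out Greek letters and replace Roman numerals with Arabic ones."""
    for letter, spelled in GREEK.items():
        name = name.replace(letter, spelled)
    return ROMAN_TOKEN.sub(lambda m: str(roman_to_int(m.group())), name)


print(normalize_gene_name("collagen type IV α 1"))
```

Because single letters such as "I" or "X" are also converted when they stand alone, names in which such a letter is not a numeral would need manual review.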
1.2 Gene symbols
Gene symbols are abbreviated versions of gene names and should have the following characteristics:
- Consist of 10 or fewer characters (optimally 3-5 characters).
- Be lower-case and italicized.
- Be identical to the human gene symbol if the gene name is derived from an orthologous human gene.
- Only contain Latin letters and Arabic numbers, with no spaces or punctuation (co-orthologs are an exception, see 1.c).
- Convert Greek letters to Latin letters as in the HGNC document.
- Change Roman numerals to Arabic equivalents: IV → 4
- Not start with species designators (e.g. Sp, Pm, Lv).
- Be unique and not match common words or abbreviations, which cause problems with database searching (e.g. dna, egta, pcr, pbs, can, and, get ... ).
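Several of the symbol rules above can be checked automatically. The sketch below is illustrative only: the function name is hypothetical, the word list holds just the examples given above, and the co-ortholog exception is assumed to mean a trailing decimal point followed by a number. The species-designator rule is left out because a plain prefix test cannot tell a designator from a legitimate symbol that happens to begin with the same letters.

```python
import re

# Only the examples listed in the guidelines; a real list would be longer.
COMMON_WORDS = {"dna", "egta", "pcr", "pbs", "can", "and", "get"}
# Lower-case letters and digits, optionally followed by ".<number>"
# (assumed form of the co-ortholog exception).
SYMBOL_PATTERN = re.compile(r"[a-z0-9]+(?:\.[0-9]+)?")


def symbol_problems(symbol):
    """Return a list of rule violations; an empty list means none found."""
    problems = []
    # Assumption: the length limit counts every character in the symbol.
    if len(symbol) > 10:
        problems.append("longer than 10 characters")
    if not SYMBOL_PATTERN.fullmatch(symbol):
        problems.append("contains characters other than lower-case "
                        "letters and digits")
    if symbol in COMMON_WORDS:
        problems.append("matches a common word or abbreviation")
    return problems


# Hypothetical symbols, for illustration only.
for candidate in ["genex1", "genex1.2", "Genex-1", "pcr"]:
    print(candidate, symbol_problems(candidate))
```

Uniqueness within a species is not covered here, since it requires a lookup against the existing set of symbols.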
1.3 Echinobase Identifiers
Echinobase will assign a unique identifier to each gene, consisting of a three-letter database designation (ECB), the type of name (e.g. GENE), and eight digits.
Example: S. purpuratus genes are designated in the same way as genes of other species: "ECB-GENE-00000000".
This accession refers to a gene locus at a specific location in the genome. These accessions should remain consistent among genome assembly versions; lift-over tables will be provided in subsequent assembly releases. All sequences related to or derived from this locus (i.e. transcript sequences determined by transcriptome assembly or other means) will receive a numerical suffix. For example:
|SPU_013015 (ECB-GENE-013015)||brachyury gene identifier (GLEAN)|
|SPU_013015.1 (ECB-GENE-013015.1)||Additional or other information has been used to define the gene|
Alternate splice variants will be assigned a lower case letter added to the suffix, where isoform “a” should be the longest isoform identified:
|SPU_013015.1a||one of Qiang Tu’s embryonic RNA-seq splice forms|
|SPU_013015.1b||another of Qiang Tu’s embryonic RNA-seq splice forms|
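The accession layout described in this section (locus, optional numerical suffix, optional isoform letter) can be taken apart with a short parser. This is a minimal sketch; the names `Accession`, `parse_accession` and `format_gene_id` are hypothetical, and the pattern assumes that the locus part itself never contains a period.

```python
import re
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    locus: str                # e.g. "SPU_013015"
    suffix: Optional[int]     # numerical suffix after the period
    isoform: Optional[str]    # lower-case splice-variant letter


ACCESSION = re.compile(
    r"(?P<locus>[^.\s]+)(?:\.(?P<suffix>[0-9]+)(?P<isoform>[a-z])?)?"
)


def parse_accession(text):
    """Split an accession into locus, numerical suffix and isoform letter."""
    match = ACCESSION.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognized accession: {text!r}")
    suffix = match.group("suffix")
    return Accession(
        match.group("locus"),
        int(suffix) if suffix is not None else None,
        match.group("isoform"),
    )


def format_gene_id(number, kind="GENE"):
    """Build an identifier of the form ECB-<type>-<eight digits>."""
    return f"ECB-{kind}-{number:08d}"


print(parse_accession("SPU_013015.1a"))
print(format_gene_id(0))
```

Keeping the suffix and the isoform letter as separate fields makes it straightforward to group all sequences derived from one locus.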
1.4 Gene families
A gene family is a set of paralogous genes formed by duplication of a single ancestral gene. Genes within a gene family usually have similar biological functions.
To designate genes within a gene family, a root word should be assigned. Gene family members should then be assigned unique, increasing numerical identifiers.
Pseudogenes are DNA sequences that are similar in structure to normal genes but are not transcribed into functional RNAs and do not encode functional proteins. Note that genes may be pseudogenized in one species but functional in other species. Pseudogenes should be assigned names and symbols as described for genes, but with "pseudogene" appended to the gene name and "p" appended to the gene symbol. If multiple pseudogenes are present, they should be designated with Arabic numerals.
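The pseudogene rule can be sketched as a small helper. The function name and the example gene are hypothetical, and placing the Arabic numeral directly after "pseudogene" and after "p" is an assumption, since the guidelines do not state where the numeral goes.

```python
def pseudogene_designation(gene_name, gene_symbol, index=None):
    """Return (name, symbol) for a pseudogene of the given gene.

    `index` distinguishes multiple pseudogenes; its position at the end
    of the name and symbol is an assumption.
    """
    name = f"{gene_name} pseudogene"
    symbol = f"{gene_symbol}p"
    if index is not None:
        name = f"{name} {index}"
        symbol = f"{symbol}{index}"
    return name, symbol


# Hypothetical gene name and symbol, for illustration only.
print(pseudogene_designation("examplin", "exmp"))
print(pseudogene_designation("examplin", "exmp", index=2))
```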
2. Transcript nomenclature
RNA names and symbols are the same as those for genes: lowercase, italicized, containing Latin letters and Arabic numbers only.
Alternative splice forms that originate from a single gene are assigned the gene name followed by a “variant” tag with a number. The symbol is designated: genesymbol-v#.
3. Protein nomenclature
Protein designations follow the same rules as gene names and symbols with a few changes as follows:
- Protein names and symbols are not italicized.
- Protein names should have the first letter capitalized.
- The word "protein" or additional terms are not included.
- Protein variants arising from alternatively spliced variants of genes should use the same symbol as the transcript, including the -v and sub-sequence identifiers.
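The transcript and protein conventions in sections 2 and 3 can be illustrated with two short helpers. Both function names are hypothetical and the symbol `genex` is a placeholder; italics are a display matter and are not modeled here.

```python
def transcript_variant_symbol(gene_symbol, variant_number):
    """Symbol for a splice variant, in the form genesymbol-v#."""
    return f"{gene_symbol}-v{variant_number}"


def protein_name(gene_name):
    """Protein name: the gene name with its first letter capitalized."""
    return gene_name[:1].upper() + gene_name[1:]


print(transcript_variant_symbol("genex", 2))  # genex-v2
print(protein_name("brachyury"))              # Brachyury
```

Under the last rule above, a protein variant reuses the transcript variant symbol unchanged, so no separate helper is needed for it.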
4. Non-coding RNA nomenclature
These guidelines address the nomenclature system for transcripts that are not translated into proteins but instead function as RNA molecules. As additional RNA species are discovered, their nomenclature will be discussed by the Echinobase Nomenclature Committee.
RNA symbols should be italicized.
4.1. microRNA (miRNA) genes
Symbols for microRNAs consist of the root symbol “miR” followed by the numbering scheme tracked in miRBase.
4.2. Ribosomal RNA (rRNA) genes
Symbols for genes encoding ribosomal RNAs have the format 'nSrRNA:X', where n denotes the respective rRNA's sedimentation coefficient in Svedberg units and X is an annotation ID used to distinguish gene copies. To refer to the generic gene for each rRNA type, the suffix is omitted.
Example: Generic gene: 18SrRNA; Specific gene: 18SrRNA:CR41548
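The 'nSrRNA:X' format can be recognized with a regular expression. This is a minimal sketch; the function name is hypothetical, and allowing a decimal value for n and any non-space characters in the annotation ID are assumptions.

```python
import re

# n may be a whole or decimal Svedberg value; ":X" is optional.
RRNA_SYMBOL = re.compile(
    r"(?P<svedberg>[0-9]+(?:\.[0-9]+)?)SrRNA(?::(?P<annotation>\S+))?"
)


def parse_rrna_symbol(symbol):
    """Return (svedberg, annotation_id), or None if the format differs.

    annotation_id is None for the generic gene.
    """
    match = RRNA_SYMBOL.fullmatch(symbol)
    if match is None:
        return None
    return match.group("svedberg"), match.group("annotation")


print(parse_rrna_symbol("18SrRNA"))
print(parse_rrna_symbol("18SrRNA:CR41548"))
```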
4.3. Transfer RNA (tRNA) genes
Transfer RNA nomenclature is based on that used at the GtRNAdb (gtrnadb.ucsc.edu). Symbols for genes encoding tRNAs have the format 'tRNA:Xxx-YYY-N-N', where Xxx is the 3-letter amino-acid code; YYY is the anticodon; and N-N is a 2-digit identifying suffix: the first digit is the same for all tRNA genes of a given anticodon that have identical sequence, and the second digit increments for each copy of that sequence in the genome.
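A symbol in the 'tRNA:Xxx-YYY-N-N' format can be split into its parts as follows. This is only a sketch: the function name and the example symbol are hypothetical, the anticodon is assumed to be written with the letters A, C, G and T or U, and each N is allowed to have more than one digit.

```python
import re

TRNA_SYMBOL = re.compile(
    r"tRNA:(?P<amino_acid>[A-Z][a-z]{2})"   # 3-letter amino-acid code
    r"-(?P<anticodon>[ACGTU]{3})"           # anticodon (assumed alphabet)
    r"-(?P<sequence>[0-9]+)"                # shared by identical sequences
    r"-(?P<copy>[0-9]+)"                    # increments per genomic copy
)


def parse_trna_symbol(symbol):
    """Return the parts of a tRNA gene symbol, or None if it does not fit."""
    match = TRNA_SYMBOL.fullmatch(symbol)
    if match is None:
        return None
    return {
        "amino_acid": match.group("amino_acid"),
        "anticodon": match.group("anticodon"),
        "sequence": int(match.group("sequence")),
        "copy": int(match.group("copy")),
    }


# Hypothetical symbol, for illustration only.
print(parse_trna_symbol("tRNA:Ala-AGC-1-2"))
```

Two symbols with the same amino acid, anticodon and first number then denote identical sequences at different places in the genome.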
4.4. Regulatory sequence nomenclature
Regulatory sequences are regions of DNA that affect the transcription of other genes (e.g., enhancers, promoters, and cis-regulatory modules). These sequences can affect multiple genes, and can be distant from the affected gene(s). Thus, it is misleading to name them based on specific genes.
Instead, regulatory sequences are assigned names as follows: SPU_RrX:Y, where Rr stands for "regulatory region", X indicates the chromosome location, and Y is the next number in the series.
Echinobase Nomenclature Steering Committee:
Veronica Hinman (chair)
Please direct all comments or questions to the Echinobase Nomenclature Coordinator, Toms Beatman