FAQs - Gene Names and IDs
Gene Nomenclature and Related ID Questions
Discovery of new genes or analysis of current genes sometimes leads to gene nomenclature questions, and genome-wide studies involving transcripts and proteins require related IDs. Here we list a few of the more common questions. Please address all comments or questions to firstname.lastname@example.org.
Frequently Asked Questions
How are gene names assigned?
In the previous version of Echinobase, the term “gene name” was used to describe what are now referred to as “gene symbols”, the concise abbreviations used for genes, which are then expanded upon in the more descriptive, current “gene names”. Both gene symbols and names are curated through the Echinobase Gene Nomenclature Guidelines.
Gene identifiers are originally assigned by the NCBI annotation pipeline. Every gene model is assigned a symbol derived from existing NCBI manual annotation when possible; other gene models are assigned a symbol of the form LOC######, where ###### is the Entrez ID. Gene symbols and names are updated through the nomenclature guidelines on Echinobase, but the Entrez IDs never change, which makes them a static ID for the gene models that remains the same across data sets and annotations.
Where are the Echinobase Gene Nomenclature Guidelines?
The Echinobase Gene Nomenclature Guidelines can be found here. These guidelines were created to coincide with the Human Gene Nomenclature guidelines and have been approved by the Echinobase Nomenclature Committee (ENC). They serve to provide a robust naming schema for echinoderm genes based upon sequence orthology.
What do the Gene Symbol and Gene Name statuses mean?
There are three levels of gene symbol and name curation implemented across the gene pages, corresponding to the degree of processing that has occurred through the nomenclature pipeline:
Curated: These genes have been processed through the nomenclature pipeline and have had their names and symbols updated to represent evolutionary relationships to their human orthologs, or have retained their NCBI-annotated symbols and names if no orthologs have been identified. Once our biostatisticians complete the orthology pipeline and connect it to the nomenclature pipeline, all genes will have this status.
Partially Curated: These genes are a selection of ~2000 genes of interest to users, including those involved in Gene Regulatory Networks, those that were manually curated in earlier assemblies, and those with additional data associated with them in prior versions of Echinobase. These gene symbols and names are assigned to frequently studied genes until full curation can be completed.
Provisional: These are genes whose symbols and names are those generated by the NCBI Eukaryotic Genome Annotation Pipeline prior to curation through the Echinobase Gene Nomenclature Guidelines. They can be separated into two types: manually curated identities and predicted identities. Manual identities are those matched to RefSeq gene records by NCBI; these have traditional symbols and names associated with them, which are not based on the Echinobase Gene Nomenclature Guidelines. Predicted identities use a generic locus ID as the gene symbol (“LOC” followed by the Entrez ID), with the gene name based on the predicted protein product, according to the NCBI pipeline.
As gene symbols and names change, previous symbols and names will be added to the synonym line on each gene page, to allow users to easily find genes even when their curation status changes and gene symbols and names are updated.
What are synonyms and why aren’t these species-specific?
Synonyms are any and all gene symbols, gene names, or other associated identifiers that have been used historically. This can include use in publications, NCBI annotations, and earlier stages of gene curation. These synonyms are recorded and kept associated with their cognate gene page to facilitate users finding their genes of interest even as their symbols and names are updated through our nomenclature pipeline. Gene pages for any given gene are not specific to a species, and so all species’ synonyms are included here to allow users to readily find the gene page. Details of species orthologs are displayed in columns of the Echinobase ID, Molecules and Genomic sections of the gene page.
There are many gene "names", which one should I use?
Please refer to the current entries for gene symbols and names found on any given gene page. These identifiers are independent of the species supported by Echinobase. If and when these are changed or updated due to further curation, older symbols and names will be retained as additional synonyms, helping users find such genes through the gene search bar. The ID found at the top of each gene page (ECB-GENEPAGE-########) is another static ID that can be used if need be, and it is also independent of the species; additional species-specific ECB-GENE IDs can be found at the top of each species’ column.
To map GFF data to the current gene symbols and names, refer to column 9 of the GFF, which contains "Dbxref=GeneID:" followed by the corresponding Entrez ID number. This same ID is in column 5 of the ExternalGeneRef file available in the Gene Page Reports of the FTP site, and it allows cross-referencing of the current gene symbols and names.
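As a minimal sketch of this cross-reference, the Python below pulls the Entrez ID out of column 9 of a GFF line and indexes ExternalGeneRef rows by column 5. It assumes both files are tab-delimited; the function names and the sample rows are hypothetical and made up for illustration, and nothing is assumed about what the other ExternalGeneRef columns contain.

```python
import csv
import io
import re

# Captures the digits after "GeneID:" in a GFF attribute such as "Dbxref=GeneID:123"
GENE_ID_RE = re.compile(r"GeneID:(\d+)")


def entrez_id_from_gff_line(line):
    """Return the Entrez ID found in column 9 of a GFF line, or None."""
    if line.startswith("#"):
        return None  # comment or directive line
    columns = line.rstrip("\n").split("\t")
    if len(columns) < 9:
        return None
    match = GENE_ID_RE.search(columns[8])  # column 9 is index 8
    return match.group(1) if match else None


def index_external_gene_ref(handle):
    """Index ExternalGeneRef rows by the Entrez ID in column 5.

    Assumption: the file is tab-delimited. Rows whose fifth column is not
    purely numeric (for example a header row) are skipped.
    """
    index = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 5 and row[4].isdigit():
            index.setdefault(row[4], []).append(row)
    return index


# Made-up sample data, not real Echinobase records
sample_gff_line = (
    "chr1\tGnomon\tgene\t100\t900\t.\t+\t.\t"
    "ID=gene-LOC100000001;Dbxref=GeneID:100000001;Name=LOC100000001"
)
sample_ref_text = "c1\tc2\tc3\tc4\tentrez_id\nv1\tv2\tv3\tv4\t100000001\n"

entrez_id = entrez_id_from_gff_line(sample_gff_line)
ref_index = index_external_gene_ref(io.StringIO(sample_ref_text))
matching_rows = ref_index.get(entrez_id, [])
```

Keeping whole rows in the index, rather than picking out a symbol column, avoids guessing at the layout of the report; check the file you download before selecting further columns.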
How do I link locus IDs to Entrez IDs?
Where there is a locus ID (LOC#####), the numbers (#####) are the Entrez ID. However, LOC IDs are not universally assigned; only Entrez IDs are. Gene symbols, gene names, and gene synonyms can have a LOC ID because they were assigned by the NCBI annotation pipeline, which precedes Echinobase curation. For mapping between different data tables, use the numerical strings that constitute the Entrez IDs (if your values are in the “LOC#####” format, you are likely using the wrong column for this purpose).
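If a table only offers LOC-style values, one possible way to recover a join key is to strip the prefix, as in this small Python sketch. The function name is hypothetical, and the example IDs are made up.

```python
import re

# Accepts either a bare numeric Entrez ID or the same digits prefixed with "LOC"
ENTREZ_OR_LOC_RE = re.compile(r"(?:LOC)?(\d+)")


def to_entrez_id(value):
    """Return the numeric Entrez ID for 'LOC123' or '123', else None."""
    match = ENTREZ_OR_LOC_RE.fullmatch(value.strip())
    return match.group(1) if match else None


# Made-up values: a LOC symbol, a bare ID, and a curated symbol
normalized = [to_entrez_id(v) for v in ["LOC100000001", "100000001", "sox2"]]
```

Curated symbols return None here because they do not embed the Entrez ID; for those, take the ID from the Entrez ID column instead.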
How do I link Entrez IDs to mRNAs and alternate transcripts (NM_ and XM_)?
Both are in the GFF. The Entrez IDs are the unique identifiers assigned to the gene models by the NCBI annotation pipeline. The mRNAs are then mapped to the gene models using RefSeq Select. This pipeline assigns NM_ identifiers to curated mRNAs and XM_ identifiers to non-curated transcripts. The Entrez IDs are in column 9 of the GFF, which contains "Dbxref=GeneID:" followed by the corresponding Entrez ID number. For mapping to current Echinobase names, use the Entrez ID in column 5 of the ExternalGeneRef file available in the Gene Page Reports of the FTP site.
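A rough Python sketch of collecting transcripts per Entrez ID from the GFF follows. It assumes NCBI-style GFF3 rows in which mRNA features carry both a "Dbxref=GeneID:" entry and a "transcript_id" attribute in column 9; verify this against your file. The function names and sample rows are hypothetical.

```python
import re
from collections import defaultdict

GENE_ID_RE = re.compile(r"GeneID:(\d+)")


def parse_attributes(column9):
    """Split a GFF3 attribute string ("key=value;key=value") into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


def transcripts_by_entrez_id(gff_lines):
    """Map each Entrez ID to the set of NM_/XM_ accessions on its mRNA rows."""
    mapping = defaultdict(set)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9 or columns[2] != "mRNA":
            continue  # only mRNA feature rows are considered
        attributes = parse_attributes(columns[8])
        gene_match = GENE_ID_RE.search(attributes.get("Dbxref", ""))
        transcript_id = attributes.get("transcript_id", "")
        if gene_match and transcript_id.startswith(("NM_", "XM_")):
            mapping[gene_match.group(1)].add(transcript_id)
    return dict(mapping)


# Made-up rows: one gene with a curated (NM_) and a predicted (XM_) transcript
sample_lines = [
    "##gff-version 3",
    "chr1\tGnomon\tgene\t100\t900\t.\t+\t.\tID=gene-abc;Dbxref=GeneID:100000001",
    "chr1\tGnomon\tmRNA\t100\t900\t.\t+\t.\tID=rna-1;Dbxref=GeneID:100000001,"
    "Genbank:XM_000000001.1;transcript_id=XM_000000001.1",
    "chr1\tBestRefSeq\tmRNA\t100\t900\t.\t+\t.\tID=rna-2;Dbxref=GeneID:100000001,"
    "Genbank:NM_000000001.1;transcript_id=NM_000000001.1",
]
transcript_map = transcripts_by_entrez_id(sample_lines)
```

Using a set per gene means alternate transcripts accumulate under one Entrez ID without duplicates.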
How do I link Entrez IDs to protein IDs (NP_ and XP_)?
Both are in the GFF. The Entrez IDs are the unique identifiers assigned to the gene models by the NCBI annotation pipeline. The proteins are then mapped to the gene models using RefSeq Select. This pipeline assigns NP_ identifiers to curated proteins and XP_ identifiers to the products of non-curated transcripts. The Entrez IDs are in column 9 of the GFF, which contains "Dbxref=GeneID:" followed by the corresponding Entrez ID number. For mapping to current Echinobase names, use the Entrez ID in column 5 of the ExternalGeneRef file available in the Gene Page Reports of the FTP site.
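For proteins, a comparable sketch reads the CDS rows instead. It assumes NCBI-style GFF3 in which CDS features carry a "protein_id" attribute alongside "Dbxref=GeneID:" in column 9; this layout is an assumption to check against your file, and the function name and sample rows are hypothetical.

```python
import re
from collections import defaultdict

GENE_ID_RE = re.compile(r"GeneID:(\d+)")
# Captures an NP_ or XP_ accession from a "protein_id=" attribute
PROTEIN_ID_RE = re.compile(r"(?:^|;)protein_id=((?:NP|XP)_[^;\s]+)")


def proteins_by_entrez_id(gff_lines):
    """Map each Entrez ID to the set of NP_/XP_ accessions on its CDS rows."""
    mapping = defaultdict(set)
    for line in gff_lines:
        columns = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(columns) < 9 or columns[2] != "CDS":
            continue
        gene = GENE_ID_RE.search(columns[8])
        protein = PROTEIN_ID_RE.search(columns[8])
        if gene and protein:
            mapping[gene.group(1)].add(protein.group(1))
    return dict(mapping)


# Made-up rows: two CDS segments of one protein, plus an unrelated exon row
sample_cds_lines = [
    "chr1\tGnomon\tCDS\t150\t300\t.\t+\t0\tID=cds-1;Dbxref=GeneID:100000001,"
    "Genbank:XP_000000001.1;protein_id=XP_000000001.1",
    "chr1\tGnomon\tCDS\t400\t600\t.\t+\t0\tID=cds-1;Dbxref=GeneID:100000001,"
    "Genbank:XP_000000001.1;protein_id=XP_000000001.1",
    "chr1\tGnomon\texon\t100\t300\t.\t+\t.\tID=exon-1;Dbxref=GeneID:100000001",
]
protein_map = proteins_by_entrez_id(sample_cds_lines)
```

A protein encoded across several CDS rows appears once per gene because the accessions are stored in a set.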
I have found a new gene, how do I name it?
Congratulations! Please keep in mind that the symbol and name must be unique among all species, must conform to our gene nomenclature guidelines, and should be informative as to the function or role of the gene. Please forward as much information as possible to the nomenclature coordinator who will communicate with the ENC to assist in assigning the most appropriate symbol and name.
I have a question about orthologs outside of the echinoderm species currently supported on Echinobase. Can you help?
Yes, we can. We routinely work with other model organism databases to collaborate on gene names. If you have any questions about whether two genes in different species are orthologs, or any other ortholog questions, please let us know.
Last Updated: 2020-12-21