Table of Contents
- SPU gene releases
- Transcriptome gene model
- Comparison of SPU and NCBI genes
- Mapping GLEAN genes from 0.5 to 2.1
- Original gene symbol rules
SPU gene releases
During the sea urchin genome project, several groups produced gene predictions using diverse approaches (ab initio, homology-based or empirical). Baylor used the GLEAN methodology to combine those gene sets into 28,944 unique genes. Their structures were derived from the V0.5 genome assembly. At SpBase, we adopted Baylor's GLEAN genes and renamed each GLEAN ID from the form GLEAN3_12345 to SPU_012345. New SPU genes will be added with IDs starting from 030000. This first release only changed the gene IDs from GLEAN and adopted them into SpBase. No real changes to gene structures were made.
i) Sizes of all 28,944 GLEAN proteins are here.
ii) Only a small fraction of scaffolds (16765/187944) have genes.
iii) 1556 SPU genes do not begin with M. Therefore, their protein structures are incomplete.
iv) Two SPU genes do not end with a stop codon: SPU_022205 and SPU_028398.
v) 418 SPU genes have missing amino acids. Among them, 155 have large extended missing regions, whereas 263 have one or a few missing amino acids.
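The checks listed above can be approximated in a few lines of Python. This is only a minimal sketch, not the procedure actually used at SpBase: it assumes each protein is a string in which a trailing `*` marks a translated stop codon and `X` marks a missing amino acid. Both conventions, and the function name `check_protein`, are assumptions made for illustration.

```python
def check_protein(seq):
    """Return a list of problems found in one translated protein sequence.

    Assumed conventions (not taken from the source): '*' at the end marks
    a stop codon, 'X' marks a missing amino acid.
    """
    problems = []
    if not seq.startswith("M"):
        problems.append("no start methionine")
    if not seq.endswith("*"):
        problems.append("no stop codon")
    missing = seq.count("X")
    if missing:
        problems.append(f"{missing} missing amino acid(s)")
    return problems


# Invented toy sequences, one per kind of problem.
examples = {
    "complete": "MKTAYIAK*",
    "no_start": "KTAYIAK*",
    "no_stop": "MKTAYIAK",
    "gappy": "MKXXYIAK*",
}
for name, seq in examples.items():
    print(name, check_protein(seq))
```

Run over the full protein set, the lengths of the three problem lists would correspond to the counts quoted above, provided the input files follow the assumed conventions.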
SPU Release 1 Description
In this release, we merged data from Baylor's original GLEAN releases with the SpBase Annotation database from Sept 10, 2009. We were able to combine structures for 28580 genes, because they were either not modified by the sea urchin community or modified in a proper manner (gff, sequence and protein all consistent). The rest are not included in this release. We also mapped SPU genes from the V0.5 assembly to the V2.1 assembly.
Mapping SPU genes from 0.5 to 2.1
This has been a very challenging task. Notes are here.
What is so difficult about gene mapping?
Unlike ESTs or other sequences, protein-coding genes are subject to many restrictions when they are mapped. They have specific start and stop sequences, so the original sequence needs to be mapped completely to maintain the full protein structure. Also, if the new assembly inserts a single nucleotide within an exon, the exon is no longer valid. In the case of sea urchin, the genome assembly changed substantially between 0.5 and 2.1, which introduced many hard-to-tackle problems. About 1000 genes were on one scaffold in the 0.5 assembly, but were split across multiple scaffolds in 2.1.
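To illustrate why a single inserted nucleotide is enough to invalidate a gene model, here is a small Python sketch. It assumes a CDS is "valid" when it starts with ATG, ends with a stop codon, has a length divisible by three and contains no in-frame internal stop. The function name and the toy sequence are invented for the example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_is_valid(cds):
    """Check the basic constraints a protein-coding sequence must satisfy."""
    if len(cds) % 3 != 0 or not cds.startswith("ATG"):
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[-1] not in STOP_CODONS:
        return False
    # No stop codon may occur before the final codon.
    return not any(codon in STOP_CODONS for codon in codons[:-1])


original = "ATGGCCAAGTTTGGCTAA"                 # ATG GCC AAG TTT GGC TAA
inserted = original[:7] + "T" + original[7:]    # one extra base in the middle

print(cds_is_valid(original))   # True
print(cds_is_valid(inserted))   # False: the length is no longer a multiple of 3
```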
Who is right?
In other cases, genes that were neighbors in the old assembly were sent to two different places. For some genes, not all exons could be mapped, which destroyed the existing protein structure. To add to our trouble, 34 pairs or larger groups of genes had identical CDS structures. They often ended up being mapped to the same location in the new assembly (correctly or incorrectly). This situation can only be fixed manually. Here I am listing the maps of all SPU genes divided into four categories representing the quality of the matches. Only 9 genes could not be mapped at all by the automated process (SPU_025964, SPU_024420, SPU_023769, SPU_023635, SPU_020626, SPU_016058, SPU_015944, SPU_010332, SPU_006355), but I have not tried manual matches yet.
Out of 28944 SPU genes, 13724 mapped perfectly. Their entire intron structure and all exons were identical in 0.5 and 2.1 versions. They can be downloaded from the following link. Download
4495 genes had exactly the same CDS maps, but their introns were different. Mostly the differences amount to a few nucleotides, but a small subset of genes had very large differences. In the case of SPU_018483, the original intron size was 78467 nt, whereas the 2.1 intron size was 147485 nt. The entire set of 18219 genes from category 1 and category 2 can be downloaded from here. Download
For a third set of 8224 genes, some minor variations were observed in CDS structures as well. The differences ranged from a single SNP in the gene structure to multiple SNPs or small insertions. Download. We also performed Clustal matches between old and new structures for each of these genes for anyone interested.
The remaining 2501 genes either were split across multiple scaffolds or had large exon(s) missing. Only 9 genes could not be mapped at all. This set is listed here. Download A gene split across two scaffolds can be resolved in two ways: breaking the gene into two genes, or joining the scaffolds. This requires a new assembly and some manual intervention.
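The four categories above can be expressed as a simple decision procedure. The Python sketch below is an illustration under stated assumptions, not the code used to produce the downloadable lists: each gene model is assumed to be a list of exons sorted along the forward strand, and the `Exon` record, the function names and the toy coordinates are all hypothetical.

```python
from collections import namedtuple

# Hypothetical exon record: scaffold name, 1-based inclusive coordinates, sequence.
Exon = namedtuple("Exon", "scaffold start end seq")


def intron_lengths(exons):
    """Lengths of the gaps between consecutive exons."""
    return [b.start - a.end - 1 for a, b in zip(exons, exons[1:])]


def classify(old, new):
    """Assign a mapped gene to category 1-4 as described in the text."""
    scaffolds = {e.scaffold for e in new}
    # Category 4: unmapped, exons missing, or exons spread over several scaffolds.
    if len(new) < len(old) or len(scaffolds) != 1:
        return 4
    # Category 3: the coding sequence itself differs (SNPs, small insertions).
    if "".join(e.seq for e in old) != "".join(e.seq for e in new):
        return 3
    # Category 1: identical exons and identical intron structure.
    same_exons = [e.seq for e in old] == [e.seq for e in new]
    if same_exons and intron_lengths(old) == intron_lengths(new):
        return 1
    # Category 2: same CDS, different introns.
    return 2


old = [Exon("old_scaf", 1, 6, "ATGGCC"), Exon("old_scaf", 107, 112, "AAGTAA")]
same = [Exon("new_scaf", 51, 56, "ATGGCC"), Exon("new_scaf", 157, 162, "AAGTAA")]
longer_intron = [Exon("new_scaf", 51, 56, "ATGGCC"), Exon("new_scaf", 207, 212, "AAGTAA")]
snp = [Exon("new_scaf", 51, 56, "ATGGCC"), Exon("new_scaf", 157, 162, "AAATAA")]
split = [Exon("new_scaf_a", 51, 56, "ATGGCC"), Exon("new_scaf_b", 9, 14, "AAGTAA")]

for name, model in [("same", same), ("longer_intron", longer_intron),
                    ("snp", snp), ("split", split), ("unmapped", [])]:
    print(name, classify(old, model))
```

Unmapped genes fall into category 4 here, which matches the way the 9 unmapped genes are counted within the set of 2501.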
Things to do
The last set of 2501 genes, including the 9 unmapped genes, may be mapped manually.
i) 87 SPU groups have identical genes. Here are suggestions on how to deprecate some of them.
ii) 397 sets of genes are similar and may see some deprecation.
iii) Many GLEAN genes are fragments of other GLEAN genes. Here is a catalog of them. The page lists:
a) Protein pairs where one is identical to fragment of another
b) Current annotation of both from SpBase's latest version (release 4)
c) Clustal alignment of the protein pair to show how similar they are. All histone-related genes are removed from above list, because they are cluttering the table.
d) Neighborhoods in 0.5 and 2.1
e) Possible assembly changes - use EST and other data to resolve these issues.
f) Multiple genes mapping to the same location. Did we take care of some duplication, etc. The issue of genes with identical CDS structures, or sub-nested genes is not fully resolved.
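A catalog of "protein is an exact fragment of another protein" pairs, as in item a) above, could be built with a plain substring search. The Python sketch below is a naive all-against-all illustration with invented identifiers and sequences. It is not the method used to build the SpBase catalog, and for tens of thousands of proteins a faster index would probably be needed.

```python
def find_fragment_pairs(proteins):
    """Return (fragment_id, container_id) pairs where one protein is an
    exact, strictly shorter substring of another.

    `proteins` maps a gene ID to its protein sequence.
    """
    by_length = sorted(proteins.items(), key=lambda item: len(item[1]))
    pairs = []
    for i, (short_id, short_seq) in enumerate(by_length):
        # Only longer (or equal-length) proteins can contain this one.
        for long_id, long_seq in by_length[i + 1:]:
            if len(long_seq) > len(short_seq) and short_seq in long_seq:
                pairs.append((short_id, long_id))
    return pairs


# Hypothetical IDs and sequences, for illustration only.
proteins = {
    "geneA": "MKTAYIAKQR",
    "geneB": "TAYIAK",
    "geneC": "MLLPS",
}
print(find_fragment_pairs(proteins))   # [('geneB', 'geneA')]
```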
SPU Release 2 Description
A little over 21,000 of the 28,944 GLEAN3 gene models have been assigned names. The last increment has been inferred from electronic annotation (IEA) by Ung-jin Kim. Using a combination of BLAST alignments to a non-redundant protein database, matches to conserved domain data and comparison to orthologs, a putative gene identity is assigned. We have annotated more than 7,000 additional gene models by this method. Roughly 40% have been identified to be homologs of known genes, 30% assigned to a "hypothetical" protein group, 20% assigned to anonymous proteins named after their structural motifs, human open reading frame number or other arbitrary cDNA identifiers. Approximately 3% of the newly examined gene models have no convincing match. The increment of annotations must still be considered tentative since we have not verified them with expressed sequence evidence.
Along with the manual examination of the remaining GLEAN gene models, we reduced the redundancy in the official gene set. If a coding sequence is perfectly matched between two models, or nearly identical including the 3’ UTR, we tag these as duplicates and remove the details from one of the pair, leaving a comment to indicate a duplicate. To facilitate this process, we have added all of the 3’ UTR sequences identified by whole-genome tiling array.
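The exact-match part of this duplicate tagging can be sketched by grouping gene IDs on their coding sequence. The Python example below uses invented identifiers and sequences and only covers perfect matches. The "nearly identical" case described above would need an alignment step that is not shown.

```python
from collections import defaultdict


def identical_sets(sequences):
    """Group gene IDs whose sequences are identical (case-insensitive).

    `sequences` maps a gene ID to its CDS (or protein) sequence.
    Returns only the groups with more than one member.
    """
    groups = defaultdict(list)
    for gene_id, seq in sequences.items():
        groups[seq.upper()].append(gene_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical IDs and sequences, for illustration only.
cds = {
    "geneA": "ATGAAATAA",
    "geneB": "atgaaataa",
    "geneC": "ATGCCCTAA",
}
print(identical_sets(cds))   # [['geneA', 'geneB']]
```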
4,942 gene models have been linked to corresponding articles in PubMed.
SPU Release 3 Description
In this release, we added the genes that were modified by annotators in an inconsistent manner (CDS, gff, protein, no stop codon). Two sets of files are submitted: annotation files (comment_table, gene_express_table, gene_info_table, gene_reference_table, go_table) and structure files. Annotation files are carried over from Autumn's database, except that 'gene_express_table' is updated with the new SPU_03... names. Structure files are sequence and assembly dependent and need to be reconstructed from various gff and fasta files.
21 genes had structures modified by the annotators, so their 3' UTRs added by Manoj were removed: SPU_011042, SPU_014765, SPU_004517, SPU_007092, SPU_000595, SPU_012211, SPU_018861, SPU_022187, SPU_025703, SPU_009218, SPU_020752, SPU_010829, SPU_018505, SPU_002050, SPU_014332, SPU_001877, SPU_019990, SPU_006579, SPU_007435, SPU_008058, SPU_007964.
The following genes do not have SPU_03... names: sema5629, Sp-Col805b, Sp-Sema6.
'sema5629' - sema5629_geneid is in the protein table of the Freeze5 annotation database, but not in the gene table. Some annotator submitted it without nucleotide coordinates or sequence. It matches two neighboring GLEAN genes (SPU_008673 and SPU_008674).
'Sp-Col805b' - some annotator submitted a protein sequence that is in the protein table, but the protein sequence does not match SPU_014619. Someone else annotated SPU_014619 as Sp-Col805b, but I am not sure whether those were done by the same annotator, or what we should do in this situation.
'Sp-Sema6' - this is the best among the three, because it BLAST matches very well with SPU_008673. However, SPU_008673 has two Sp-Sema6s in it.
Structural information from Freeze5 was all added in this release, irrespective of whether the CDS sequence, protein sequence and gff files were all completed properly by the community annotators during the sequencing project.
SPU Release 4 Description
Include all ESTs as GLEAN genes and TUs. This will produce some 5' UTRs as well.
Transcriptome gene model
A comprehensive transcriptome analysis has been performed on protein-coding RNAs of Strongylocentrotus purpuratus, including 10 different embryonic stages, 6 feeding larval and metamorphosed juvenile stages, and 6 adult tissues. To generate a more complete set of gene models, we pooled the transcriptomes from all these sources. The genome had initially been annotated using computational gene model prediction algorithms, and a large fraction of these predicted genes were recovered in the transcriptome when the reads were mapped to the genome. However, we discovered that over half of the computational gene model predictions were erroneous in one respect or another, including missing exons, prediction of non-existent exons, erroneous intron/exon boundaries, fusion of adjacent genes, and prediction of multiple genes from what turn out to be single genes. The transcriptome data have been used to provide a systematic upgrade of the gene model predictions throughout the genome, and the new gene models have been incorporated into the genome web information system. This probably represents the most accurate collection of gene sequences among the purple sea urchin sequence collections, including the genes deduced from the genome sequence. The individual genes are included in the gene information sequence data and the genome viewer. The protein and nucleotide sequences can be downloaded here.
Comparison of SPU and NCBI genes
The comparison results between SPU and NCBI sets can be downloaded from the following link. www.echinobase.org/Echinobase/glean/GLEAN-NCBI-map/glean-NCBI-map.txt
The first column provides the name of the alignment file between two sets (see below), if we performed the alignment. The second column gives the SPU ID and the third column provides the NCBI ID.
The details of the above calculation are given below.
Historically, during the sea urchin genome project, three groups - NCBI, Baylor and Angerer - made gene predictions from the genome using various methods. Partial gene models could also be derived using EST and cDNA sequence data. The GLEAN approach was used to consolidate the various predictions into one unified set, which the sea urchin community then used for further annotation. NCBI still uses its own prediction set to present genes, and this produces confusion among user communities. In this effort, we tried to link the NCBI and GLEAN prediction sets.
GLEAN initially had 28944 proteins. After manual curation, about a hundred proteins were added. All GLEAN genes were carried over to SpBase with SPU names. The NCBI set has 25,368 proteins from Strongylocentrotus purpuratus (www.echinobase.org/Echinobase/glean/GLEAN-NCBI-map/NR-sea-urchin.fasta) and 231 proteins from other sea urchins (www.echinobase.org/Echinobase/glean/GLEAN-NCBI-map/NR-oth-urchin.fasta). Please note that the NCBI set keeps many genes with identical protein sequences under the same ID.
Here (www.echinobase.org/Echinobase/glean/GLEAN-NCBI-map/glean-NCBI-map.txt) is the result of the comparison between the GLEAN and NCBI sets. The second column gives the names of GLEAN (SPU) genes and the third column gives NCBI IDs. If a protein has multiple NCBI IDs, the file only includes the primary one.
We also performed clustal alignments between NCBI and GLEAN genes, and those alignments can be seen from www.echinobase.org/Echinobase/glean/GLEAN-NCBI-map/clustal/s20260.aln, where 's20260.aln' is the first column of the above file. For example, s1.aln is available here (www.echinobase.org/Echinobase/glean/GLEAN-NCBI-map/clustal/s1.aln), and so on.
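For readers who want to process the comparison file programmatically, the following Python sketch parses one line and builds the corresponding alignment URL. The column order (alignment name, SPU ID, NCBI ID) and the base URL come from the description above. The tab delimiter, the empty first field for rows without an alignment, and the sample values are assumptions that should be checked against the actual file.

```python
# Base location of the alignment files, as given in the text.
CLUSTAL_BASE = "www.echinobase.org/Echinobase/glean/GLEAN-NCBI-map/clustal/"


def parse_map_line(line):
    """Parse one line of glean-NCBI-map.txt into a dictionary.

    Assumes tab-separated columns: alignment name, SPU ID, NCBI ID.
    Returns None for lines with fewer than three columns.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        return None
    aln_name, spu_id, ncbi_id = (field.strip() for field in fields[:3])
    url = None
    if aln_name:
        # Accept the name with or without the '.aln' suffix.
        if not aln_name.endswith(".aln"):
            aln_name += ".aln"
        url = CLUSTAL_BASE + aln_name
    return {"spu": spu_id, "ncbi": ncbi_id, "alignment_url": url}


# Invented sample lines; the IDs are placeholders, not real records.
print(parse_map_line("s1\tSPU_EXAMPLE\tNCBI_EXAMPLE\n"))
print(parse_map_line("\tSPU_EXAMPLE2\tNCBI_EXAMPLE2\n"))
```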
Two subsets of the original comparison file are particularly interesting. They are described below:
After comparison, we found that 3287 NCBI genes had exact matches with GLEAN set. They are listed here (www.echinobase.org/Echinobase/glean/GLEAN-NCBI-map/glean-NCBI-exact.txt).
For another subset of 6833 genes, either the GLEAN gene is completely included within the NCBI gene, or the NCBI gene is completely included within the GLEAN gene. They are listed here (www.echinobase.org/Echinobase/glean/GLEAN-NCBI-map/glean-NCBI-partial.txt) with a note about which one is included in which. You will find that only 1365 GLEAN genes are within NCBI genes, whereas 5468 NCBI genes are part of GLEAN genes. This is not surprising, because the GLEAN algorithm accumulated various gene fragments from different predictions to create larger genes. We have seen in other organisms that GLEAN produced very large genes (not real) because of this algorithmic defect.
Mapping GLEAN genes from 0.5 to 2.1
Mapping of GLEAN genes to 2.1 assembly
GLEAN3 genes (or SPU genes) are the set of 28,944 high-confidence genes computationally derived from the sea urchin genome sequence. About 1/3 of them were manually annotated by the sea urchin community. The gene models were originally derived and annotated with reference to the 0.5 genome assembly. Here we derive their corresponding locations on the 2.1 assembly using a number of computational approaches. The mapping procedure is described below.
The large amount of sequence difference between the assemblies makes gene mapping particularly difficult in the case of sea urchin. This was previously seen when mapping the tiling array probes: about 10% of the tiling probes from the 0.5 assembly did not find corresponding locations on the 2.1 assembly. Therefore, three approaches with varying degrees of accuracy were used to determine gene locations on the new assembly.
(i) BLAST followed by more careful search
In the first approach, BLAST was used to determine the scaffold match of a gene, and then the exact coordinates of the exons within the scaffold were determined using Perl pattern matching. Among multiple scaffolds with BLAST hits, the one that matched the highest number of exons was taken. This approach has the advantage of being very accurate and finding the exact location of each exon. On the other hand, if an exon is even partly modified in the new assembly, this method will fail to locate it.
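The exact-match step of this first approach can be sketched as follows. The original work used Perl pattern matching. The version below is a Python stand-in with hypothetical function names and toy sequences, and it assumes the BLAST step has already narrowed the search to a handful of candidate scaffolds held in memory as strings.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def locate_exon(exon, scaffold):
    """Return (start, end, strand) of the first exact match, 1-based
    inclusive, or None if the exon does not occur verbatim."""
    pos = scaffold.find(exon)
    if pos != -1:
        return (pos + 1, pos + len(exon), "+")
    pos = scaffold.find(reverse_complement(exon))
    if pos != -1:
        return (pos + 1, pos + len(exon), "-")
    return None


def best_scaffold(exons, candidates):
    """Among candidate scaffolds (name -> sequence), pick the one on which
    the largest number of exons match exactly."""
    def hits(name):
        return sum(locate_exon(e, candidates[name]) is not None for e in exons)
    return max(candidates, key=hits)


scaffold = "TTTATGGCCAAGGGG"
print(locate_exon("ATGGCC", scaffold))   # (4, 9, '+')
print(locate_exon("CCTTGG", scaffold))   # (8, 13, '-')
print(locate_exon("ATGGAC", scaffold))   # None: one changed base defeats the match
```

The last call shows the limitation noted above: a single modified base in the new assembly is enough for the exact search to miss the exon.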
(ii) Finding coordinates from large-scale alignment of 0.5-2.1 assemblies
In this approach, a genome-wide alignment between the 0.5 and 2.1 assemblies was used to determine the location in the 2.1 genome corresponding to each exon from 0.5. This approach can determine the locations of exons that were partially modified in the new assembly. However, because the large-scale alignment was made to find only unique correspondences between the two genomes, this approach could not map genes that were located in repetitive regions of the genome.
(iii) BLAST only
In the third approach, genome-wide BLAST was used to locate a gene on the 2.1 assembly. Exon coordinates were extracted from these BLAST results. This approach is the least accurate, but it is very good at locating any gene or fragment of a gene.
The derived coordinates of a gene in the new genome were used to extract its CDS sequence. This sequence was aligned with the existing gene model from the 0.5 genome using clustalw, and only those with over 75% match were kept. A total of 27,862 gene models found a match in the new genome. Subsequently, each new gene model was compared with the previous one based on the number of exons, the total length of the CDS sequence, the lengths of the exons and the distances between the exons. The comparison was done with the premise that any difference in those qualifiers is more likely to be an error in mapping than a true difference between the genomes. The genes were divided into three grades based on the extent of the match with respect to the above qualifiers. A total of 18,551 genes were in the topmost grade and almost perfectly matched the detailed qualifiers in the 0.5 genome. 3,569 genes were in the second grade. They often had one missing exon or a larger or smaller distance between the exons than in 0.5. 5,742 genes were in the third grade and showed large differences from 0.5 based on many criteria. All exact matches passed the exon count test except the following genes, which had different exon counts between 0.5 and 2.1. The second column is 0.5 and the third column is 2.1.
Among the above, all except one have one exon split by the new assembly; only for SPU_016358 did the exon get merged. Only SPU_015257 had two hits on the 2.1 genome - one with 6 exons like before and another with 7 exons. The one with 7 exons had a one-base insertion within one of the middle exons. They are most likely the same scaffold made into two due to assembly overfitting. The gene has only one representation in V0.5.
Some factoids: In the 0.5 genome, 34 sets of GLEAN genes have identical CDS sequences, 33 sets of GLEAN genes have identical mRNA sequences, and 87 sets of GLEAN genes have identical protein sequences.
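The 75% identity filter and the qualifier comparison described in this section can be sketched in Python as below. Several things here are assumptions made for illustration: identity is computed over the columns of an already aligned pair of rows (for example, rows taken from a clustalw output), exon models are lists of sorted (start, end) pairs, and all function names are hypothetical. The thresholds that separate the three grades are not stated in the text, so the sketch only reports which qualifiers differ.

```python
def percent_identity(aligned_old, aligned_new):
    """Percent of alignment columns in which both rows carry the same base.

    Both arguments are rows of one pairwise alignment, with '-' for gaps.
    """
    if len(aligned_old) != len(aligned_new) or not aligned_old:
        raise ValueError("aligned rows must be non-empty and of equal length")
    matches = sum(a == b and a != "-" for a, b in zip(aligned_old, aligned_new))
    return 100.0 * matches / len(aligned_old)


def qualifiers(exons):
    """Summarise a model given as sorted (start, end) pairs, 1-based inclusive."""
    lengths = [end - start + 1 for start, end in exons]
    distances = [b[0] - a[1] - 1 for a, b in zip(exons, exons[1:])]
    return {
        "exon_count": len(exons),
        "cds_length": sum(lengths),
        "exon_lengths": lengths,
        "exon_distances": distances,
    }


def differing_qualifiers(old_exons, new_exons):
    """Names of the qualifiers that differ between the 0.5 and 2.1 models."""
    old_q, new_q = qualifiers(old_exons), qualifiers(new_exons)
    return [name for name in old_q if old_q[name] != new_q[name]]


old_model = [(1, 6), (107, 112)]
new_model = [(501, 506), (657, 662)]

identity = percent_identity("ATGGCC-AAG", "ATGGCCTAAG")
print(identity, identity > 75)                       # 90.0 True
print(differing_qualifiers(old_model, new_model))    # ['exon_distances']
```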
Original gene symbol rules
After looking at the various schemes used in other sequenced genomes and talking to Paul Sternberg of WormBase, I propose a scheme for gene names and gene symbols to be used in the annotation of the purple sea urchin genome. The gene symbol uses the specification of the mouse nomenclature with the addition of a sea urchin identifier, e.g., Sp-Otx. The specification is described on the Mouse Genome Informatics Web Site, The Jackson Laboratory, Bar Harbor, Maine, in particular on the nomenclature page “Quick Guide to Nomenclature for Genes” (URL: http://www.informatics.jax.org/mgihome/nomen/short_gene.shtml [August, 2005]). The symbol description below is cited as coming from Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT, and the members of the Mouse Genome Database Group. 2003. MGD: The Mouse Genome Database. Nucleic Acids Res 31: 193-195.
1) Symbols are 3-5 characters, add additional characters as necessary up to 10.
2) Symbols begin with an uppercase letter followed by all lowercase letters.
3) Use punctuation only to separate two adjacent numbers (e.g., Lamb1-2) or to designate related sequences (e.g., Es10-rs1) and pseudogenes (e.g., Adh5-ps1).
4) Gene Family members: Use common stem (or root) symbol (e.g., see Cldn#).
5) If there is a homolog in vertebrates try to use the vertebrate symbol and any family convention used in vertebrates.
6) For single homologous genes from invertebrates not found in vertebrates use the original symbol from that species but include the word homolog at the end of the name followed by the name of the species in parentheses (e.g., symbol: Cdc20; name: cell division cycle 20 homolog (S. cerevisiae)).
7) If there is more than one sea urchin homolog for the invertebrate gene, assign the serial number after the word "homolog" (e.g., symbols: Atoh1 and Atoh2; names: atonal homolog 1 (Drosophila) and atonal homolog 2 (Drosophila) respectively.)
8) If the invertebrate/prokaryotic gene is similar to the sea urchin gene but is not determined to be a homolog, use the letter "l" to denote "-like" designations (e.g., symbol: Ash2l; name: ash2 (absent, small, or homeotic)-like (Drosophila)).
Gene names
1) Should be brief and specific.
2) Should convey the character or function of the gene.
3) In most cases there will be a usable name from the homolog.
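The formal parts of the symbol rules (length, capitalization, restricted punctuation and the Sp- identifier) lend themselves to an automatic check. The Python sketch below is a loose approximation. The regular expression, the decision to count the hyphen toward the length limit and the function name are assumptions, and the rules about homologs and gene families cannot be checked this way.

```python
import re

# One uppercase letter, then lowercase letters or digits, with optional
# hyphen-separated suffixes such as "-2", "-rs1" or "-ps1".
SYMBOL_RE = re.compile(r"[A-Z][a-z0-9]*(?:-[a-z0-9]+)*")


def is_valid_symbol(symbol, prefix="Sp-"):
    """Loosely check a sea urchin gene symbol such as 'Sp-Otx'."""
    if not symbol.startswith(prefix):
        return False
    core = symbol[len(prefix):]
    # Rule 1: 3-5 characters, extended up to 10 where necessary.
    if not 3 <= len(core) <= 10:
        return False
    return SYMBOL_RE.fullmatch(core) is not None


for candidate in ["Sp-Otx", "Sp-Sema6", "Sp-Lamb1-2", "Sp-otx", "Otx", "Sp-Ab"]:
    print(candidate, is_valid_symbol(candidate))
```

A check of this kind could flag entries such as sema5629, which lacks both the Sp- identifier and the leading uppercase letter.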