Genome Assemblies

Table of Contents

  1. Assembly_2.1(Spur_2.1)
  2. Assembly_2.5(Spur_2.5)
  3. Assembly_2.6(Spur_2.6)
  4. Assembly_3.1(Spur_3.1)
  5. Difference between 0.5, 2.0 and 2.1

Assembly_0.5(Spur_0.5)

Spur_0.5 is a preliminary assembly of the California purple sea urchin, S. purpuratus, built from whole genome shotgun (WGS) reads with the Atlas genome assembly system at the Baylor College of Medicine Human Genome Sequencing Center. The products of the Atlas assembler are a set of contigs and scaffolds. The total length of all contigs greater than 1 kb is 768 Mb, the N50 of the contigs larger than 1 kb is 10.18 kb, and the N50 of the scaffolds is 47.98 kb. The total span of the assembly is 1.13 Gb, which is 240 Mb larger than the estimated genome size. The sequence coverage is 6.2X.

A preliminary examination showed that over 90% of the sequences in other available sea urchin sequence data sets (Unigene clusters) are represented in the Spur_0.5 assembly.

Comparison to 25 NCBI HTGS_PHASE2 BACs (2.9 Mb in total) revealed several types of inconsistency. Several cases of short non-merging overlaps were observed, most at the tails of scaffolded contigs; these may be due to polymorphism such that the merging criteria were not met. Several short contigs were found aligning in the middle of long alignment gaps of large scaffolded contigs (7 cases). These large gaps come from scaffolding with only short (2-6 kb) and large (50 kb, 150 kb) inserts but no middle-sized (10-15 kb) inserts, resulting in unfilled large gaps and artificial expansion of the total sequence size in the super contigs. Other minor inconsistencies included three cases of differences between genome contigs and PHASE2 BACs, and two possible misjoins. Checking the three contigs in detail did not identify misassemblies. One possible misjoin is in a repeat region, and the other is a possible local misordering of a short 2 kb contig in the middle of a long scaffolded contig.
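
The N50 values quoted for the assemblies on this page can be reproduced from a list of contig or scaffold lengths. The following is a minimal sketch, assuming the common definition of N50 (the length L such that sequences of length L or longer contain at least half of the total bases); the function name and the toy lengths are illustrative and are not taken from the Atlas pipeline.

```python
def n50(lengths, min_length=0):
    """Return the N50 of the given sequence lengths.

    Only sequences of at least min_length are considered, mirroring
    statements such as "N50 of the contigs larger than 1 kb".
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no sequences at or above min_length")
    half = sum(kept) / 2
    running = 0
    for n in kept:
        running += n
        if running >= half:
            return n


# Toy contig lengths in bp (hypothetical, not Spur_0.5 data).
contigs = [500, 1200, 3000, 8000, 10000, 15000]
print(n50(contigs, min_length=1000))
```

With the toy lengths above, the 500 bp contig is filtered out and the two longest contigs already cover more than half of the remaining bases, so the reported N50 is 10000.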


Assembly_2.1(Spur_2.1)

Spur_2.1 combines BAC reads and WGS reads and utilizes BAC tiling path information. Contaminations identified in Spur_2.0 were removed. Compared to previous assembly releases, Spur_2.1 is more continuous and has fewer false duplications.

The Spur_2.1 release was assembled from 2-fold average coverage in sequence reads from Bacterial Artificial Chromosomes (BAC) and 6-fold coverage in Whole Genome Shotgun (WGS) with the HGSC Atlas-2.0 genome assembly system at Baylor College of Medicine. The BAC reads were produced by the Clone-Array Pooled Shotgun Sequencing method (CAPSS) from BAC clones selected based on a minimal FingerPrinted Contigs (FPC) tiling path. In CAPSS, pooled BAC reads are assigned to individual BACs by deconvolution. Each BAC assembly was enriched with WGS reads that overlap with the individual BAC reads. The mixed read sets were assembled locally with Atlas. Sets of overlapping BAC clones were identified based on shared WGS reads and sequence overlaps. The overlapping enriched BACs were then merged together to form the backbone of the genome assembly. The merged BAC assemblies were further scaffolded using information from mate pairs, BAC clone vector locations, and BAC tiling path information. Finally, contigs from the WGS assembly Spur_0.5 were used to fill gaps in the BAC assembly to produce the Spur_2.0 release. Extensive contamination analysis was done on the Spur_2.0 release. The Spur_2.1 release was produced by removing contaminated sequences from the Spur_v2.0 release.

The Spur_2.1 release includes a set of contigs (continuous blocks of sequence) and scaffolds. Scaffolds include sequence contigs that can be ordered and oriented with respect to each other (multi-contig scaffolds) as well as contigs that could not be linked (single-contig scaffolds or singletons). The N50 of the scaffolds associated with BACs is 216 kb. The N50 of all scaffolds is 142 kb. The total length of all contigs greater than 1 kb is 804 Mb. When the gaps between contigs in scaffolds are included, the total span of the assembly is 907 Mb. The estimated size of the genome based on the assembly is 814 Mb.

The Spur_2.1 assembly was compared with other available sea urchin sequence data (ESTs, Unigene clusters) to determine the extent of coverage (completeness). A preliminary examination showed that over 90% of the sequences in this data set are represented, indicating that the shotgun libraries used to sequence the genome were comprehensive. Typical errors in draft genome sequences include misassemblies of repeat sequences, collapses of repeat regions, and artificial duplications in polymorphic regions. However, base accuracy in contigs is usually very high, with most errors near the ends of contigs. These data can be downloaded from

ftp://ftp.hgsc.bcm.edu/Spurpuratus/fasta/Spur_v2.1/


Assembly_2.5(Spur_2.5)

README for Genome Sequence of Strongylocentrotus purpuratus, Spur_2.5 (February 11, 2010)

1. What's New
2. Introduction
3. Description of files
4. Sequence and Scaffold statistics
5. Read statistics
6. History

1. What's New

This is the release (Spur_v2.5) of the upgraded genome assembly of sea urchin Strongylocentrotus purpuratus using additional ABI SOLiD sequence for superscaffolding and gap filling. The scaffold N50 increased by 43kb and the contig N50 increased by 2kb.

2. Introduction

The draft assembly may contain errors so users should exercise caution. Typical errors in draft genome sequences include misassemblies of repeat sequences, collapses of repeat regions, and artificial duplications in highly polymorphic regions. However, base accuracy in contigs is usually very high, with most errors near the ends of contigs.

Additional sequence coverage generated using the ABI SOLiD technology from small insert fragments was incorporated into this version of the assembly. The paired-end reads were used for superscaffolding and intra-scaffold gap filling. The assembly contains new scaffolds formed by merging scaffolds, and scaffolds in which some intra-scaffold gaps are filled by other scaffolds or contigs. As a result, the scaffold N50 increased to 166kb and the contig N50 increased by 2kb. The additional sequence is ~18x genome coverage and the reads have an average insert size of ~1.5kb. About 500 million reads have a read length of 25bp, and 46 million reads have a read length of 50bp. Out of the total ~273 million clones, 30 million have both ends uniquely mapped. The 13% of these uniquely mapping pairs (4 million) that link two different scaffolds were used to upgrade the assembly with a recently developed scaffolding algorithm.

Comparison to the 17,461 available S. purpuratus Unigene sequences from NCBI showed the genome assembly is nearly complete (see section 4). The number of Unigene contigs aligning over 95% or more of their length and the number of Unigene contigs aligning over 80% or more of their length both increased by 0.1%. Comparison to the SOLiD sequencing reads (see read statistics, section 5) confirmed the quality of the assembly. Over 99.8% of the pairs of uniquely mapping reads within scaffolds were correctly oriented. Over 99.6% of the pairs of uniquely mapping reads within scaffolds had insert sizes of 5kb or less.

3. Description of files

The files can be found on the HGSC ftp site
ftp://ftp.hgsc.bcm.edu/Spurpuratus/fasta/Spur_v2.5/
This directory has 3 files for assembled contigs in the genome; there are no chromosome assignments for the contigs in Spur_v2.5. The .gz files are compressed with gzip.
Spur2.5.AGP (AGP file)
Spur2.5.contig.fa (contig fa)
Spur2.5.contig.fa.qual (contig quality)

I. The AGP file describes how to combine the individual contigs to create the linearized genome sequences in the LinearScaffolds directory.

II. linearScaffolds/ directory
This directory has 1 fasta file and 1 quality file compressed with gzip.
Spur2.5.linearScaffold.fa (linear scaffold sequence)
Spur2.5.linearScaffold.fa.qual (linear scaffold sequence quality)
The fasta formatted sequence file (Spur2.5.AGP.linearScaffold.fa) is for linearized scaffolds where the gaps between adjacent contigs within a scaffold are filled with 'N's and the captured gap size is estimated from the clone insert size. Each scaffold is a separate sequence within the files.

III. unassembled/ directory
This directory has 2 files which have not changed from the previous assembly, Spur_v2.1.
Spur_v2.1.unassembled.reads.fa.gz (unassembled reads fasta file)
Spur_v2.1.unassembled.reads.fa.qual.gz (unassembled reads quality file)
The unassembled reads files contain reads that were not used in the assembly.

4. Sequence and Scaffold statistics before and after upgrade

Assembly  Type       Number   N50 (bp)  Bases+Gaps (bp)  Bases (bp)
Spur_2.5  Scaffolds  77,726   166,504   919,498,871      813,404,257
Spur_2.1  Scaffolds  114,222  123,485   907,070,087      810,023,010
Spur_2.5  Contigs    198,392  11,500    813,404,257      813,404,257

Alignment of scaffolds to Unigene* contigs before and after upgrade

Alignment length       100%    95%     80%     50%
Scaffolds Aligning**   89.30%  99.10%  99.80%  99.90%
Scaffolds Aligning***  89.30%  99.20%  99.90%  99.90%

*   A total of 17,461 Unigene sequences were used for the completeness check.
**  before upgrade
*** after upgrade

5. Read statistics

Sequence Coverage [6]          18x

                                                          Total
Raw reads (25bp)               250,382,437  249,997,486  500,379,923
Raw reads (50bp)               43,267,758   43,279,703   86,547,461
Uniquely mapped (25bp)[1]      27,879,092   27,879,092   55,758,184
Uniquely mapped (50bp)[1]      3,912,119    3,912,119    7,824,238
Bridge scaffolds (25bp)[2]     3,323,043    3,323,043    6,646,086
Bridge scaffolds (50bp)[2]     720,919      720,919      1,441,838
Within scaffold (25bp)[3]      24,556,049   24,556,049   49,112,098
Within scaffold (50bp)[3]      3,191,200    3,191,200    6,382,400
Mis-oriented reads (25bp)[4]   33,671       33,671       67,342
>5kb insert size (25bp)[5]     94,060       94,060       188,120

[1] Reads from clones whose F3 end and R3 end both uniquely mapped.
[2] Reads which are from [1] and whose F3 end and R3 end are mapped to two different scaffolds.
[3] Reads which are from [1] and whose F3 end and R3 end are mapped to same scaffold.
[4] Reads which are from [3] 25bp and whose F3 end and R3 end are mapped in wrong orientation.
[5] Reads which are from [3] 25bp and whose inferred insert size from mapping are bigger than 5k, too big to be realistic.
[6] Sequence coverage was calculated as the total SOLiD reads bases divided by estimated genome size (800 Mb).
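
The summary percentages quoted in the Introduction can be checked against these read statistics. The sketch below is only an illustration: the coverage calculation uses the rounded read counts quoted in the Introduction (500 million 25bp reads and 46 million 50bp reads; the 50bp count in the table above is larger), and the pair fractions use the first-column counts of the table. All variable and function names are hypothetical.

```python
GENOME_SIZE_BP = 800_000_000  # estimated genome size from footnote [6]


def sequence_coverage(read_sets, genome_size_bp):
    """Total read bases divided by genome size; read_sets is [(count, length_bp), ...]."""
    total_bases = sum(count * length for count, length in read_sets)
    return total_bases / genome_size_bp


# Rounded read counts quoted in the Introduction (section 2).
coverage = sequence_coverage([(500_000_000, 25), (46_000_000, 50)], GENOME_SIZE_BP)

# Counts from the first column of the read statistics table.
uniquely_mapped = 27_879_092 + 3_912_119   # 25bp + 50bp, footnote [1]
bridging = 3_323_043 + 720_919             # 25bp + 50bp, footnote [2]
within_25bp = 24_556_049                   # footnote [3]
misoriented_25bp = 33_671                  # footnote [4]
oversized_25bp = 94_060                    # footnote [5]

bridge_fraction = bridging / uniquely_mapped
oriented_fraction = 1 - misoriented_25bp / within_25bp
insert_ok_fraction = 1 - oversized_25bp / within_25bp

print(f"coverage: {coverage:.1f}x")
print(f"pairs bridging two scaffolds: {bridge_fraction:.1%}")
print(f"correctly oriented within scaffolds: {oriented_fraction:.2%}")
print(f"insert size of 5kb or less: {insert_ok_fraction:.2%}")
```

Under these assumptions the coverage comes out at 18.5x, and the three fractions agree with the "13%", "over 99.8%" and "over 99.6%" figures given in the text.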

6. History

Spur_v2.5 (February, 2010)
Improved assembly of Spur_v2.1 using SOLiD mate pairs.

Spur_v2.1 (September, 2006)
This release is based on Spur_v2.0, with contaminations removed.

Spur_v2.0 (June, 2006)
This release is an independent assembly that combines BAC skim reads and WGS reads.

Spur_v0.5 (April, 2005)
This release update removed 716 contigs of contaminating (non-S. purpuratus) sequence and overlapping (second haplotype) contigs. Otherwise the assembly statistics remain unchanged.

Spur_v0.4 (March, 2005)
This release updated the agp file to omit scaffolds of contaminating (non-S. purpuratus) sequence and update coordinates for 65 pairs of overlapping contigs. Otherwise the assembly statistics remain unchanged.

Spur_v0.3 (November, 2004)
This release is the first, preliminary assembly of the California purple sea urchin, Strongylocentrotus purpuratus, genome.


Assembly_2.6(Spur_2.6)

This version of the updated Strongylocentrotus purpuratus genome sequence was derived from the Spur2.5 version through the removal of contaminating E. coli sequences. The gene sequences have changed very little.


Assembly_3.1(Spur_3.1)

README for Genome Sequence of Strongylocentrotus purpuratus, Spur_v3.1 (June 15th, 2011)

Conditions for Use
The data may be freely downloaded, used in analyses, and repackaged in databases. Some of the data presented here represents work in progress. It is being released by the Baylor College of Medicine Human Genome Sequencing Center (BCM-HGSC) prior to project completion as a public service to allow our colleagues to search for genes or functions and speed their research. These data have not been edited and are presented "as is." You should regard the data as preliminary if it is unpublished. The data providers and associated funding agencies bear no responsibility for the user's reliance upon or interpretation of these data. The accuracy or reliability of the data is not guaranteed or warranted in any way and the providers disclaim liability of any kind. If you use this preliminary information we request that you honor the following conditions:
Please communicate your results to us so that we can incorporate them into the annotation of the final sequence. Contact us at hgsc-help@hgsc.bcm.tmc.edu.
Acknowledge the information obtained from BCM-HGSC in publications by stating in Materials and Methods and Acknowledgements: "Preliminary sequence data was obtained from the Baylor College of Medicine Human Genome Sequencing Center website (https://www.hgsc.bcm.edu/)." Also acknowledge our funding source, which is listed in each project, with a statement such as "The DNA sequence of [organism] was supported by [grant number from funding agency to PI] at the BCM-HGSC." We also request that you notify us when your manuscript is accepted and send us a pre-print of the article.
Use of this data or information derived from it on a web page is permitted, providing the web page contains the statement that "Preliminary sequence data was obtained from the Baylor College of Medicine Human Genome Sequencing Center website (https://www.hgsc.bcm.edu/)." Please inform us of your web page by sending email to hgsc-help@hgsc.bcm.tmc.edu.
All other written or oral public disclosures of research using data from the BCM-HGSC should follow the acknowledgment guidelines outlined above.
However, although we encourage use of this preliminary information for limited studies, we request that you do not publish whole genome or chromosome scale analyses of genes or genomic data prior to the publication of the BCM-HGSC report on the final genome sequence and analysis. Contact the BCM-HGSC at hgsc-help@hgsc.bcm.tmc.edu to discuss a waiver of this request, which could involve simple acknowledgment, co-authorship, or other methods.
Any redistribution of the data should carry this notice.

What's New
This is the eighth release (Spur_3.1) of the genome assembly of sea urchin, Strongylocentrotus purpuratus. This assembly used additional Illumina reads with different end sequence spacing, known as a rainbow library series. The rainbow libraries consist of 4 libraries, a fragment paired end library with ~300bp insert size, and three mate-pair libraries with 1kb, 3kb and 5-6kb inserts. Each library has approximately 10x sequence coverage of genome (see Read statistics below for details). The reads were mapped to the Spur_v2.6 genome assembly and then were used for superscaffolding using the Atlas-Link software and local assembly and gap filling using Atlas-GapFill software. The scaffold N50 increased from 168kb (Spur_v2.6) to 404kb and the contig N50 increased by ~2kb.

Introduction
This is a draft assembly which may contain errors so users should exercise caution. Typical errors in draft genome sequences include misassemblies of repeat sequences, collapses of repeat regions, and artificial duplications in highly polymorphic regions. However, base accuracy in contigs is usually very high, with most errors near the ends of contigs. A rainbow library sequencing strategy was used to increase the contiguity of the S. purpuratus genome assembly, increasing the average scaffold length and closing assembly gaps. Four different shotgun libraries with nominal insert sizes of ~300bp, 1kb, 3kb and 5-6kb were constructed for Illumina sequencing. Reads from the recircularized 1kb, 3kb and 5-6kb libraries were trimmed from the 3' end to different lengths (50bp, 80bp, 120bp) and mapped. Reads that could be mapped were retained at the longest length that mapped, to avoid the mapping issues created when the junction fragment is included in the mapped sequence. The reads from the shorter insert paired end library were mapped using the entire read length. After mapping, all reads were combined and used for super-scaffolding and intra-scaffold gap filling with the Atlas-Link software. Then a local assembly of reads around each assembly gap was carried out to further fill the gaps using the Atlas-GapFill software. As a result, the scaffold N50 increased to 404kb and the contig N50 increased by around 2kb. Comparison to the 17,461 available S. purpuratus Unigene sequences from NCBI showed the genome assembly is nearly complete (see the Sequence and Scaffold statistics section below). The number of Unigene contigs aligning over 100% of their length increased by 0.1%.
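
The trimming strategy for the mate-pair reads (try several 3'-trimmed lengths and keep the longest one that maps) can be sketched as follows. This is a toy illustration under stated assumptions, not the HGSC pipeline: the mapper is replaced by a hypothetical exact, unique substring search, and the example uses very short trim lengths so that it can be followed by hand.

```python
TRIM_LENGTHS = (120, 80, 50)  # lengths named in the text, in bp


def maps_uniquely(seq, reference):
    """Toy stand-in for a read mapper: exact match occurring exactly once."""
    first = reference.find(seq)
    return first != -1 and reference.find(seq, first + 1) == -1


def longest_mappable_prefix(read, reference, lengths=TRIM_LENGTHS):
    """Trim the read from its 3' end and return the longest prefix that maps."""
    for length in sorted(lengths, reverse=True):
        if len(read) < length:
            continue
        candidate = read[:length]  # keep the 5' end, drop the 3' end
        if maps_uniquely(candidate, reference):
            return candidate
    return None


# Hypothetical reference and a read whose 3' end runs into unrelated
# sequence, as happens when a read crosses the mate-pair junction.
reference = "ACGTGCTAGGATCCAATTGGCCTTAAGGCA"
read = "GTGCTAGGA" + "CCCCCCC"
print(longest_mappable_prefix(read, reference, lengths=(12, 8, 5)))
```

In the example the 12-base prefix includes junction sequence and fails to map, while the 8-base prefix "GTGCTAGG" maps uniquely and is the one retained.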

Description of files
The files can be found on the HGSC ftp site ftp://ftp.hgsc.bcm.edu/Spurpuratus/fasta/Spur_v3.1/

  • Contigs directory
  • This directory has 3 files for assembled contigs in the genome; there are no chromosome assignments for the contigs in Spur_3.1. The .gz files are compressed with gzip.
    *Spur_3.1.AGP (AGP file)
    *Spur_3.1.contig.fa (contig fa)
    *Spur_3.1.contig.qual (contig quality)
    *acc_ctg_num.tbl (table listing GenBank accession number for each contig)
    The AGP file describes how to combine the individual contigs to create the linearized genome sequences in the LinearScaffolds directory.

  • Linear Scaffolds directory
  • This directory has 1 fasta file and 1 quality file compressed with gzip.
    *Spur_3.1.linearScaffold.fa (linear scaffold sequence)
    *Spur_3.1.linearScaffold.qual (linear scaffold sequence quality)
    The fasta formatted sequence files are for linearized scaffolds where the gaps between adjacent contigs within a scaffold are filled with 'N's and the captured gap size is estimated from the clone insert size. Each scaffold is a separate sequence within the files.
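
To illustrate how an AGP file combines contigs into a linearized scaffold with 'N'-filled gaps, here is a minimal sketch. It assumes the standard tab-separated AGP column layout (object, start, end, part number, component type, then either component id/start/end/orientation or gap length and gap description) and that the lines for each scaffold appear in part-number order. The function names, scaffold names and sequences are hypothetical, and this is not the code used to produce the linearScaffold files.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (N stays N)."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def build_scaffolds(agp_lines, contigs):
    """Return {scaffold name: sequence} built from AGP lines and contig sequences."""
    parts = {}
    for line in agp_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and AGP header comments
        cols = line.split("\t")
        scaffold, component_type = cols[0], cols[4]
        if component_type in ("N", "U"):
            # Gap line: column 6 holds the gap length.
            piece = "N" * int(cols[5])
        else:
            # Sequence line: component id, 1-based start/end, orientation.
            contig_id, beg, end, orientation = cols[5], int(cols[6]), int(cols[7]), cols[8]
            piece = contigs[contig_id][beg - 1:end]
            if orientation == "-":
                piece = reverse_complement(piece)
        parts.setdefault(scaffold, []).append(piece)
    return {name: "".join(pieces) for name, pieces in parts.items()}


# Toy input: two contigs joined by a 4 bp gap, the second one reversed.
agp = [
    "Scaffold1\t1\t6\t1\tW\tContig1\t1\t6\t+",
    "Scaffold1\t7\t10\t2\tN\t4\tfragment\tyes",
    "Scaffold1\t11\t14\t3\tW\tContig2\t1\t4\t-",
]
contigs = {"Contig1": "ACGTAC", "Contig2": "GGAT"}
print(build_scaffolds(agp, contigs))
```

For the toy input this yields a single 14-base scaffold, ACGTACNNNNATCC, whose length matches the end coordinate on the last AGP line.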

Sequence and Scaffold statistics

Assembly   Type       Number   N50 (bp)  Bases+Gaps (bp)  Bases (bp)
Spur_v3.0  Scaffolds  31,238   404,330   936,069,451      816,170,552
Spur_v3.0  Contigs    174,512  13,472    816,170,552      816,170,552

Alignment of scaffolds to 17,461 Unigene contigs before and after upgrade

Percent of Scaffolds Aligning to Genome over Alignment Length

Alignment length 100% 95% 80% 50%
Spur_v2.6 89.30% 99.10% 99.80% 99.90%
Spur_v3.0 89.40% 99.20% 99.80% 99.90%
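
A table of this kind can be derived from per-sequence alignment results. The sketch below assumes that, for each Unigene sequence, the number of aligned bases and the total sequence length are already known; how the alignments themselves were produced is not described here, and the function name and toy values are hypothetical.

```python
def percent_aligning(alignments, thresholds=(100, 95, 80, 50)):
    """Percentage of sequences aligned over at least each threshold of their length.

    alignments is a list of (aligned_bases, sequence_length) tuples and
    thresholds are whole percentages, compared using integer arithmetic.
    """
    total = len(alignments)
    result = {}
    for pct in thresholds:
        hits = sum(1 for aligned, length in alignments if aligned * 100 >= pct * length)
        result[pct] = 100.0 * hits / total
    return result


# Toy data: four sequences of 1000 bp with different aligned lengths.
toy = [(1000, 1000), (960, 1000), (850, 1000), (400, 1000)]
print(percent_aligning(toy))
```

Because each threshold is cumulative, the percentages can only stay the same or grow from the 100% column to the 50% column, which is the pattern seen in the table.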

Read statistics

                  PE (300bp)  1kb         3kb         5-6kb
Total reads       69,680,000  73,200,000  84,453,334  85,480,000
Read length       125bp       150bp       150bp       150bp
Mapped            67.0%       67.2%       68.2%       64.4%
Bridge contigs    3,840,891   6,600,203   9,684,914   10,558,589
Within contigs    32,467,846  21,503,082  20,154,150  14,136,572
Mis-oriented[1]   32,996      41,662      74,554      67,288
Over distance[2]  763,098     430,730     435,184     678,546
Good pairs[3]     31,671,752  21,030,690  19,994,412  13,390,738

[1] Mis-oriented reads map with an orientation of the two ends of the pair that is not expected. For PE, reads are expected to be oriented as -> <-.
[2] Over distance indicates the count of mates with excessive distance between mates. The following insert size cutoffs were used:
    PE: > 800bp
    1k: > 2000bp
    3k: > 4000bp
    5-6k: > 8000bp
[3] Good pairs refers to pairs in expected orientation and insert size.
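
The three pair categories in the footnotes can be expressed as a small classifier. This is a sketch under stated assumptions: the insert size cutoffs are the ones listed in footnote [2], but the expected orientation is passed in as a parameter (defaulting to a forward read on the left and a reverse read on the right, an assumption for the PE library), and the insert size is approximated from the two mapped start positions and a fixed read length. The function and argument names are hypothetical.

```python
# Insert size cutoffs from footnote [2], in bp.
MAX_INSERT_BP = {"PE": 800, "1kb": 2000, "3kb": 4000, "5-6kb": 8000}


def classify_pair(library, start1, strand1, start2, strand2, read_length,
                  expected=("+", "-")):
    """Classify a pair mapped to the same contig.

    expected gives the strands of the leftmost and rightmost read; the
    default ("+", "-") is an assumption for paired-end reads.
    """
    (left_start, left_strand), (right_start, right_strand) = sorted(
        [(start1, strand1), (start2, strand2)]
    )
    if (left_strand, right_strand) != expected:
        return "mis-oriented"
    insert_size = right_start + read_length - left_start
    if insert_size > MAX_INSERT_BP[library]:
        return "over distance"
    return "good"


print(classify_pair("PE", 100, "+", 300, "-", 125))
print(classify_pair("PE", 100, "+", 1000, "-", 125))
print(classify_pair("PE", 100, "-", 300, "+", 125))
```

The three calls print "good", "over distance" and "mis-oriented" respectively; orientation is checked before distance, so a pair is counted in only one category.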

History

Spur_3.1 (June, 2011)
Contamination removed version of Spur_3.0.

Spur_v3.0 (March, 2011)
This release is an improved assembly using a variety of Illumina libraries with different mate-pair distances for scaffolding and gap filling.

Spur_v2.6 (April, 2010)
Contamination removed version of Spur_v2.5.

Spur_v2.5 (February, 2010)
Improved assembly of Spur_v2.1 using SOLiD mate pairs.

Spur_v2.1 (September, 2006)
This release is based on Spur_v2.0, with contaminations removed.

Spur_v2.0 (June, 2006)
This release is an independent assembly that combines BAC skim reads and WGS reads.

Spur_v0.5 (April, 2005)
This release update removed 716 contigs of contaminating (non-S. purpuratus) sequence and overlapping (second haplotype) contigs. Otherwise the assembly statistics remain unchanged.

Spur_v0.4 (March, 2005)
ftp://ftp.hgsc.bcm.tmc.edu/pub/data/
This release updated the agp file to omit scaffolds of contaminating (non-S. purpuratus) sequence and update coordinates for 65 pairs of overlapping contigs. Otherwise the assembly statistics remain unchanged.

Spur_v0.3 (November, 2004)
This release is the first, preliminary assembly of the California purple sea urchin, Strongylocentrotus purpuratus, genome.

Difference between 0.5, 2.0 and 2.1

Introduction

All analyses published in the sea urchin genome paper were based on the 0.5 assembly version of the genome. Subsequently, Baylor released the 2.0 and 2.1 assembly versions. Substantial improvements were achieved from 0.5 to 2.0, whereas the 2.1 assembly is a cleaned-up version of 2.0.

Baylor's description of 2.0 and 2.1 assemblies

Baylor released the following comments to describe 2.0 and 2.1 assemblies:
Spur_2.1 combines BAC reads and WGS reads and utilizes BAC tiling path information. Contaminations identified in Spur_2.0 were removed. Compared to previous assembly releases, Spur_2.1 is more continuous and has fewer false duplications. The Spur_2.1 release was assembled from 2-fold average coverage in sequence reads from Bacterial Artificial Chromosomes (BAC) and 6-fold coverage in Whole Genome Shotgun (WGS) with the HGSC Atlas-2.0 genome assembly system at Baylor College of Medicine. The BAC reads were produced by the Clone-Array Pooled Shotgun Sequencing method (CAPSS) from BAC clones selected based on a minimal Finger Printed Contigs (FPC) tiling path. In CAPSS pooled BAC reads are assigned to individual BACs by deconvolution. Each BAC assembly was enriched with WGS reads that overlap with the individual BAC reads. The mixed reads sets were assembled locally with Atlas. Sets of overlapping BAC clones were identified based on shared WGS reads and sequence overlaps. The overlapping enriched BACs were then merged together to form the backbone of genome assembly. The merged BAC assemblies were further scaffolded using information from mate pairs, BAC clone vector locations, and BAC tiling path information. Finally contigs from the WGS assembly Spur_0.5 were used to fill gaps in BAC assembly to produce Spur_2.0 release. Extensive contamination analysis was done on Spur_2.0 release. Spur_2.1 release was produced by removing contaminated sequences from Spur_v2.0 release.

Comparison between 2.0 and 2.1

Because the 2.0 assembly differs little from its improved version 2.1, most of our bioinformatics calculations are being done on 2.1. We made only a quick comparison between 2.0 and 2.1 and found them to be nearly identical. Among the 114224 scaffolds of the 2.1 assembly, 113694 were copied unchanged from 2.0 and 530 were different. Among those 530 scaffolds in 2.1, 249 + 9 were parts of 2.0 scaffolds, whereas 272 were of the same length as in the 2.0 version but with N regions filled in.

Comparison between 0.5 and 2.1

We generated complete maps between the V0.5 and V2.1 genomes. These maps can be used to convert any resources previously developed on V0.5 to the V2.1 assembly. Among all 114222 V2.1 scaffolds, 83754 are identical to V0.5 scaffolds. The remaining ~30K scaffolds of the V2.1 assembly are significantly different from V0.5. Most of them are large scaffolds containing most SPU genes. 4026 are super-sets of 6572 V0.5 scaffolds and 6106 are parts of V0.5 scaffolds. In the last case, the V0.5 scaffolds had been assembled incorrectly and were broken into parts.

Mapping procedure

The maps were generated in the following manner.
a) All 30-mers in the entire genomes of the 0.5 and 2.1 assemblies were determined.
b) Those sequences were binned together and only the ones satisfying both of the following criteria were kept: (i) the 30-mer matched the W strand of the 0.5 genome, (ii) the 30-mer had exactly one match each in the 0.5 and 2.1 genomes.
c) Those unique 30-mers were combined into longer overlapping regions between the genomes. Because of the way the 30-mers were screened, the derived regions are unambiguous, i.e. repetitive regions are not expected to create any duplication in the mapping.
d) Neighboring fragments from the genome were combined into longer identical genomic regions.
The resulting overlap file can be used to map any region in one assembly to the other, unless the segment lies in a repeat region and cannot be uniquely mapped to the other genome.
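
The mapping procedure can be sketched in a few lines. The code below is a simplified illustration, not the program used to build the maps: it only considers the forward strand of both assemblies (an assumption; reverse-strand matches are ignored), uses k=4 instead of 30 so that the toy example can be checked by hand, and all function names and sequences are hypothetical.

```python
from collections import defaultdict


def kmer_positions(sequences, k):
    """Map each k-mer to the list of (sequence name, 0-based start) where it occurs."""
    positions = defaultdict(list)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            positions[seq[i:i + k]].append((name, i))
    return positions


def unique_anchors(old, new, k):
    """Steps a-b: k-mers with exactly one match in each assembly."""
    old_pos, new_pos = kmer_positions(old, k), kmer_positions(new, k)
    anchors = []
    for kmer, hits in old_pos.items():
        if len(hits) == 1 and len(new_pos.get(kmer, ())) == 1:
            anchors.append(hits[0] + new_pos[kmer][0])
    return sorted(anchors)


def merge_anchors(anchors, k):
    """Steps c-d: merge overlapping or abutting anchors on the same diagonal.

    Returns blocks (old_name, old_start, new_name, new_start, length).
    """
    blocks = []
    for old_name, old_start, new_name, new_start in anchors:
        if blocks:
            b_old, b_old_start, b_new, b_new_start, b_len = blocks[-1]
            same_diagonal = new_start - old_start == b_new_start - b_old_start
            if ((b_old, b_new) == (old_name, new_name) and same_diagonal
                    and old_start <= b_old_start + b_len):
                blocks[-1] = (b_old, b_old_start, b_new, b_new_start,
                              old_start + k - b_old_start)
                continue
        blocks.append((old_name, old_start, new_name, new_start, k))
    return blocks


def convert(blocks, old_name, old_pos):
    """Map a 0-based position in the old assembly to the new one, if covered."""
    for b_old, b_old_start, b_new, b_new_start, b_len in blocks:
        if b_old == old_name and b_old_start <= old_pos < b_old_start + b_len:
            return b_new, b_new_start + (old_pos - b_old_start)
    return None  # position lies outside every uniquely mapped block


# Toy assemblies: the old scaffold is embedded in the new one at offset 2.
old = {"oldA": "TTACGGATCCAT"}
new = {"newB": "GGTTACGGATCCATGG"}
blocks = merge_anchors(unique_anchors(old, new, k=4), k=4)
print(blocks)
print(convert(blocks, "oldA", 5))
```

For the toy input every 4-mer of the old scaffold is unique in both assemblies, so the anchors collapse into a single 12-base block and position 5 of oldA maps to position 7 of newB. Positions in repeats would fall outside every block and return None, matching the caveat above.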
