Lvar_0.4 is the latest (as of Apr 28th, 2011) assembly of the genome of the green sea urchin, Lytechinus variegatus. The assembly tools CABOG (Celera Assembler), Newbler, ATLAS-Link, and ATLAS-GapFill were used to assemble a combination of 454 reads (fragment and 2.5kb insert paired ends; ~13x coverage) and Illumina reads (300bp insert and 2.5kb insert paired ends; ~21x coverage).
This information is for the first release (Lvar_0.4) of the draft genome sequence of the green sea urchin, Lytechinus variegatus. This is a draft sequence and may contain errors, so users should exercise caution.
Typical errors in draft genome sequences include misassemblies of repeated sequences, collapses of repeated regions, and unmerged overlaps (e.g. due to polymorphisms) creating artificial duplications.
To resolve the polymorphism issues in the data while maintaining sequence continuity, the Lvar_0.4 assembly was generated in the following steps:
1) 454 reads were assembled by CABOG using settings less stringent than the default (unitigger=bog utgErrorRate=0.03 ovlErrorRate=0.08 cnsErrorRate=0.08 cgwErrorRate=0.14 doExtendClearRanges=1)
This step produced 716Mb of contig and 429Mb of degenerate sequences.
2) Both contig and degenerate sequences from the previous step (a total of 1.1Gb) were chopped into fake reads with ~10x coverage (500bp long; 450bp overlap; 100bp minimal length). The fake reads were then assembled by Newbler with the -large option.
This step produced 801Mb of contig sequences with an N50 size of 1.87kb, which were used as the backbone for the following steps.
3) Both 454 and Illumina paired-end reads were mapped to the contigs from the previous step. We used BLAT to map the 454 data and bwa (aln+samse) to map the Illumina data, both with the default options. Based on the mapping locations of the paired ends, contigs were then ordered and oriented into scaffolds using ATLAS-Link.
4) ATLAS-GapFill was then used to assemble the reads locally in an attempt to fill the gaps between the contigs within the scaffolds. This final step produced 835Mb of sequence with a contig N50 size of 6.05kb and a scaffold N50 size of 39.17kb.
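Two of the quantities in the steps above can be made concrete with a short sketch: the fake-read chopping in step 2 and the N50 statistic reported in steps 2 and 4. This is an illustration only, not the script used to build Lvar_0.4. The function names are hypothetical, and the handling of the sequence tail (keeping truncated windows down to the minimal length) is an assumption. With 500bp reads starting every 50bp (500bp minus the 450bp overlap), each base is covered about 500/50 = 10 times, which matches the ~10x figure.

```python
def chop_into_fake_reads(seq, read_len=500, overlap=450, min_len=100):
    """Tile seq with overlapping windows (hypothetical helper).

    Assumption: windows truncated by the end of the sequence are kept
    as long as they are at least min_len bases long.
    """
    step = read_len - overlap  # 50 bp between consecutive window starts
    reads = []
    for start in range(0, len(seq), step):
        piece = seq[start:start + read_len]
        if len(piece) < min_len:
            break  # every later window would be shorter still
        reads.append(piece)
    return reads


def n50(lengths):
    """Return the length L such that sequences of length >= L
    contain at least half of the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    reads = chop_into_fake_reads("A" * 1000)
    print(len(reads), "fake reads from a 1000 bp sequence")
    print("N50 of [2, 3, 4, 5, 6]:", n50([2, 3, 4, 5, 6]))
```

N50 is computed the same way for contigs and scaffolds; only the list of lengths differs.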
Conditions for use
These data are made available before scientific publication with the following understanding:
- The data may be freely downloaded, used in analyses, and repackaged in databases.
- Users are free to use the data in scientific papers analyzing particular genes and regions if the providers of this data (Baylor College of Medicine Human Genome Sequencing Center) are properly acknowledged. Please cite the BCM-HGSC web site or publications from BCM-HGSC referring to the genome sequence.
- The BCM-HGSC plans to publish the assembly and genomic annotation of the dataset, including large-scale identification of regions of evolutionary conservation and other features.
- This is in accordance with the understandings reached at the Fort Lauderdale meeting on Community Resource Projects and with the resulting NHGRI policy statement (https://www.genome.gov/page.cfm?pageID=10506537).
- Any redistribution of the data should carry this notice.
Description of files
There are 2 directories.
I. Contigs/ directory
This directory has 3 files for assembled contigs in the genome. There is no chromosome assignment for the contigs in Lvar_0.4.
Lvar_0.4.20110428.contigs.agp (agp file)
Lvar_0.4.20110428.contigs.fa (fasta file)
Lvar_0.4.20110428.contigs.fa.qual (qual file)
The Lvar_0.4.20110428.contigs.agp file describes the positions and orientations of the contigs in the group. It follows the standard NCBI AGP format.
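As a rough guide to reading the AGP file, here is a minimal parsing sketch. It assumes the usual tab-separated AGP layout, in which columns 1-5 give the object name, its start and end, the part number, and the component type, and columns 6-9 hold either component details or gap details depending on that type. The class name, field names, and example rows are hypothetical and are not taken from the release files.

```python
from typing import NamedTuple, Optional


class AgpRow(NamedTuple):
    scaffold: str                # column 1: object being described
    scaffold_beg: int            # column 2: 1-based start on the object
    scaffold_end: int            # column 3: 1-based inclusive end
    part_number: int             # column 4
    component_type: str          # column 5: e.g. "W" (WGS contig), "N" (gap)
    component_id: Optional[str]  # column 6 for contig rows, None for gaps
    orientation: Optional[str]   # column 9 for contig rows, None for gaps
    gap_length: Optional[int]    # column 6 for gap rows, None for contigs


def parse_agp_line(line: str) -> Optional[AgpRow]:
    """Parse one AGP line; return None for comments and blank lines."""
    if not line.strip() or line.startswith("#"):
        return None
    f = line.rstrip("\n").split("\t")
    head = (f[0], int(f[1]), int(f[2]), int(f[3]), f[4])
    if f[4] in ("N", "U"):  # gap rows: column 6 is the gap length
        return AgpRow(*head, None, None, int(f[5]))
    return AgpRow(*head, f[5], f[8], None)


# Hypothetical rows for illustration only.
example = [
    "Scaffold1\t1\t1500\t1\tW\tContig1\t1\t1500\t+",
    "Scaffold1\t1501\t1600\t2\tN\t100\tfragment\tyes\t",
    "Scaffold1\t1601\t2400\t3\tW\tContig2\t1\t800\t-",
]
rows = [parse_agp_line(text) for text in example]

if __name__ == "__main__":
    for row in rows:
        print(row)
```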
II. LinearScaffolds/ directory
This directory has 2 files
The sequences are linearized scaffolds where the gaps between adjacent contigs within a scaffold are filled with 'N's and the captured gap size is estimated from the clone insert size.
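The linearized layout can be illustrated with a small sketch that joins contigs with runs of 'N' and then recovers the contig coordinates from the result. The function names and toy sequences are hypothetical. The gap sizes are passed in directly here, whereas in the release they are estimated from the clone insert size.

```python
import re


def linearize(contigs, gap_sizes):
    """Join contigs into one scaffold string, padding each gap with 'N's.

    gap_sizes[i] is the estimated gap between contigs[i] and contigs[i + 1].
    """
    if len(gap_sizes) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between adjacent contigs")
    parts = []
    for i, contig in enumerate(contigs):
        parts.append(contig)
        if i < len(gap_sizes):
            parts.append("N" * gap_sizes[i])
    return "".join(parts)


def contig_spans(scaffold):
    """Return 0-based half-open (start, end) spans of the non-N stretches."""
    return [(m.start(), m.end()) for m in re.finditer(r"[^Nn]+", scaffold)]


if __name__ == "__main__":
    scaffold = linearize(["ACGT", "GG"], [3])
    print(scaffold)
    print(contig_spans(scaffold))
```

Splitting on runs of 'N' only approximates the original contig boundaries, since a contig that itself contains ambiguous 'N' bases would be split as well.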
Comparison to ESTs
The Lvar_0.4 assembly was compared to the 454 RNAseq assembly using BLAT:
a. After_Gast EST isotigs
b. Before_Gast EST isotigs
Lvar_0.4 (Apr, 2011): This release was the first assembly of the Lytechinus variegatus genome.