Genome Assemblies

Table of Contents

  1. Assembly 0.4 (Lvar_0.4)
  2. Assembly 2.2 (Lvar_2.2)
  3. Assembly LvMSCB
  4. Assembly LvPtE5C
  5. Other Comments

Lvar0.4 genome assembly

Introduction

This information is for the first release (Lvar_0.4, NCBI GCA_000239495.1) of the draft genome sequence of the green sea urchin, Lytechinus variegatus . This is a draft sequence and may contain errors so users should exercise caution.
Typical errors in draft genome sequences include misassemblies of repeated sequences, collapses of repeated regions, and unmerged overlaps (e.g. due to polymorphisms) creating artificial duplications.
With a goal of solving the polymorphism issues of the data while maintaining the sequence continuity, The Lvar_0.4 assembly was generated in the following steps:
1) 454 reads (fragment and 2.5kb insert pair ends;~13x coverage) were assembled by CABOG (Celera Assembler) using settings less stringent than the default (unitigger=bog utgErrorRate=0.03 ovlErrorRate=0.08 cnsErrorRate=0.08 cgwErrorRate=0.14 doExtendClearRanges=1)
This step produced 716Mb of contig and 429Mb of degenerate sequences.
2) Both contig and degenerate sequences from the previous step (a total of 1.1Gb) were chopped into fake reads with ~10x coverage (500bp long;450bp overlap;100 bp minimal length). The fake reads were then assembled by Newbler with the option of -large.
This step produced 801Mb contig sequences with N50 size of 1.87kb, which was used as the backbone for the following process.
3) Both 454 and iIlumina pair end reads (300 bp insert) were mapped to the contigs from the previous step. We used BLAT to map the 454 data and bwa(aln+samse) to map the Illumina data, both with the default options. Based on the mapping locations of the pair ends (2.5 kbp insert), contigs were then ordered and oriented into scaffolds using ATLAS-Link. Sum coverage for all Illumina reads was ~21x.
4) ATLAS-GapFill was then used to assemble the reads locally in an attempt to fill the gaps among the contigs within the scaffolds. This final step produced 835Mb sequences with contig N50 size of 6.05kb and scaffold N50 size of 39.17kb.

Conditions for use

These data are made available before scientific publication with the following understanding:

- The data may be freely downloaded, used in analyses, and repackaged in databases.
- Users are free to use the data in scientific papers analyzing particular genes and regions if the providers of this data (Baylor College of Medicine Human Genome Sequencing Center) are properly acknowledged. Please cite the BCM-HGSC web site or publications from BCM-HGSC referring to the genome sequence.
- The BCM-HGSC plans to publish the assembly and genomic annotation of the dataset, including large-scale identification of regions of evolutionary conservation and other features.
- This is in accordance with, and with the understandings in the Fort Lauderdale meeting discussing Community Resource Projects and the resulting NHGRI policy statement (http://www.genome.gov/page.cfm?pageID=10506537).
- Any redistribution of the data should carry this notice.

Description of files

There are 2 directories.
I. Contigs/ directory
This directory has 3 files for assembled contigs in the genome, there is no chromosome assignment for the contigs in Lvar_0.4.
Lvar_0.4.20110428.contigs.agp (agp file)
Lvar_0.4.20110428.contigs.fa (fasta file)
Lvar_0.4.20110428.contigs.fa.qual (qual file)
The Lvar_0.4.20110428.contigs.agp file describes the positions and orientations of the contigs in the group. It takes the standard NCBI format.

II. LinearScaffolds/ directory
This directory has 2 files
Lvar_0.4.20110428.linear.fa
Lvar_0.4.20110428.linear.fa.qual
The sequences are linearized scaffolds where the gaps between adjacent contigs within a scaffold are filled with 'N's and the captured gap size is estimated from the clone insert size.

Sequence statistics

Scaffolds/Contigs Number N50(kb) Bases(Mb) Gap(Mb)
All Scaffolds 330,611 39.17 966 131
All Contigs 518,238 6.05 835 N/A

5. Comparison to ESTs
The Lvar_0.4 assembly was compared to the 454 RNAseq assembly using BLAT:
Align_Span/Isotig_Length

>=50% >=80% >=95% 100%
a Percent 99.5% 97.8% 93.4% 69.2%
b Percent 99.5% 97.3% 91.8% 67.6%

a. After_Gast EST isotigs
b. Before_Gast EST isotigs

6. History

Lvar_0.4 (Apr, 2011) This release was the first assembly of the
Lytechinus variegatus genome.

Back to the top of the page

Lvar 2.2 genome assembly

Assembly details, to the extent known, are described at the NCBI genome entry: GCA_000239495.2.
PacBIO sequence with 13x coverage, and the preceding Lvar 0.4 assembly, were processed with PBJelly to make new scaffolds.

Sequence statistics

Scaffolds/Contigs Number N50(kb) Bases(Mb) Gap(Mb)
All Scaffolds 322,794 46.35 1061 57.1
All Contigs 452,418 9.66 1004 N/A

Back to the top of the page

LvMSCB genome assembly

Introduction

The LvMSCB genome assembly was constructed with MaSuRCA using the 23x coverage 454 reads and 13x PacBIO reads previously employed in the Lvar 0.4 and 2.2 assemblies. In addition it used a new Illumina data set (100x coverage, 400bp insert, paired ends). This assembly has a large number of false joins, resulting from overly aggressive splicing through repeat sequences. However, for that reason it was very useful in determining the repeat sequences present in this genome. That is, long repeat sequences were created by overlapping very similar copies of that repeat, producing a longer repeat than could be otherwise built. However in doing so distant genomic regions were often joined across a hybrid repeat. Additionally, the superreads and megareads it produces are reasonably accurate representations of small sections of the genome. Contiguous nonrepetitive genome regions may be used to confirm predictions made in other assemblies but should not otherwise be relied upon.

Production of Superreads and Megareads

MaSuRCA's generation of megareads was slightly modified. The PacBIO reads were first passed through the CCS routine from Proovread to form, where possible, circular consensus sequences (see below). MaSuRCA then generated superreads from the 100x Illumina PE data. Each superread is a nonforking assembly of Illumina reads. These are mostly if not entirely haplotype specific. The Megaread production was modified slightly by adding Proovread's Siamaera detecting code, which is supposed to detect and truncate readthroughs of the SMRTbell adapter (see below). Megareads are PacBIO reads corrected using the superreads. These are not haplotype specific because the noisy nature of PacBIO reads often results in the best match being to a superread of the other haplotype. In other words, sections of Megareads are frequently converted from one haplotype to the other.

454 read processing

A file SRR.desc containing the list of all 40 454 SRA files for biosample SAMN00205415 was constructed, marking each SRR* with sn, pe, or rs (single, paired end, or RNA-seq). The sra files which were not rs were retrieved to a directory. These were processed as follows (extract and execinput are from drm_tools):

ls -1 *sra \
  | extract -mt -dl '.' -fmt 'sffToCA -libraryname [1] -trim chop -output [1] [1].sff &'\
  | nice execinput > sff2CAB.log 2>&1
# wait until all processes complete
cat SRR*frg >r454_all.frg
rm *fastq SRR*frg SRR*sra

Note, the preceding is not ideal because some of the SRR are from the same library and there may be duplicate reads in different files. For LvPtE5C sffToCA was rerun as follows:

export LIST=`extract -in SRR.desc -if ' sn ' -ifonly -mt -dl ' ' -fmt '[1].sff ' -eol '' -fileeol '\n'`
$TOPDIR/MaSuRCA/CA8/Linux-amd64/bin/sffToCA -libraryname single \
  -trim chop -output single $LIST \
  >sff2CA3S.log 2>&1
rm single.1.fastq single.2.fastq #these are empty
mv single.u.fastq single.fastq

export LIST=`extract -in SRR.desc -if ' pe ' -ifonly -mt -dl ' ' -fmt '[1].sff ' -eol '' -fileeol '\n'`
$TOPDIR/MaSuRCA/CA8/Linux-amd64/bin/sffToCA -insertsize 2500 250 -libraryname pairs \
  -trim chop -linker titanium -output pairs $LIST \
  >sff2CA3P.log 2>&1 &
#made pairs.1.fastq pairs.2.fastq pairs.u.fastq

A better cumulative frg file would have been:

cat single.frag pairs.frg >r454_all.frg

MaSuRCA configuration file

The MaSuRCA project.cfg contains:

DATA
PE= pe 400 20 $TOPDIR/Lv/Lv_ILLUMINA/pe_400_R1.fastq $TOPDIR/Lv/Lv_ILLUMINA/pe_400_R2.fastq
PACBIO=$TOPDIR/Lv/Lv_PACBIO/pacbio_all_CCS.fasta
OTHER=$TOPDIR/Lv/Lv_454/r454_all.frg
END

PARAMETERS
GRAPH_KMER_SIZE = auto
USE_LINKING_MATES = 1
LIMIT_JUMP_COVERAGE = 300
CA_PARAMETERS = cgwErrorRate=0.15 doExtendClearRanges=0
KMER_COUNT_THRESHOLD = 1
NUM_THREADS = 40
JF_SIZE = 80000000000
SOAP_ASSEMBLY=0
END

ccs preprocessing of PacBIO sequences


cd $TOPDIR/Lv/Lv_PACBIO
export PATH=$PATH:$TOPDIR/src/proovread/bin:$TOPDIR/src/proovread/util/bwa
export PATH=$PATH:$TOPDIR/src/proovread/util/Seq*/bin
export PERL5LIB=$TOPDIR/src/proovread/lib
fasta_to_fastq.pl
  | ccseq -t 40 > out_ccs.fastq 2>out_ccs.log
#[17-08-15 11:43:24] [ccseq] Reading STDIN
#[17-08-16 09:37:14] [ccseq] Processed 6564119 subreads from 3060427 reads
#[17-08-16 09:37:14] [ccseq] 1403356 consensus + 1657071 bypassed single subreads
fastasplitn -in out_ccs.fastq -q4 -q2f -n 1 -p 1 \
  | extract -if '>' -mt -dl ' ' -fmt '[1]' -wl 100000 >pacbio_all_CCS.fasta

mega_reads_assemble_nomatch_PROOVREAD.sh

The MaSuRCA provided mega_reads_assemble_nomatch.sh script modified to perform Proovread Siamaera processing is here:
mega_reads_assemble_nomatch_PROOVREAD.sh.

assemble.sh scipt

The assemble.sh script produced by MaSuRCA was modified before running so that its last line became:
$TOPDIR/MaSuRCA/bin/mega_reads_assemble_nomatch_PROOVREAD.sh -t 40 \
   -e $ESTIMATED_GENOME_SIZE -m work1 -p $TOPDIR/Lv/Lv_PACBIO/pacbio_all_CCS.fasta\
   -a $TOPDIR/MaSuRCA/bin/../CA8/Linux-amd64/bin \
   -o " $TOPDIR/Lv/Lv_454/r454_all.frg pe.linking.frg cgwErrorRate=0.15 doExtendClearRanges=0"
   -B 17 -D 0.029

Sequence statistics

A deduplication step is applied to the scaffold file produced by the Celera Assembler to produce the final scaffold LvMSCB, whose statistics are shown below. The final contig file produced by the Celera Assembler whose statistics are shown here has not been similarly treated.

Scaffolds/Contigs Number N50(kb) Bases(Mb) Gap(Mb)
All Scaffolds 32916 59.47 1021 6.02
All Contigs 64833 30803 1058 1380

Back to the top of the page

LvPtE5C genome assembly

Introduction

The LvPtE5C assembly uses the same data sets employed by LvMSCB. In addition it uses the megareads from the LvMSCB assembly to scaffold and fill. LvPtE5C is not a derivative of either Lvar 0.4 or Lvar 2.2. Since the assemblies are independent, regions where LvPtE5C and Lvar 2.2 agree are more likely to be correct, and regions where they differ more likely to be in error (in one or both). The process consists of many steps whose commands are presented in some detail in LvPtE5C_commands.txt. A summary of the steps is presented below.

Illumina read contamination

The S. purpuratus Illumina reads produced at the same time were found to be contaminated with Mus musculus mRNA sequences like NM_018794 and NM_010511. The L. variegatus Illumina data appears not to be similarly contaminated, as these sequences were not found.

Platanus production of contigs

Platanus is used to produce an initial set of contigs from the 100X Illumina paired end data set. This initial set of contigs often contains for any genome region copies from both haplotypes. This was true even though the -u 1.0 command line parameter was used, which should merge some of these haplotype pairs into a single sequence. Because of the highly polymorphic nature of this genome the common deduplication tools are not very good at identifying and eliminating one sequence from each such pair. This is especially true because such sequences often overlap with an offset, which the deduplication tool employed would not recognize as a duplication. Each time two sequences are joined and the gap filled that provides another opportunity for the deduplication method to remove one half of such a pair. This process is repeated three times below.

Assembly overview

  1. Platanus trim 100X Illumina PE data
  2. Platanus trim Illumina MP data
  3. Platanus assemble trimmed 100X Illumina PE data
  4. Platanus scaffold contigs using Illumina and 454 pair data.
  5. FGAP (in a wrapper script) to fill gaps with megareads.
  6. bbmap dedupe.sh to remove duplicates
  7. Map 454 and Illumina mate pair reads onto dedup sequences using Opera preprocess_reads.pl, makes BAM files
  8. OPERA-pacbio-read.pl to prepare more BAM and other linkage files from PacBIO reads and 100x Illumina paired end data onto dedup sequences.
  9. OPERA-LG to scaffold
  10. FGAP (in a wrapper script) to fill gaps with megareads.
  11. bbmap dedupe.sh to remove duplicates
  12. P_RNA_scaffolder to scaffold using RNA-seq reads in SRR1661111 (Illumina RNA-seq)
  13. FGAP (in a wrapper script) to fill gaps with megareads.
  14. bbmap dedupe.sh to remove duplicates

Sequence statistics

A deduplication step is applied to the scaffold file produced by the Celera Assembler to produce the final scaffold LvMSCB, whose statistics are shown below. The final contig file produced by the Celera Assembler whose statistics are shown here has not been similarly treated.

Scaffolds/Contigs Number N50(kb) Bases(Mb) Gap(Mb)
All Scaffolds 175668 55.11 9388 33.19
All Contigs NA NA NA NA

Back to the top of the page

Other Comments

Linkage limitations

Both the Illumina mate pair and 454 paired reads have templates which are roughly 2.5 kbp in size. The mean size of the megareads, which is essentially the same thing as the mean size of the usable part of the raw PacBIO reads, is 2.47 kbp, with a standard deviation of 3.1 kbp. With these limitations on long range linkage assemblers begin having difficulty building over repeats once these reach about 2 kbp. A rare long PacBIO read may span a much longer read, up to around 6 kbp, but given the coverage, it would be extremely unlikely for there to be two such reads. Accepting linkage based on a single PacBIO read is unwise since that read might be chimeric.

Scaffold mapping onto a BAC control

There exists a small set of 27 BAC sequences for this organism which may be used as controls. None of them are completely sequenced, but the pieces are ordered. Two of them, AC146989 and AC131496 overlap. These were extracted and fused with the EMBOSS megamerger program to produce a sequence "bacmerge" 154 kbp long and containing 2200 N characters. Lvar 2.2, LvMSCB, and LvPtE5C were mapped against that with the command
blastn -query /tmp/bacmerge.fasta -subject assembly.fasta -outfmt 6
and the best scaffolds noted. These are shown graphically in the image below, where x,X,Y,Z are mismatches, - is a match, and <N| indicates that the scaffold extends N kbp beyond the end of the reference sequence. x indicates a bad join, which is a problem in two of the assemblies. The 3' ~4.5 kbp of the control is largely composed of repeats having tuples consistent with ~20x copies, and the X,Y,Z sequences in the assemblies each have somewhat different sequences there - none of which match the BAC control well. (Right click on the image and open in a new window if it is distorted.)

Back to the top of the page