Kistler L et al. (2017), A massively parallel strategy for STR marker de...

ECB-ART-45601

Nucleic Acids Res 2017 Sep 06;45(15):e142. doi: 10.1093/nar/gkx574.


A massively parallel strategy for STR marker development, capture, and genotyping.

Kistler L, Johnson SM, Irwin MT, Louis EE, Ratan A, Perry GH.

Abstract
Short tandem repeat (STR) variants are highly polymorphic markers that facilitate powerful population genetic analyses. STRs are especially valuable in conservation and ecological genetic research, yielding detailed information on population structure and short-term demographic fluctuations. Massively parallel sequencing has not previously been leveraged for scalable, efficient STR recovery. Here, we present a pipeline for developing STR markers directly from high-throughput shotgun sequencing data without a reference genome, and an approach for highly parallel target STR recovery. We employed our approach to capture a panel of 5000 STRs from a test group of diademed sifakas (Propithecus diadema, n = 3), endangered Malagasy rainforest lemurs, and we report extremely efficient recovery of targeted loci: 97.3–99.6% of STRs characterized with ≥10× non-redundant sequence coverage. We then tested our STR capture strategy on P. diadema fecal DNA, and report robust initial results and suggestions for future implementations. In addition to STR targets, this approach also generates large, genome-wide single nucleotide polymorphism (SNP) panels from flanking regions. Our method provides a cost-effective and scalable solution for rapid recovery of large STR and SNP datasets in any species without needing a reference genome, and can be used even with the suboptimal DNA that is more easily acquired in conservation and ecological studies.

PubMed ID: 28666376
PMC ID: PMC5587753
Article link: Nucleic Acids Res

Genes referenced: LOC115925415


	Figure 1. BaitSTR computational pipeline and massively parallel STR enrichment strategy.
	Figure 2. Orthologous locations of diademed sifaka STR targets on the human reference genome. (A) Genome-wide distribution of genic (blue, n = 2267) and intergenic (red, n = 1982) diademed sifaka STR loci that could be mapped to the human genome.
	Figure 3. Target Enrichment Results. (A) Number of target STR loci recovered in shotgun (n = 2) and captured (n = 5) libraries. Read data were randomly downsampled using SAMtools (32) after read mapping and before genotype calling to normalize all libraries to 30 million input reads for cross-comparability. Actual reads generated per captured sample ranged from 34.8 million to 73 million (Supplemental Table S1). (B) Per-site coverage is highly correlated among samples, illustrating non-random variation in marker enrichment. Marker coverage in the 30 million read subsample is compared between Titania and Oberon capture data (left axis, blue), and Titania and Romeo's tissue library (right axis, red). (C) Enrichment of reads carrying target STRs in subsamples of 30 million reads. Left axis shows the proportion of callable reads; right axis shows the estimated enrichment level given the genomic expectation of 0.000156 of reads on target with no enrichment. For the fecal libraries, enrichment values are given without any correction for the high proportion of exogenous DNA, whereas previous estimates of endogenous fecal DNA content suggest actual enrichment similar to the tissue samples.
	Figure 4. SNP-STR compound markers. (A) Simulated example of a phased STR-linked SNP locus, where the ‘A’ SNP allele associates with six repeats of the TC motif and the ‘G’ allele associates with seven. (B) Proportions of genic, non-genic, and unplaced STR loci (based on mapping analysis to the human reference genome; see Figure 1) with number of SNPs detected at ≥4× coverage on associated inserts from three lemur tissue sample libraries (Oberon, Titania, and Romeo). Forty-two STR loci associated with >15 SNPs (n = 8 genic, n = 25 non-genic, n = 9 unplaced) are not shown.
	Figure 5. Simulated performance of STR discovery using BaitSTR. (A) At variable k-mer lengths and coverage levels, the number of simulated bi-allelic STRs in a random synthetic genome that could be discovered with no requirement of local extension. 1000 total markers were present. (B) Using the same simulated STRs in a synthetic genome, this simulation required successful block extension to 200 nt non-repeat flanks and a total contig length of 500 nt. (C) At variable coverage levels, the number of heterozygous STRs discovered in the NA12878 genome data, along with a low frequency of false positives and the number of heterozygous SNPs recovered from extended blocks.
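
The enrichment estimate described in the Figure 3C caption can be illustrated with a minimal sketch. It assumes that the enrichment level is simply the observed proportion of on-target reads divided by the stated genomic expectation of 0.000156; the function name and the read counts below are hypothetical and are not taken from the article.

```python
# Genomic expectation quoted in the Figure 3C caption: the fraction of reads
# expected to carry a target STR when no enrichment is applied.
GENOMIC_EXPECTATION = 0.000156


def fold_enrichment(on_target_reads, total_reads, expected=GENOMIC_EXPECTATION):
    """Return observed on-target proportion divided by the unenriched expectation.

    Assumption (not confirmed by the source): enrichment is a simple ratio of
    proportions, with no correction for exogenous DNA content.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if expected <= 0:
        raise ValueError("expected must be positive")
    observed = on_target_reads / total_reads
    return observed / expected


# Hypothetical example: a 30 million read subsample in which 1.5 million
# reads carry a target STR (illustrative numbers only).
example = fold_enrichment(1_500_000, 30_000_000)
print(f"Estimated enrichment: {example:.1f}x")
```

Under this assumption, a library whose on-target proportion equals 0.000156 has an enrichment of 1, meaning no enrichment over shotgun sequencing. As the caption notes, values for fecal libraries computed this way are uncorrected for the high proportion of exogenous DNA.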

References:

Agrafioti, SNPSTR: a database of compound microsatellite-SNP markers. 2007, Pubmed