Monday, May 6, 2013

Follow the biology (pt. 4). Now we are getting somewhere...

First off, apologies on the complete lack of updates. In all honesty, there hasn't been much going on with this project since last summer but now I'm finally at the point of figuring out what kinds of genes are behind this mysterious brown colony phenotype in P. stutzeri. 

Picking up where I left off, I've been through many unsuccessful attempts to disrupt the brown phenotype with transposon mutagenesis. The other straightforward option for figuring out the genetic basis of this effect is to sequence the whole genome of the mutant strain and find differences between the mutant and the wild type. Luckily this is 2013 and that is easily possible. Confession time: while I know my way around genome-scale data and can handle these types of analyses, I am by no means a full-fledged bioinformatician. I'm a microbial geneticist who can run command-line unix programs and, out of necessity, program a little bit in perl and python. If you have advice, ideas, or a better way to carry out the analyses I'm going to describe below, please leave a comment or contact me outside of the blog. I'm always up for learning new and better ways to analyze data!

Long story short, I decided to sequence a variety of genomes using Illumina 100bp PE libraries on a HiSeq. For a single bacterial genome, one lane of Illumina HiSeq is complete and utter overkill; when you can multiplex and sequence 24 at a time it's still overkill, but less so. In case you are wondering, my upper limit is 24 for this run because of money, not because of lack of want. It's a blunt decision, but one of many cost-benefit decisions you have to weigh when running a lab on a time-dependent budget.

Before I even start to look into the brown mutant genome (next post), the first thing that needs to be done is sequencing and assembly of the reference, "wild type" strain. The bacterium I'm working with here is Pseudomonas stutzeri, specifically strain 28a24 from this paper by Johannes Sikorski. I acquired a murder (not the right collective noun, but any microbiologist can sympathize) of strains from Johannes a few years ago because one of my interests is in the evolutionary effects of natural transformation in natural populations. At the time I had just started a postdoc in Jeff Dangl's lab working with P. syringae, saw the power of Pseudomonads as a system, and wanted to get my hands on naturally transformable strains. I didn't quite know what I was going to do with them at the time, but since I started my lab in Tucson this 28a24 strain has come in very handy as an evolutionary model system.

I received my Illumina read files last week from our sequencing center, and began the slog that can be assembly. While bacterial genome assembly is going to get much easier and more exact in the next 5 years (see here), right now working with Illumina reads is kind of like cooking, in that you start with a base dish and season to taste. My dish of choice for working with Illumina reads is SOAPdenovo. Why, you may ask? Frankly, most of the short-read assemblers perform about equally on bacterial genomes with Illumina PE data. Way back when, I started using Velvet as an assembler, but over time became frustrated with the implementation. I can't quite recall when it happened, but it might have been when Illumina came out with their GAII platform and the computer cluster I was working with at the time didn't have enough memory to actually assemble genomes with Velvet. Regardless...I mean no slight to Daniel Zerbino and the EMBL team, but at that point I went with SOAPdenovo and have been working with it since, due to what I'd like to euphemistically call "research momentum".
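For anyone who hasn't run it before, a SOAPdenovo run on one paired-end library looks roughly like the sketch below. The file names, insert size, and thread count are placeholders for my actual setup; the config keys themselves are SOAPdenovo's standard library format.

```shell
# Minimal SOAPdenovo library config for a single 100bp PE library.
# q1/q2 and avg_ins are placeholders -- set them to match your data.
cat > soap.config <<'EOF'
max_rd_len=100
[LIB]
avg_ins=300
reverse_seq=0
asm_flags=3
q1=reads_R1.fastq
q2=reads_R2.fastq
EOF

# kmer sizes above 31 need the 63mer build; -p sets the thread count
SOAPdenovo-63mer all -s soap.config -K 47 -o pstutzeri_28a24 -p 8
```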

Every time I've performed assemblies with 100bp PE reads, the best kmer size for the assemblies always falls around 47, so I always start around that number and modify as necessary (in case you are curious, helpful tutorials on how De Bruijn graphs work can be found here). One more thing to mention: these reads have already been trimmed for quality. The first thing I noticed with this recent batch of data was that there was A HUGE AMOUNT of it, even when split into 24 different samples. The thing about bacterial genome assemblies is that they start to crap out somewhere above 70 or 80x coverage, basically because depth and random errors confuse the assemblers. If I run the assembler including all of the data, this is the result:

2251 Contigs of 2251 are over  bp -> 4674087 total bp
218 Contigs above 5000bp -> 1612110 total bp
30 Contigs above 10000 bp -> 352858 total bp
0 Contigs above 50000 bp -> 0 total bp
0 Contigs above 100,000 bp -> 0 total bp
0 Contigs above 200,000 bp -> 0 total bp
Mean Contig size = 2076.44913371835 bp
Largest Contig = 20517

The output is from a program Josie Reinhardt wrote when she was a grad student in Corbin Jones' lab (used again because of research momentum). The first line gives the total number of contigs and scaffolds in the assembly, as well as the total assembly size. In this case, including all of the data yields 2,251 total contigs, a total genome size of 4.67Mb, and a mean/largest contig of 2,076/20,517 bp. Not great, but I warned you about including everything.
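Josie's script isn't public as far as I know, but the same summary is easy to recompute from a list of contig lengths. A minimal sketch (my own rewrite, not her code), using the same size thresholds as the output above:

```python
def contig_stats(lengths, thresholds=(5000, 10000, 50000, 100000, 200000)):
    """Summarize an assembly from a list of contig lengths (in bp)."""
    summary = {
        "n_contigs": len(lengths),
        "total_bp": sum(lengths),
        "mean": sum(lengths) / len(lengths),
        "largest": max(lengths),
    }
    for t in thresholds:
        above = [n for n in lengths if n > t]
        # (number of contigs above threshold, total bp in those contigs)
        summary[f"above_{t}"] = (len(above), sum(above))
    return summary

# Toy example with four contigs
print(contig_stats([20000, 12000, 6000, 800]))
```

Feeding it the lengths parsed out of a SOAPdenovo `.scafSeq` file reproduces the tables in this post.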

Next, I'm going to lower the coverage. Judging by this assembly and other P. stutzeri genome sizes, 4.5Mb is right where this genome should be. For 70x or so coverage, I am going to want to include only approximately 1.6 million reads from each PE file ((4,500,000 * 70) / 200). When I run the assembly this time, the output gets significantly better:
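The arithmetic is simple enough to keep in a scratch script: each read pair contributes 2 × 100 bp of coverage, so the number of pairs to sample from each PE file works out like this.

```python
def reads_per_file(genome_size, target_coverage, read_len=100):
    """Read pairs to sample from each PE file for a target depth.

    Each pair contributes 2 * read_len bases of coverage, so dividing
    the total bases needed by that gives the pair count.
    """
    return int(genome_size * target_coverage / (2 * read_len))

n = reads_per_file(4_500_000, 70)
print(n)  # 1,575,000 pairs, i.e. roughly 1.6 million reads from each file
```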

467 Contigs of 467 are over  bp -> 4695960 total bp
224 Contigs above 5000bp -> 4375593 total bp
158 Contigs above 10000 bp -> 3890697 total bp
11 Contigs above 50000 bp -> 709014 total bp
0 Contigs above 100,000 bp -> 0 total bp
0 Contigs above 200,000 bp -> 0 total bp
Mean Contig size = 10055.5888650964 bp
Largest Contig = 81785

I can show you more of this type of data with different amounts of coverage, but it's not going to change the outcome. The only other parameter to change a bit is kmer size. The figures above are for 47, but what happens if I use 49 with this reduced coverage?

382 Contigs of 382 are over  bp -> 4694856 total bp
178 Contigs above 5000bp -> 4429753 total bp
130 Contigs above 10000 bp -> 4065927 total bp
18 Contigs above 50000 bp -> 1317727 total bp
2 Contigs above 100,000 bp -> 211768 total bp
0 Contigs above 200,000 bp -> 0 total bp
Mean Contig size = 12290.1989528796 bp
Largest Contig = 110167

Even better (and the best assemblies I've gotten so far). Increasing kmer size to 51 makes it incrementally worse:

479 Contigs of 479 are over  bp -> 4633694 total bp
164 Contigs above 5000bp -> 4361629 total bp
124 Contigs above 10000 bp -> 4057109 total bp
19 Contigs above 50000 bp -> 1389519 total bp
1 Contigs above 100,000 bp -> 120161 total bp
0 Contigs above 200,000 bp -> 0 total bp
Mean Contig size = 9673.68267223382 bp
Largest Contig = 120161

The last question you might have, since there are other P. stutzeri genomes publicly available, is why not use those and carry out a reference-guided assembly? I haven't really looked into this much yet, but some quick analyses suggest these genomes aren't really that similar to one another (~85% nucleotide identity, aside from many presence/absence polymorphisms), so it's not that easy to actually line them up by nucleotide sequence. Maybe it will work better when I've got protein sequences, but that's for another post.

So now I'm ready to compare the brown phenotype genome vs. this assembled draft of 28a24. I know what the answer is, but I'm going to leave that until the next post.
