Does the newer version of Trinity (v2.8.4) make better transcriptome assemblies (for killifish) than the old (v2.2.0)?

tl;dr Yes.

XSEDE Computing Resources

Since I'm overdrawn on my new XSEDE allocation on the PSC Bridges LM (large memory) partition after receiving it on 11/26/2018, a report is required to apply for more.

In case you couldn't tell, I'm a really big fan of the PSC Bridges hpc! I've used 1862 SU out of the 1000 SU they allocated to me on the LM partition and 11484 out of 13000 SU on the RM (regular memory).

In case it might benefit others, here is roughly the progress report that I will submit with my request.

Re-assembling 17 Fundulus killifish transcriptomes

Since receiving my allocation to the PSC Bridges LM partition on 11/26/2018, I have re-assembled 17 transcriptomes from 16 species of Fundulus killifish using the Trinity de novo transcriptome assembler version 2.8.4. I originally assembled transcriptomes from the 16 Fundulus species using Trinity version 2.2.0.

There is some evidence that assemblies can be improved by using the updated versions of Trinity. I wanted to re-assemble these killifish transcriptomes to see if the assemblies could be improved.

These transcriptomes will be used for the purpose of studying the history of adaptation to different salinities. Some Fundulus species can tolerate a range of salinities (euryhaline) by switching osmoregulatory mechanisms while others require a more narrow salinity range (stenohaline) in either fresh or marine waters. This large set of RNAseq data from multiple species is going to help us better characterize the divergence of these molecular osmoregulatory mechanisms between euryhaline and stenohaline freshwater species. To that end, I am attempting to develop a pipeline for orthologous gene expression profiling analysis across species.

This profiling analysis will enable us to compare expression patterns of salinity-responsive genes, e.g. aquaporin 3 (shown below) across clades of species.

The nature of the fragmented assembly output from Trinity makes it especially challenging in our attempt to develop this pipeline. We want to make sure that the reference transcriptomes are the best that they can be so that expression profiles are accurately reflected across species.

Are there differences between the v2.2.0 and v2.8.4 assemblies?

Yes. Are they better? It seems so. But, there are a few items to investigate further.

Below are comparisons of various evaluation metrics between the 17 old v2.2.0 (blue) and new v2.8.4 (green) Trinity assemblies. On the left are slope graphs comparing the metric between both assemblies and on the right are split violin plots showing the distributions of all the assemblies.

Shout out to Dr. Harriet Alexander for helping with the code to visualize these types of comparisons for our paper comparing re-assemblies for the MMETSP - coming out soon in GigaScience!

Code for making these plots is here.

There are fewer contigs.

However, the large numbers >100,000 contigs indicates that these assemblies are still very fragmented. Orthogroup/ortholog prediction is required for downstream analysis.

The BUSCO scores are higher.

What makes these assemblies especially appealing is their higher scores with the Benchmarking Universal Single-Copy Ortholog (BUSCO) assessment tool. One species, Lucania goodei had 100%! The distribution of scores is tighter for the new assemblies compared to the old.

Lower % ORF?

For some reason, mean % ORF decreases. This was not expected.

Number of contigs with ORF

This is roughly the same and a very low number.

Similar unique k-mers

The number of unique k-mers (k=25) does not really change. This means that the content is similar.

Conditional Reciprocal Best Blast

Using transrate --assembly reference mode to examine the comparative metrics with Conditional Reciprocal Best BLAST (CRBB) between assemblies consistently did not work, for some reason. (Whereas it did work for comparison with the NCBI version of the Fundulus heteroclitus sister species) This requires further investigation.

Version:

Transrate v1.0.3
by Richard Smith-Unna, Chris Boursnell, Rob Patro,
   Julian Hibberd, and Steve Kelly

Command:

transrate --assembly=/pylon5/bi5fpmp/ljcohen/kfish_trinity/F_parvapinis.trinity_out.Trinity.fasta --reference=/pylon5/bi5fpmp/ljcohen/kfish_assemblies_old/F_parvapinis.trinity_out.Trinity.fasta --threads=8 --output=/pylon5/bi5fpmp/ljcohen/kfish_transrate/F_parvapinis_trinity_v_old/

Output:

[ INFO] 2018-12-06 22:23:47 : Loading assembly: /pylon5/bi5fpmp/ljcohen/kfish_trinity/F_parvapinis.trinity_out.Trinity.fasta
[ INFO] 2018-12-06 22:24:45 : Analysing assembly: /pylon5/bi5fpmp/ljcohen/kfish_trinity/F_parvapinis.trinity_out.Trinity.fasta
[ INFO] 2018-12-06 22:24:45 : Results will be saved in /pylon5/bi5fpmp/ljcohen/kfish_transrate/F_parvapinis_trinity_v_old/F_parvapinis.trinity_out.Trinity
[ INFO] 2018-12-06 22:24:45 : Calculating contig metrics...
[ INFO] 2018-12-06 22:25:37 : Contig metrics:
[ INFO] 2018-12-06 22:25:37 : -----------------------------------
[ INFO] 2018-12-06 22:25:37 : n seqs                       298549
[ INFO] 2018-12-06 22:25:37 : smallest                        183
[ INFO] 2018-12-06 22:25:37 : largest                       27771
[ INFO] 2018-12-06 22:25:37 : n bases                   310786992
[ INFO] 2018-12-06 22:25:37 : mean len                    1040.98
[ INFO] 2018-12-06 22:25:37 : n under 200                      15
[ INFO] 2018-12-06 22:25:37 : n over 1k                     77121
[ INFO] 2018-12-06 22:25:37 : n over 10k                      802
[ INFO] 2018-12-06 22:25:37 : n with orf                    62453
[ INFO] 2018-12-06 22:25:37 : mean orf percent              43.26
[ INFO] 2018-12-06 22:25:37 : n90                             340
[ INFO] 2018-12-06 22:25:37 : n70                            1141
[ INFO] 2018-12-06 22:25:37 : n50                            2512
[ INFO] 2018-12-06 22:25:37 : n30                            4111
[ INFO] 2018-12-06 22:25:37 : n10                            6966
[ INFO] 2018-12-06 22:25:37 : gc                             0.46
[ INFO] 2018-12-06 22:25:37 : bases n                           0
[ INFO] 2018-12-06 22:25:37 : proportion n                    0.0
[ INFO] 2018-12-06 22:25:37 : Contig metrics done in 52 seconds
[ INFO] 2018-12-06 22:25:37 : No reads provided, skipping read diagnostics
[ INFO] 2018-12-06 22:25:37 : Calculating comparative metrics...
[ INFO] 2018-12-06 23:23:24 : Comparative metrics:
[ INFO] 2018-12-06 23:23:24 : -----------------------------------
[ INFO] 2018-12-06 23:23:24 : CRBB hits                         0
[ INFO] 2018-12-06 23:23:24 : n contigs with CRBB               0
[ INFO] 2018-12-06 23:23:24 : p contigs with CRBB             0.0
[ INFO] 2018-12-06 23:23:24 : rbh per reference               0.0
[ INFO] 2018-12-06 23:23:24 : n refs with CRBB                  0
[ INFO] 2018-12-06 23:23:24 : p refs with CRBB                0.0
[ INFO] 2018-12-06 23:23:24 : cov25                             0
[ INFO] 2018-12-06 23:23:24 : p cov25                         0.0
[ INFO] 2018-12-06 23:23:24 : cov50                             0
[ INFO] 2018-12-06 23:23:24 : p cov50                         0.0
[ INFO] 2018-12-06 23:23:24 : cov75                             0
[ INFO] 2018-12-06 23:23:24 : p cov75                         0.0
[ INFO] 2018-12-06 23:23:24 : cov85                             0
[ INFO] 2018-12-06 23:23:24 : p cov85                         0.0
[ INFO] 2018-12-06 23:23:24 : cov95                             0
[ INFO] 2018-12-06 23:23:24 : p cov95                         0.0
[ INFO] 2018-12-06 23:23:24 : reference coverage              0.0
[ INFO] 2018-12-06 23:23:24 : Comparative metrics done in 3467 seconds
[ INFO] 2018-12-06 23:23:24 : -----------------------------------
[ INFO] 2018-12-06 23:23:24 : Writing contig metrics for each contig to /pylon5/bi5fpmp/ljcohen/kfish_transrate/F_parvapinis_trinity_v_old/F_parvapinis.trinity_out.Trinity/contigs.csv
[ INFO] 2018-12-06 23:23:43 : Writing analysis results to assemblies.csv

Conclusion

Trinity v2.8.4 is better than v2.2.0. While v2.8.4 still produces very fragmented assemblies, the higher BUSCO content is exciting.

There are questions requiring further investigation

Why, if the unique k-mer content is similiar, would the BUSCO scores improve between versions?
ORF content (number of contigs with ORF and mean ORF %) are parodoxical. Why would the ORF content decrease in the newer assemblies?
Why wouldn't transrate ---reference work to get CRBB metrics between these assemblies?

What has improved in Trinity v2.8.4?

According to the release notes from Trinity, major improvements have included using salmon expression quantification to help with filtering out assembly artifacts and overhauling the "supertranscript module" to deal with high polymorphism situations. Since killifish are highly heterozygous, we are likely benefitting from these improvements.

Future directions

In addition to annotating these transcritomes and continuing to investigate the questions above, I would also like to try Matt MacManes' Oyster River Protocol and accompanying 'orthofuser' script to combine contigs from different assemblers.

monsteRbashSeq

New vs. Old versions of Trinity