Clust is a pipeline from Abu-Jamous and Kelly (2018) that will take a gene expression data table (or tables), cluster and extract so that patterns of genes that are co-expressed can be identified. This is a method similar to WGCNA.

To explore clust, I used an s1.xlarge instance on Jetstream (CPU: 24, Mem: 60 GB, Disk: 240 GB, Disk: 240 GB root). This instance size might be excessive, but I already had the instance up and running Jupyter notebook trying to parse and do some dataframe manipulation between 17 large ~1 GB gff3 files. While this gff3 parsing task was frustrating me by not working, I thought that I would try using the clust program since it has been on my list for some time, while I already had the instance running.

This turned out to be an encouraging task because clust is very easy to run!

Make a conda environment for python 2.7. Doesn't matter if python 3 is default environment as long as you specify in command to make python 2, like this:

conda create -n run_clust python=2.7
source activate run_clust

Then, install clust:

pip install clust

Get expression data table formatted. I have one on osf to download as an example:

mkdir clust_data
cd clust_data/
curl -L https://osf.io/fg72b/download -o exp.tsv
cd ..

Format metadata file defining replicates (I have one on osf):

curl -L https://osf.io/2d9em/download -o replicates_file

Then run clust:

clust clust_data/ -r replicates_file

Takes ~5 min! Grab output files:

scp -r ljcohen@<ip-address>:~/kfish_clust/Results_21_Jan_19 .

Voila! Clusters in the Clusters_profiles.pdf file:

Genes in the Clusters_Objects.tsv file:

Getting clusters of all genes across 17 species combined is not exactly answering my question/goal, which is to find the genes with expression patterns that are either in common or different between my 17 species of killifish. But, now I know how to run the program. The next step is to make separate expression tables for each species, then run clust, which will keep track of expression patterns for each species separately.

Output from clust:

ljcohen@js-168-95:/kfish_clust$ clust clust_data/ -r replicates_file 

/===========================================================================\
|                                   Clust                                   |
|    (Optimised consensus clustering of multiple heterogenous datasets)     |
|           Python package version 1.8.10 (2018) Basel Abu-Jamous           |
+---------------------------------------------------------------------------+
| Analysis started at: Monday 21 January 2019 (21:28:01)                    |
| 1. Reading dataset(s)                                                     |
| 2. Data pre-processing                                                    |
|  - Automatic normalisation mode (default in v1.7.0+).                     |
|    Clust automatically normalises your dataset(s).                        |
|    To switch it off, use the `-n 0` option (not recommended).             |
|    Check https://github.com/BaselAbujamous/clust for details.             |
|  - Flat expression profiles filtered out (default in v1.7.0+).            |
|    To switch it off, use the --no-fil-flat option (not recommended).      |
|    Check https://github.com/BaselAbujamous/clust for details.             |
| 3. Seed clusters production (the Bi-CoPaM method)                         |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 4. Cluster evaluation and selection (the M-N scatter plots technique)     |
| 10%                                                                       |
| 20%                                                                       |
| 30%                                                                       |
| 40%                                                                       |
| 50%                                                                       |
| 60%                                                                       |
| 70%                                                                       |
| 80%                                                                       |
| 90%                                                                       |
| 100%                                                                      |
| 5. Cluster optimisation and completion                                    |
| 6. Saving results in                                                      |
| /kfish_clust/Results_21_Jan_19                                            |
+---------------------------------------------------------------------------+
| Analysis finished at: Monday 21 January 2019 (21:32:16)                   |
| Total time consumed: 0 hours, 4 minutes, and 14 seconds                   |
|                                                                           |
\===========================================================================/

/===========================================================================\
|                              RESULTS SUMMARY                              |
+---------------------------------------------------------------------------+
| Clust received 1 dataset with 27775 unique genes. After filtering, 20289  |
| genes made it to the clustering step. Clust generated 12 clusters of      |
| genes, which in total include 5941 genes. The smallest cluster includes   |
| 73 genes, the largest cluster includes 1972 genes, and the average        |
| cluster size is 495.083333333 genes.                                      |
+---------------------------------------------------------------------------+
|                                 Citation                                  |
|                                 ~~~~~~~~                                  |
| When publishing work that uses Clust, please include this citation:       |
| Basel Abu-Jamous and Steven Kelly (2018) Clust: automatic extraction of   |
| optimal co-expressed gene clusters from gene expression data. Genome      |
| Biology 19:172; doi: https://doi.org/10.1186/s13059-018-1536-8.           |
+---------------------------------------------------------------------------+
| For enquiries contact:                                                    |
| Basel Abu-Jamous                                                          |
| Department of Plant Sciences, University of Oxford                        |
| basel.abujamous@plants.ox.ac.uk                                           |
| baselabujamous@gmail.com                                                  |
\===========================================================================/

Also, I learned that clust will take output from OrthoFinder, which I have been previously exploring. I am very pleased with this possibility and will be exploring further with .pep files made with Transdecoder. The challenge now will be to quantify expression based on orthogroups.


Comments

comments powered by Disqus