multi-modal-imputation-spatial-gene-expression

Imputation of Spatial Gene Expression with a Multi Modality Deep Learning Network

Final project submission for MSc in Machine Learning / Data Science, Reichman University.

Submitted by: Itay Ben Shushan.

Supervised by: Leon Anavy.

Brief Background

Spatial RNA Sequencing is a recent advance in RNA sequencing that performs a process similar to single-cell RNA sequencing (scRNA-Seq) while also preserving the original position of each cell in the tissue. The output of Spatial RNA Sequencing is a matrix in which each column represents a single gene, each row represents a single spot in the tissue, and each entry holds the number of copies of a specific gene found in a specific spot. Each spot also carries positional metadata relative to the tissue.
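The layout described above can be pictured with a toy example. This is only an illustrative sketch: the gene names, counts, and coordinates are made up, and the variable names (`counts`, `gene_names`, `spot_xy`) are hypothetical rather than taken from the project code.

```python
import numpy as np

# Toy gene-count matrix: rows are spots, columns are genes (made-up values).
gene_names = ["GeneA", "GeneB", "GeneC"]
counts = np.array([
    [0, 3, 1],   # spot 0
    [2, 0, 0],   # spot 1
    [0, 0, 5],   # spot 2
    [1, 4, 0],   # spot 3
])

# Positional metadata: one (x, y) position per spot, in the same row order.
spot_xy = np.array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
])

# Number of copies of GeneC found in spot 2.
copies = counts[2, gene_names.index("GeneC")]
print(counts.shape, copies, spot_xy[2])
```

Keeping the positions in a separate array with the same row order as the matrix means a spot index is enough to look up both its expression profile and its location.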

In some databases, the output of Spatial RNA-Seq also includes a high-resolution image of the analyzed slice after histological staining. Each spot is then associated with specific X,Y pixel coordinates on the image.

The image below shows a sample slice on the left, and the spot positions outlined on top of the slice on the right.

Due to limitations of both scRNA-Seq and Spatial RNA-Seq methods (which stem from the very small cell count available to the sequencing process), the output matrix contains a significant number of zeros. These zeros do not always indicate a biologically true absence of expression; some are the result of methodological noise. This sparsity complicates downstream analyses of the sequencing data, so performing some form of imputation on single-cell transcription data is a frequent practice.

Goal

The goal of this work is to utilize information available in Spatial RNA Sequencing to perform imputation of the gene-count matrix. Specifically, we intend to utilize three modalities to perform an imputed reconstruction of the gene-count matrix:

In this work we experiment with imputation in each one of the above approaches individually, and then present a network which uses all modalities in a unified architecture. This final architecture is outlined in the figure below.

Results - summary

First, we drop a significant fraction (~50%) of the non-zero elements of the original matrix and reserve them as a test set. The resulting dropped matrix is then used as the training set. After training the network on the dropped matrix, we build an imputed result matrix. We then perform both quantitative and qualitative analysis on the result.
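The hold-out step can be sketched as follows. This is a minimal illustration, assuming the dropped entries are chosen uniformly at random among the non-zero elements; the function name `holdout_nonzero`, its parameters, and the toy counts are hypothetical and not taken from the project code.

```python
import numpy as np

def holdout_nonzero(counts, drop_fraction=0.5, seed=0):
    """Zero out a random fraction of the non-zero entries of `counts`.

    Returns the training matrix (with the dropped entries set to zero) and a
    boolean mask marking the held-out test entries.
    Assumption: entries are sampled uniformly at random without replacement.
    """
    rng = np.random.default_rng(seed)
    rows, cols = np.nonzero(counts)
    n_drop = int(round(drop_fraction * rows.size))
    picked = rng.choice(rows.size, size=n_drop, replace=False)

    test_mask = np.zeros(counts.shape, dtype=bool)
    test_mask[rows[picked], cols[picked]] = True

    train = counts.copy()
    train[test_mask] = 0
    return train, test_mask

# Toy matrix: 4 spots x 5 genes with 10 non-zero entries (made-up counts).
counts = np.array([
    [3, 0, 1, 0, 7],
    [0, 2, 0, 0, 4],
    [5, 0, 0, 1, 0],
    [0, 0, 6, 2, 9],
])
train, test_mask = holdout_nonzero(counts, drop_fraction=0.5, seed=0)
print(int(test_mask.sum()), "entries held out")
```

Returning the mask alongside the training matrix keeps the test indices available later, so the held-out values can be read back from the original matrix without storing a second copy.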

For quantitative analysis, we compare the original values that were dropped with the values at the same indices in the imputed matrix, and measure the RMSE. See the table below.
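As a rough sketch of this measurement, the RMSE can be computed only over the held-out positions. The function name `heldout_rmse` and the small matrices are hypothetical, and whether the project computes the error on raw or normalized counts is not stated here, so this example simply uses the values as given.

```python
import numpy as np

def heldout_rmse(original, imputed, test_mask):
    """RMSE between original and imputed values at the held-out positions."""
    diff = imputed[test_mask].astype(float) - original[test_mask].astype(float)
    return float(np.sqrt(np.mean(diff ** 2)))

# Made-up 2 x 2 example: two entries were held out (the diagonal).
original = np.array([[3.0, 0.0], [0.0, 4.0]])
imputed = np.array([[2.0, 0.5], [0.1, 6.0]])
test_mask = np.array([[True, False], [False, True]])

rmse = heldout_rmse(original, imputed, test_mask)
print(round(rmse, 4))  # errors are -1 and 2, so RMSE = sqrt(2.5)
```

Restricting the error to the mask means entries the network saw during training, as well as zeros whose true value is unknown, do not affect the score.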

For qualitative analysis, we perform cell-type clustering on three different matrices: the original matrix, the post-dropout training set matrix, and the imputed matrix. When imputation succeeds, we expect the cluster distribution on the imputed matrix to follow a pattern similar to that of the original pre-dropout matrix. See figure below.

Discussion - summary

Viewing the quantitative results in the above table as well as the qualitative results in the above plot, we see a few points worth mentioning: