Unprobed SeqFISH gene analysis

This notebook demonstrates imputation of unprobed genes for seqFISH ST dataset.

Input data can be downloaded from seqfish.tar.gz, please extract it with

tar -xzvf seqfish.tar.gz

and change the variable pre_datapath in the next cell accordingly.

[ ]:

import torch
import warnings
import pickle

import squidpy as sq
import numpy as np
import scanpy as sc
import pandas as pd

from matplotlib import pyplot as plt
import seaborn as sns
from transpa.util import expTransImp


warnings.filterwarnings('ignore')
pre_datapath = "../../output/preprocessed_dataset/seqFISH_single_cell.pkl"

seed = 10
device = torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")

Step 1. Load preprocessed SeqFISH and single-cell reference datasets.

Retrieve leiden clusters if uncertainty estimation is required

[2]:

with open(pre_datapath, 'rb') as infile:
    spa_adata, scrna_adata, raw_spatial_df, raw_scrna_df, raw_shared_gene = pickle.load(infile)
    cls_key = 'leiden'
    classes = scrna_adata.obs[cls_key]
    ct_list = np.unique(classes)

Step 2. Select genes from single cell reference data

2.1 Select top-1000 highly variable genes

[3]:

sc.pp.highly_variable_genes(scrna_adata, n_top_genes=1000)
scrna_adata

[3]:

AnnData object with n_obs × n_vars = 32844 × 29452
    obs: 'cell', 'barcode', 'sample', 'pool', 'stage', 'sequencing.batch', 'theiler', 'doub.density', 'doublet', 'cluster', 'cluster.sub', 'cluster.stage', 'cluster.theiler', 'stripped', 'celltype', 'colour', 'sizeFactor', 'leiden'
    var: 'ENSEMBL', 'SYMBOL', 'SymbolUniq', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    uns: 'log1p', 'hvg'

2.2 Uniformly select 1000 more genes

Uniformly select 1000 more genes from single-cell reference, take union between highly-variable and random genes, and exclude probed 351 genes to retain the final unprobed genes

[4]:

test_genes = np.setdiff1d(scrna_adata.var_names[scrna_adata.var['highly_variable']], raw_shared_gene)
n_valid_hc_sc_genes = len(test_genes)
print(len(test_genes), " valid hv sc genes")
np.random.seed(42)
test_genes = np.setdiff1d(np.union1d(test_genes, np.random.choice(scrna_adata.var_names, 1000, replace=False)), raw_shared_gene)
len(test_genes)

802  valid hv sc genes

[4]:

Step 3. Run TransImpLR with uncertainty score estimation

Run low-rank mode, raw cell as gene profile, and turn on uncertainty estimation with 200 local bootstrap samples.

[5]:

res = expTransImp(
        df_ref=raw_scrna_df,
        df_tgt=raw_spatial_df,
        train_gene=raw_shared_gene,
        test_gene=np.concatenate([raw_shared_gene, test_genes]),
        n_simulation=200,
        signature_mode='cell',
        mapping_mode='lowrank',
        classes=classes,
        n_epochs=2000,
        seed=seed,
        device=device)

[TransImp] Epoch: 2000/2000, loss: 0.781033, (IMP) 0.781033: 100%|██████████| 2000/2000 [00:32<00:00, 61.25it/s]

Step 4. Prepare adata object for imputation

[6]:

imp_adata = sc.AnnData(res[0])
imp_adata.var_names = np.concatenate([raw_shared_gene, test_genes])
imp_adata.obs = spa_adata.obs.copy()
imp_adata.obsm = spa_adata.obsm
imp_adata.obsp = spa_adata.obsp

Step 5. Compute spatial autocorrelation with Squidpy

[7]:

sq.gr.spatial_autocorr(
    imp_adata,
    n_jobs=10,
)

Step 6. Rank unprobed genes by uncertainty scores (sigmoid transformed)

[8]:

df_var = pd.DataFrame(res[1], index=np.concatenate([raw_shared_gene, test_genes]), columns=['pred_var'])
df_var.loc[test_genes].sort_values(axis=0, by='pred_var', ascending=True)

[8]:

	pred_var
Rplp0	0.498482
Rpl13	0.498531
Rps27a	0.498591
Rps6	0.498615
Rps14	0.498619
...	...
Gm15290	0.500156
Gm14344	0.500156
Gm14243	0.500156
Gm14174	0.500156
Gm29129	0.500156

1754 rows × 1 columns

Step 7. Merge unprobed and observed probed genes and test cell type markers

Figures below show a lot of unrpbed genes (name with postfix “_imp”) are higher ranked markers

[9]:

merged_adata = sc.AnnData(np.hstack([res[0][:, len(raw_shared_gene):], spa_adata.X.toarray()]))
merged_adata.var_names = np.concatenate([[f'{g}_imp' for g in test_genes], spa_adata.var_names])
merged_adata.obs = spa_adata.obs.copy()
merged_adata.obsm = spa_adata.obsm
merged_adata.obsp = spa_adata.obsp
sq.gr.spatial_autocorr(
    merged_adata,
    n_jobs=10,
)
sc.tl.rank_genes_groups(merged_adata, 'celltype_mapped_refined', method='wilcoxon')
sc.pl.rank_genes_groups(merged_adata, n_genes=25, sharey=False)

[10]:

ratio_imp = []
for jth_ct in range(len(merged_adata.uns['rank_genes_groups']['names'][0])):
    imp_counter = 0
    for ith in range(25):
        # print(merged_adata.uns['rank_genes_groups']['names'][ith][jth_ct])
        # raise
        if merged_adata.uns['rank_genes_groups']['names'][ith][jth_ct].endswith("_imp"):
            imp_counter += 1
    ratio_imp.append(imp_counter / 25)

Step 8. Plot ratios of unprobed markers in each cell type

[11]:

sns.set(rc={'figure.figsize':(10,4), 'figure.dpi':100})

sns.barplot(x=pd.Series(np.unique(merged_adata.obs.celltype_mapped_refined), name='Cell type'), color='blue', y=pd.Series(ratio_imp, name="Ratio of Imputed Marker Genes"))
plt.xticks(rotation=45, ha='right', rotation_mode='anchor')

[11]:

(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
        17, 18, 19, 20, 21, 22, 23]),
 [Text(0, 0, 'Allantois'),
  Text(1, 0, 'Anterior somitic tissues'),
  Text(2, 0, 'Blood progenitors'),
  Text(3, 0, 'Cardiomyocytes'),
  Text(4, 0, 'Cranial mesoderm'),
  Text(5, 0, 'Definitive endoderm'),
  Text(6, 0, 'Dermomyotome'),
  Text(7, 0, 'Endothelium'),
  Text(8, 0, 'Erythroid'),
  Text(9, 0, 'ExE endoderm'),
  Text(10, 0, 'Forebrain/Midbrain/Hindbrain'),
  Text(11, 0, 'Gut tube'),
  Text(12, 0, 'Haematoendothelial progenitors'),
  Text(13, 0, 'Intermediate mesoderm'),
  Text(14, 0, 'Lateral plate mesoderm'),
  Text(15, 0, 'Low quality'),
  Text(16, 0, 'Mixed mesenchymal mesoderm'),
  Text(17, 0, 'NMP'),
  Text(18, 0, 'Neural crest'),
  Text(19, 0, 'Presomitic mesoderm'),
  Text(20, 0, 'Sclerotome'),
  Text(21, 0, 'Spinal cord'),
  Text(22, 0, 'Splanchnic mesoderm'),
  Text(23, 0, 'Surface ectoderm')])

[12]:

def plot(genes, spa_adata, n_cols=5, figsize=(20, 20), dpi=100):
    genes = list(set(list(genes)))
    sns.set(rc={'figure.figsize':figsize, 'figure.dpi': dpi})
    plt.rcParams.update({"font.size":30, 'axes.titlesize':30})
    pre_exprs = [spa_adata[:, gene].X.toarray().flatten() for gene in set(list(genes))]
    exprs = []
    for expr in pre_exprs:
        qt = np.quantile(expr, 0.1)
        expr[expr < qt] = qt
        # qt = np.quantile(expr, 0.85)
        # expr[expr > qt] = qt
        exprs.append(expr)
    titles = [f"{gene}\n M.I. {spa_adata.uns['moranI'].loc[gene].I:.3f}" for gene in genes]
    # sc.pl.spatial(spa_adata, color=[gene], spot_size=0.1, title='Truth')


    tmp_adata = sc.AnnData(np.array(exprs).T)
    print(tmp_adata)
    tmp_adata.var_names = titles
    tmp_adata.obsm['spatial'] = spa_adata.obsm['spatial'].copy()
    sc.pl.spatial(tmp_adata, color=titles, spot_size=0.1, title=titles, color_map='OrRd', legend_fontsize=10, wspace=0.00001, hspace=0.5, ncols=n_cols)

Step 9. Plot the top-1 markers in each cell type

Postfix “_imp” indicates unprobed genes

[13]:

plot(list(merged_adata.uns['rank_genes_groups']['names'][0]), merged_adata, n_cols=8, figsize=(6,6), dpi=100)

AnnData object with n_obs × n_vars = 57536 × 23

[ ]: