Loading data into astir¶

Table of Contents:¶

0 Loading necessary libraries
1 Starting Astir within python
2 Loading from csv and yaml files
3 Loading from a directory of csvs and yaml
4 Loading from loom
5 Loading from anndata
6 Loading model from hdf5
7 Converting rds file (with data as SingleCellExperiment) to csv file

0. Loading necessary libraries and define paths¶

[1]:

# !pip install -e ../../..
import os
import sys

module_path = os.path.abspath(os.path.join('../../..'))
if module_path not in sys.path:
    sys.path.append(module_path)
module_path = os.path.abspath(os.path.join('../../../astir'))
if module_path not in sys.path:
    sys.path.append(module_path)
print(sys.path)
%load_ext autoreload
%autoreload 2

['/Users/jinelles.h/Documents/Camlab/astir-top-level/astir/docs/tutorials/notebooks', '/Users/jinelles.h/opt/anaconda3/lib/python37.zip', '/Users/jinelles.h/opt/anaconda3/lib/python3.7', '/Users/jinelles.h/opt/anaconda3/lib/python3.7/lib-dynload', '', '/Users/jinelles.h/.local/lib/python3.7/site-packages', '/Users/jinelles.h/opt/anaconda3/lib/python3.7/site-packages', '/Users/jinelles.h/opt/anaconda3/lib/python3.7/site-packages/aeosa', '/Users/jinelles.h/Documents/Camlab/astir-top-level/astir', '/Users/jinelles.h/Documents/Camlab/astir-top-level/margo', '/Users/jinelles.h/opt/anaconda3/lib/python3.7/site-packages/IPython/extensions', '/Users/jinelles.h/.ipython', '/Users/jinelles.h/Documents/Camlab/astir-top-level/astir/astir']

[2]:

import yaml
import pandas as pd
import astir as ast
import numpy as np
import torch

[3]:

yaml_marker_path = "../../../tests/test-data/jackson-2020-markers.yml"
design_mat_path = "../../../tests/test-data/design.csv"
expression_mat_path = "../../../tests/test-data/test_data.csv"
expression_dir_path = "../../../tests/test-data/test-dir-read"
expression_loom_path = "../../../tests/test-data/basel_100.loom"
expression_anndata_path="../../../tests/test-data/adata_small.h5ad"

1. Starting Astir within python¶

The input dataset should represent protein expression in single cells. The rows should represent each cell (one row per cell) and the columns should represent each protein (one column per protein). A marker which maps the features (proteins) to cell type/state may is required. A design matrix is optional. If provided, it should be either np.array or pd.DataFrame.

The initialization of Astir requires input dataset input_expr as one of pd.DataFrame, Tuple[np.array, List[str], List[str]] and Tuple[SCDataset, SCDataset].

Note: dtype and random_seed are always customizable. dtype is default to torch.float64 and random_seed is default to 1234.

1.0 Loading marker dictionary and design matrix¶

Marker Dictionary¶

[4]:

with open(yaml_marker_path, "r") as stream:
    marker_dict = yaml.safe_load(stream)
print(marker_dict)

{'cell_states': {'RTK_signalling': ['Her2', 'EGFR'], 'proliferation': ['Ki-67', 'phospho Histone'], 'mTOR_signalling': ['phospho mTOR', 'phospho S6'], 'apoptosis': ['cleaved PARP', 'Cleaved Caspase3']}, 'cell_types': {'stromal': ['Vimentin', 'Fibronectin'], 'B cells': ['CD45', 'CD20'], 'T cells': ['CD45', 'CD3'], 'macrophage': ['CD45', 'CD68'], 'epithelial(basal)': ['E-Cadherin', 'pan Cytokeratin', 'Cytokeratin 5', 'Cytokeratin 14', 'Her2'], 'epithelial(luminal)': ['E-Cadherin', 'pan Cytokeratin', 'Cytokeratin 7', 'Cytokeratin 8/18', 'Cytokeratin 19', 'Cytokeratin 5', 'Her2']}, 'hierarchy': {'epithelial_cells': ['epithelial(luminal)', 'epithelial(basal)'], 'immune cells': {'non-lymphocytes': ['macrophage'], 'lymphocytes': ['T cells', 'B cells']}}}

Some notes:

The marker marker_dict is not required when input_expr is Tuple[SCDataset, SCDataset]. Otherwise, it is required to be Dict[str, Dict[str, str]].
The outer dictionary may have at most three keys: cell_type, cell_state and hierarchy. cell_type and cell_state maps to the corresponding dictionary which maps the name of cell type/state to protein features. hierarchy maps to the dictionary which indicates the cell type hierarchy.
If the user is only intended to classify one of cell type and cell state, only the intended marker dictionary should be provided. So that marker_dict is one of {"cell_state": {...}}, {"cell_type": {...}} and {"cell_type": {...}, "cell_state": {...}}.
The hierarchy dictionary should be included when the client tends to call Astir.assign_celltype_hierarchy()

Design matrix:¶

Note that the design matrix must have the same number of rows as there are number of cells.

[5]:

design_df = pd.read_csv(design_mat_path, index_col=0)
print(design_df.shape)

(49, 40)

Note: design is not necessary when input_expr is Tuple[SCDataset, SCDataset]. Otherwise it is optional.

1.1 Loading data as `pd.DataFrame`¶

When the input is pd.DataFrame, its row and column should respectively represent the cells and the features (proteins).

[6]:

df_expr = pd.read_csv(expression_mat_path, index_col=0)
df_expr.head()

[6]:

	EGFR	Ruthenium_1	Ruthenium_2	Ruthenium_3	Ruthenium_4	Ruthenium_5	Ruthenium_6	Ruthenium_7	E-Cadherin	DNA1	...	CD45	CD68	CD3	Carbonic Anhydrase IX	Cytokeratin 8/18	Cytokeratin 7	Twist	phospho Histone	phospho mTOR	phospho S6
BaselTMA_SP41_126_X14Y7_1	0.281753	1.319588	0.597380	1.782863	1.757824	1.991857	2.580564	2.287167	1.814309	2.261638	...	0.044733	0.184805	0.000000	0.928929	0.025526	0.043423	0.209742	0.137454	0.572811	0.215508
BaselTMA_SP41_126_X14Y7_2	0.303016	1.319588	0.597380	1.782863	1.757824	1.991857	2.580564	2.287167	1.517685	1.613060	...	0.046802	0.080406	0.110806	0.752101	0.000000	0.032056	0.108013	0.048428	0.539647	0.655731
BaselTMA_SP41_126_X14Y7_3	0.252374	1.319588	0.597380	1.782863	1.757824	1.991857	2.580564	2.287167	1.246433	2.138744	...	0.028499	0.203248	0.020617	0.740759	0.083311	0.081503	0.119058	0.063097	0.409735	0.437845
BaselTMA_SP41_126_X14Y7_4	0.397732	1.306852	0.534496	1.678217	1.757824	1.961430	2.528551	2.183814	1.839785	1.816015	...	0.069053	0.305200	0.060264	1.095968	0.184603	0.131531	0.160778	0.090666	0.305718	0.132236
BaselTMA_SP41_126_X14Y7_5	0.426352	1.173439	0.597380	1.589303	1.389839	1.789887	2.343743	2.123334	1.618347	1.355214	...	0.233777	0.135084	0.057195	1.427983	0.035371	0.038448	0.014434	0.127032	0.261205	0.157786

5 rows × 45 columns

[7]:

a_df = ast.Astir(input_expr=df_expr, marker_dict=marker_dict, design=design_df)
print(a_df)

Astir object, 6 cell types, 4 cell states, 49 cells

1.2 Loading data as `np.array` or `torch.tensor`¶

When the input is Tuple[Union[np.array, torch.tensor]], List[str], List[str]], the first element np.array or torch.tensor is the input dataset, the second element List[str] is the title of the columns (the names of proteins) and the third element List[str] is the title of the rows (the name　of the cells). The length of the second and third list should be equal to the number of columns and rows of the first array.

[8]:

# Load as np.array
np_expr = df_expr.values
features = list(df_expr.columns)
cores = list(df_expr.index)
a_np = ast.Astir(input_expr=(np_expr, features, cores), marker_dict=marker_dict, design=design_df)
print(a_np)

Astir object, 6 cell types, 4 cell states, 49 cells

[9]:

# Load as torch.tensor
t_expr = torch.from_numpy(np_expr)
a_t = ast.Astir(input_expr=(t_expr, features, cores), marker_dict=marker_dict, design=design_df)
print(a_t)

Astir object, 6 cell types, 4 cell states, 49 cells

1.3 Loading data as `SCDataset`¶

When the input is Tuple[SCDataset, SCDataset], the first SCDataset should be the cell type dataset and the second SCDataset should be the cell state dataset.

[10]:

type_scd = ast.SCDataset(expr_input=df_expr, marker_dict=marker_dict["cell_types"],
                         include_other_column=True, design=design_df)
state_scd = ast.SCDataset(expr_input=df_expr, marker_dict=marker_dict["cell_states"],
                          include_other_column=False, design=design_df)
a_scd = ast.Astir(input_expr=(type_scd, state_scd))
print(a_scd)

Astir object, 6 cell types, 4 cell states, 49 cells

2. Loading from csv and yaml files¶

A data reader from_csv_yaml for loading csv and yaml file is provided.

The row of the csv file should represent the information of each single cells and the column of the csv file should represent the expression of each protein in different cells.

[11]:

a_csv = ast.from_csv_yaml(csv_input=expression_mat_path, marker_yaml=yaml_marker_path, design_csv=design_mat_path)
print(a_csv)

Astir object, 6 cell types, 4 cell states, 49 cells

Some notes:

The yaml file at yaml_marker_path contains the marker which maps protein features to cell type/state classes. The format should match the description of *marker dictionary.
from_csv_yaml returns an Astir object.
dtype and random_seed are also customizable. dtype is default to torch.float64 and random_seed is default to 1234.

[12]:

type(a_csv.get_type_dataset().get_exprs())

[12]:

torch.Tensor

[13]:

a_csv.get_type_dataset().get_exprs_df().head()

[13]:

	CD20	CD3	CD45	CD68	Cytokeratin 14	Cytokeratin 19	Cytokeratin 5	Cytokeratin 7	Cytokeratin 8/18	E-Cadherin	Fibronectin	Her2	Vimentin	pan Cytokeratin
BaselTMA_SP41_126_X14Y7_1	0.207884	0.000000	0.044733	0.184805	0.134128	0.079956	0.178350	0.043423	0.025526	1.814309	1.039734	0.483007	0.444140	1.187512
BaselTMA_SP41_126_X14Y7_2	0.021506	0.110806	0.046802	0.080406	0.026951	0.066922	0.081147	0.032056	0.000000	1.517685	1.147644	0.513386	0.270070	0.749379
BaselTMA_SP41_126_X14Y7_3	0.008878	0.020617	0.028499	0.203248	0.023515	0.186294	0.076112	0.081503	0.083311	1.246433	0.988906	0.633226	0.233909	1.216521
BaselTMA_SP41_126_X14Y7_4	0.053027	0.060264	0.069053	0.305200	0.114420	0.346273	0.164059	0.131531	0.184603	1.839785	0.842710	0.709272	0.542362	1.354303
BaselTMA_SP41_126_X14Y7_5	0.019127	0.057195	0.233777	0.135084	0.055368	0.124407	0.095323	0.038448	0.035371	1.618347	1.073357	0.482230	0.759944	0.629398

3. Loading from a directory of csvs and yaml¶

The user can also load the data from a directory of csv files with from_csv_dir_yaml.

In this case, every csv file should represent the expression data from different samples. A design matrix will be generated automatically.

[14]:

a_dir = ast.from_csv_dir_yaml(input_dir=expression_dir_path, marker_yaml=yaml_marker_path)
print(a_dir)

Astir object, 6 cell types, 4 cell states, 40 cells

Some notes:

The yaml file at yaml_marker_path contains the marker which maps protein features to cell type/state classes. The format should match the description of *marker dictionary.
from_csv_dir_yaml returns an Astir object.
dtype and random_seed are also customizable. dtype is default to torch.float64 and random_seed is default to 1234.

[15]:

type(a_dir.get_type_dataset().get_exprs())

[15]:

torch.Tensor

[16]:

a_dir.get_type_dataset().get_exprs_df().head()

[16]:

	CD20	CD3	CD45	CD68	Cytokeratin 14	Cytokeratin 19	Cytokeratin 5	Cytokeratin 7	Cytokeratin 8/18	E-Cadherin	Fibronectin	Her2	Vimentin	pan Cytokeratin
BaselTMA_SP41_126_X14Y7_1	0.207884	0.000000	0.044733	0.184805	0.134128	0.079956	0.178350	0.043423	0.025526	1.814309	1.039734	0.483007	0.444140	1.187512
BaselTMA_SP41_126_X14Y7_2	0.021506	0.110806	0.046802	0.080406	0.026951	0.066922	0.081147	0.032056	0.000000	1.517685	1.147644	0.513386	0.270070	0.749379
BaselTMA_SP41_126_X14Y7_3	0.008878	0.020617	0.028499	0.203248	0.023515	0.186294	0.076112	0.081503	0.083311	1.246433	0.988906	0.633226	0.233909	1.216521
BaselTMA_SP41_126_X14Y7_4	0.053027	0.060264	0.069053	0.305200	0.114420	0.346273	0.164059	0.131531	0.184603	1.839785	0.842710	0.709272	0.542362	1.354303
BaselTMA_SP41_126_X14Y7_5	0.019127	0.057195	0.233777	0.135084	0.055368	0.124407	0.095323	0.038448	0.035371	1.618347	1.073357	0.482230	0.759944	0.629398

4. Loading from loom¶

It is also possible to load the data from a loom file with from_loompy_yaml.

[17]:

a_loom = ast.from_loompy_yaml(loom_file=expression_loom_path, marker_yaml=yaml_marker_path,
                         protein_name_attr="protein", cell_name_attr="cell_name", batch_name_attr="batch")
print(a_loom)

Astir object, 6 cell types, 4 cell states, 100 cells

Some notes:

The protein and cell names are taken from ds.ra[protein_name_attr] and ds.ca[cell_name_attr] respectively if specified, and ds.ra["protein"] and ds.cs["cell_name"] otherwise.
If batch_name is sepecified, the corresponding column of ds.ca[batch_name_attr] will be assumed as the batch variable and turned into a design matrix. Otherwise it is taken as ds.ca["batch"]
The yaml file at yaml_marker_path contains the marker which maps protein features to cell type/state classes. The format should match the description of *marker dictionary.
from_loom_yaml returns an Astir object.
dtype and random_seed are also customizable. dtype is default to torch.float64 and random_seed is default to 1234.

[18]:

type(a_loom.get_type_dataset().get_exprs())

[18]:

torch.Tensor

[19]:

a_loom.get_type_dataset().get_exprs_df()

[19]:

	CD20	CD3	CD45	CD68	Cytokeratin 14	Cytokeratin 19	Cytokeratin 5	Cytokeratin 7	Cytokeratin 8/18	E-Cadherin	Fibronectin	Her2	Vimentin	pan Cytokeratin
BaselTMA_SP41_44_X2Y7_726	0.080173	0.010469	0.013425	0.149373	0.128329	1.577395	0.210581	0.661001	1.777394	2.028177	1.606446	0.803987	0.279106	4.026491
BaselTMA_SP41_Liver_X2Y1_1909	0.051691	0.139216	0.061985	0.140193	0.208142	0.129807	0.117969	0.000000	1.211757	0.642365	0.943650	1.488154	0.000000	0.743843
BaselTMA_SP41_231_X6Y6_10_798	0.000000	0.078386	0.144959	0.570016	0.158847	0.325296	0.129998	0.000000	0.166604	0.699211	2.967311	0.891388	1.845164	0.137033
BaselTMA_SP41_141_X11Y2_4968	0.039043	0.028426	0.089894	0.089386	0.075023	0.294802	0.130921	0.105543	0.721960	1.462543	0.607401	0.847732	0.032395	2.527473
BaselTMA_SP41_141_X11Y2_746	0.079079	0.184354	1.174959	0.297893	0.039844	0.177649	0.129131	0.056974	0.017406	0.391993	1.529043	1.196052	0.816324	0.145410
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
BaselTMA_SP42_25_X3Y2_1178	0.110930	0.022230	0.031842	0.076643	0.128341	0.845475	0.036395	0.143434	1.005111	1.917632	0.455979	0.994131	0.314468	2.820787
BaselTMA_SP43_272_X11Y3_460	0.057730	0.124963	0.000000	0.000000	0.000000	0.137121	0.201029	0.000000	0.075655	0.804087	2.755348	0.286016	1.194052	0.000000
BaselTMA_SP42_192_X8Y5_2214	0.053447	0.080335	0.031003	0.050570	0.095160	0.605501	0.156250	0.896955	1.221076	1.452352	0.302654	2.217377	0.687913	2.132569
BaselTMA_SP41_203_X8Y8_2433	0.028729	0.030617	0.322813	1.180702	0.081800	0.182332	0.083335	0.037738	0.212589	1.547722	2.593907	0.265961	0.074016	1.381932
BaselTMA_SP41_249_X3Y9_996	0.228180	0.117249	0.509954	1.180713	0.167626	0.107251	0.116069	0.026734	0.265320	0.502906	1.608899	0.579117	1.673233	1.438507

100 rows × 14 columns

5. Loading from anndata¶

We can read in data from the AnnData format, along with a yaml file containing marker information using the from_anndata_yaml function. We have temporarily disabled this example while the anndata format is standardized:

[1]:

if False:
    a_ann = ast.from_anndata_yaml(anndata_file=expression_anndata_path, marker_yaml=yaml_marker_path,
                            protein_name="protein",cell_name="cell_name", batch_name="batch")
    print(a_ann)

Some notes:

The protein and cell names are taken from adata.var[protein_name] and adata.obs[cell_name] respectively if specified, and adata.var_names and adata.obs_names otherwise.
If batch_name is sepecified, the corresponding column of adata.var will be assumed as the batch variable and turned into a design matrix.
The yaml file at yaml_marker_path contains the marker which maps protein features to cell type/state classes. The format should match the description of *marker dictionary.
from_anndata_yaml returns an Astir object.
dtype and random_seed are also customizable. dtype is default to torch.float64 and random_seed is default to 1234.

[21]:

if False:
    type(a_ann.get_type_dataset().get_exprs())

[21]:

torch.Tensor

[22]:

if False:
    a_ann.get_type_dataset().get_exprs_df()

[22]:

	CD20	CD3	CD45	CD68	Cytokeratin 14	Cytokeratin 19	Cytokeratin 5	Cytokeratin 7	Cytokeratin 8/18	E-Cadherin	Fibronectin	Her2	Vimentin	pan Cytokeratin
ZTMA208_slide_11_By5x8_1	0.168521	0.090277	0.271871	0.412439	0.087354	0.155710	0.100308	0.000000	0.096674	0.974271	2.867470	0.552905	2.335253	1.361075
ZTMA208_slide_11_By5x8_2	0.366301	0.352614	0.284034	0.312862	0.152354	0.508728	0.028651	0.029904	0.749755	2.787740	2.174494	1.046198	0.285699	2.454543
ZTMA208_slide_11_By5x8_3	0.177006	0.103808	0.150791	0.122472	0.292241	0.634366	0.090457	0.056627	0.446911	1.927940	2.997043	1.020517	2.887193	2.590460
ZTMA208_slide_11_By5x8_4	0.304068	0.222802	0.219736	0.277622	0.373870	2.212514	0.304824	0.000000	1.904837	3.175959	1.598163	2.269974	0.877098	4.250308
ZTMA208_slide_11_By5x8_5	0.137789	0.130010	0.105604	1.035280	0.212105	0.144144	0.074692	0.000000	0.000000	1.900182	2.326346	0.610897	2.882146	0.275225
ZTMA208_slide_11_By5x8_6	0.182926	0.169596	0.270698	0.257178	0.224863	1.143546	0.189600	0.001542	0.650384	2.580153	1.891692	1.724237	1.931947	2.994441
ZTMA208_slide_11_By5x8_7	0.239257	0.149007	0.351788	0.138080	0.142505	1.415104	0.124484	0.001245	1.091975	2.696699	1.994174	1.796137	0.127125	3.523499
ZTMA208_slide_11_By5x8_8	0.175299	0.153332	0.215698	0.104709	0.237387	2.190369	0.264600	0.000000	1.457901	2.788996	1.859896	1.726696	0.106661	4.245234
ZTMA208_slide_11_By5x8_9	0.210541	0.118273	0.146135	0.148164	0.362226	1.267224	0.173477	0.000000	0.842407	2.950440	1.852758	2.183716	0.957369	3.098247
ZTMA208_slide_11_By5x8_10	0.308899	0.326121	0.224866	0.276182	0.140240	2.032473	0.334358	0.000000	1.503531	2.938590	2.192502	2.312838	1.337983	4.199266

6. Loading model from hdf5¶

The whole model could be saved with Astir.save_model(<hdf5_name>), and the saved hdf5 file could be used to load a new model. This makes it possible to access the previously trained parameters and assignments without having to train the model again.

[23]:

# Save a trained model
hdf5_summary = "./data/a_summary.hdf5"
a_orig = ast.Astir(input_expr=df_expr, marker_dict=marker_dict, design=design_df)
a_orig.fit_type(max_epochs=5, n_init=1, n_init_epochs=1)
a_orig.fit_type(max_epochs=5, n_init=1, n_init_epochs=1)
a_orig.save_models(hdf5_summary)

training restart 1/1: 100%|██████████| 1/1 [68.35epochs/s, current loss: 286.9]
training restart (final): 100%|██████████| 5/5 [90.87epochs/s, current loss: 229.9]
/Users/jinelles.h/Documents/Camlab/astir-top-level/astir/astir/astir.py:178: UserWarning: Maximum epochs reached. More iteration may be needed to complete the training.
  warnings.warn(msg)
training restart 1/1: 100%|██████████| 1/1 [123.53epochs/s, current loss: 286.9]
training restart (final): 100%|██████████| 5/5 [91.99epochs/s, current loss: 229.9]

[24]:

# Load a trained model
a_load = ast.Astir()
a_load.load_model(hdf5_summary)
print(a_load)

Astir object

Some notes: 1. In the example, we didn’t load any dataset, which means we can’t get access to the original dataset on which the model was trained. 2. The new model could still be used to predict cell type/status, etc. 3. If the user want to do further operation which requires access to the original dataset, simply load it again when initializing:

a_load = ast.Astir(input_expr=df_expr, marker_dict=marker_dict, design=design_df).

Loading data into astir¶

Table of Contents:¶

0. Loading necessary libraries and define paths¶

1. Starting Astir within python¶

1.0 Loading marker dictionary and design matrix¶

Marker Dictionary¶

Design matrix:¶

1.1 Loading data as pd.DataFrame¶

1.2 Loading data as np.array or torch.tensor¶

1.3 Loading data as SCDataset¶

2. Loading from csv and yaml files¶

3. Loading from a directory of csvs and yaml¶

4. Loading from loom¶

5. Loading from anndata¶

6. Loading model from hdf5¶

1.1 Loading data as `pd.DataFrame`¶

1.2 Loading data as `np.array` or `torch.tensor`¶

1.3 Loading data as `SCDataset`¶