astir.data package

Module contents

Classes:

SCDataset(expr_input, marker_dict, …[, …])

Container for single-cell proteomic data in the form of a pytorch dataset

Functions:

from_anndata_yaml(anndata_file, marker_yaml)

Create an Astir object from an anndata.AnnData file and a marker YAML

from_csv_dir_yaml(input_dir, marker_yaml[, …])

Create an Astir object from a directory containing multiple CSV files

from_csv_yaml(csv_input, marker_yaml[, …])

Create an Astir object from an expression CSV and marker YAML

from_loompy_yaml(loom_file, marker_yaml[, …])

Create an Astir object from a loom file and a marker yaml

class astir.data.SCDataset(expr_input, marker_dict, include_other_column, design=None, dtype=torch.float64, device=device(type='cpu'))[source]

Bases: Generic[torch.utils.data.dataset.T_co]

Container for single-cell proteomic data in the form of a pytorch dataset

Parameters
  • expr_input (Union[DataFrame, Tuple[Union[array, Tensor], List[str], List[str]]]) – Input expression data, given either as a pd.DataFrame or as a three-element tuple. For a pd.DataFrame, the index should hold the cell names and the columns the feature names. For a tuple, the form is Tuple[Union[np.array, torch.Tensor], List[str], List[str]]: the first element is the expression data as a np.array or torch.Tensor, the second is the list of feature (column) names, and the third is the list of cell (index) names.

  • marker_dict (Dict[str, List[str]]) – Marker dictionary containing cell type and state information. It maps the name of each cell type/state to its protein features.

  • design (Union[DataFrame, array, None]) – A design matrix

  • include_other_column (bool) – Should an additional ‘other’ column be included?

  • dtype (dtype) – torch datatype of the model
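As a rough illustration of the two inputs described above, the sketch below assembles a three-element expr_input tuple and a marker_dict and checks that they agree with each other before they are handed to SCDataset. The helper check_inputs, the protein names, and the cell type names are hypothetical and exist only for this example; nested lists stand in for the np.array or torch.Tensor that SCDataset expects, and build_dataset is defined but not called because it needs astir and numpy installed.

```python
from typing import Dict, List, Sequence, Tuple


def check_inputs(
    expr_input: Tuple[Sequence[Sequence[float]], List[str], List[str]],
    marker_dict: Dict[str, List[str]],
) -> List[str]:
    """Validate a (values, feature_names, cell_names) tuple.

    Returns the marker proteins that are absent from the feature names.
    Hypothetical helper for illustration; it is not part of astir.
    """
    values, features, cells = expr_input
    if len(values) != len(cells):
        raise ValueError("one row of values is needed per cell name")
    for row in values:
        if len(row) != len(features):
            raise ValueError("one column of values is needed per feature name")
    known = set(features)
    return sorted({p for ps in marker_dict.values() for p in ps if p not in known})


# Rows are cells, columns are features (proteins). All names are made up.
values = [[1.2, 0.1, 3.4], [0.3, 2.2, 0.0]]
features = ["CD3", "CD20", "CD45"]
cells = ["cell_1", "cell_2"]
marker_dict = {"T cell": ["CD3", "CD45"], "B cell": ["CD20", "CD45"]}

missing = check_inputs((values, features, cells), marker_dict)


def build_dataset():
    """Construct an SCDataset from the inputs above (requires astir and numpy)."""
    import numpy as np
    from astir.data import SCDataset

    return SCDataset(
        expr_input=(np.asarray(values), features, cells),
        marker_dict=marker_dict,
        include_other_column=True,
        design=None,
    )
```

Checking the marker names up front is a design choice of this sketch: how SCDataset itself reacts to markers that are not among the features is not stated here.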

Methods:

get_cell_names()

Get the cell names.

get_classes()

Get the cell types/states.

get_design()

Get the design matrix.

get_dtype()

Get the dtype of the SCDataset.

get_exprs()

Return the expression data as a torch.Tensor.

get_exprs_df()

Return the expression data as a pandas.DataFrame.

get_features()

Get the features (proteins).

get_marker_mat()

Return the marker matrix as a torch.Tensor.

get_mu()

Get the mean expression of each protein as a torch.Tensor.

get_mu_init([n_putative_cells])

Intelligent initialization for mu parameters

get_n_cells()

Get the number of cells.

get_n_classes()

Get the number of ‘classes’: either the number of cell types or cell states.

get_n_features()

Get the number of features (proteins).

get_sigma()

Get the standard deviation of each protein

normalize([percentile_lower, …])

Normalize the expression data

rescale()

Normalize the expression data.

get_cell_names()[source]

Get the cell names.

Returns

The cell names (self._cell_names)

Return type

List[str]

get_classes()[source]

Get the cell types/states.

Returns

The cell types/states (self._classes)

Return type

List[str]

get_design()[source]

Get the design matrix.

Returns

The design matrix (self._design)

Return type

torch.Tensor

get_dtype()[source]

Get the dtype of the SCDataset.

Returns

The dtype of the dataset (self._dtype)

Return type

torch.dtype

get_exprs()[source]

Return the expression data as a torch.Tensor.

Return type

Tensor

get_exprs_df()[source]

Return the expression data as a pandas.DataFrame.

Return type

DataFrame

get_features()[source]

Get the features (proteins).

Returns

The feature names (self._m_features)

Return type

List[str]

get_marker_mat()[source]

Return the marker matrix as a torch.Tensor.

Return type

Tensor

get_mu()[source]

Get the mean expression of each protein as a torch.Tensor.

Return type

Tensor

get_mu_init(n_putative_cells=10)[source]

Intelligent initialization for mu parameters

See manuscript for details

Parameters

n_putative_cells (int) – Number of cells to guess as given cell type

Return type

ndarray

get_n_cells()[source]

Get the number of cells.

Return type

int

get_n_classes()[source]

Get the number of ‘classes’: either the number of cell types or cell states.

Return type

int

get_n_features()[source]

Get the number of features (proteins).

Return type

int

get_sigma()[source]

Get the standard deviation of each protein

Return type

Tensor

Returns

standard deviation of each protein

normalize(percentile_lower=0, percentile_upper=99.9, cofactor=5.0)[source]

Normalize the expression data

This performs a two-step normalization:

  1. Applies a log(1+x) transformation to the data

  2. Winsorizes to (percentile_lower, percentile_upper)

Parameters
  • percentile_lower (float) – the lower bound percentile for winsorization, defaults to 0

  • percentile_upper (float) – the upper bound percentile for winsorization, defaults to 99.9

  • cofactor (float) – a cofactor constant, defaults to 5.0

Return type

None
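The two steps described for normalize can be sketched in plain Python for a single protein column. This is only an illustration of log(1+x) followed by winsorization, not the code astir runs: the helper names percentile and normalize_column are hypothetical, the percentile uses linear interpolation as an assumption, and cofactor is left out because its role is not described above.

```python
import math
from typing import List, Sequence


def percentile(sorted_vals: Sequence[float], q: float) -> float:
    """Linearly interpolated percentile (q in [0, 100]) of a sorted sequence."""
    if not sorted_vals:
        raise ValueError("cannot take a percentile of an empty column")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def normalize_column(
    values: Sequence[float],
    percentile_lower: float = 0.0,
    percentile_upper: float = 99.9,
) -> List[float]:
    """Apply log(1+x), then clip to the chosen percentile bounds."""
    # Step 1: log(1+x) transformation.
    logged = [math.log1p(v) for v in values]
    # Step 2: winsorize, i.e. clip extreme values to the percentile bounds.
    ordered = sorted(logged)
    low = percentile(ordered, percentile_lower)
    high = percentile(ordered, percentile_upper)
    return [min(max(v, low), high) for v in logged]


# A single outlier (1000.0) is pulled down to the upper bound.
column = normalize_column([0.0, 1.0, 3.0, 1000.0], percentile_upper=50)
```

Winsorizing clips outliers to the bounds rather than dropping them, so the number of cells is unchanged.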

rescale()[source]

Normalize the expression data.

Return type

None

astir.data.from_anndata_yaml(anndata_file, marker_yaml, protein_name=None, cell_name=None, batch_name='batch', create_design_mat=True, random_seed=1234, dtype=torch.float64)[source]

Create an Astir object from an anndata.AnnData file and a marker YAML

Parameters
  • anndata_file (str) – Path to an anndata.AnnData h5py file

  • marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.

  • protein_name (Optional[str]) – The column of adata.var containing protein names. If this is none, defaults to adata.var_names

  • cell_name (Optional[str]) – The column of adata.obs containing cell names. If this is none, defaults to adata.obs_names

  • batch_name (str) – The column of adata.obs containing batch names. A design matrix will be built using this (if present) using a one-hot encoding to control for batch, defaults to ‘batch’

  • create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.

  • random_seed (int) – The random seed to be used to initialize variables, defaults to 1234

  • dtype (dtype) – datatype of the model parameters, defaults to torch.float64

Return type

Any

Returns

An object of class astir_bash.py.Astir using data imported from the anndata file
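The batch_name parameter above mentions a one-hot encoded design matrix. The sketch below shows what one-hot encoding of batch labels means in general; one_hot_design is a hypothetical helper, and the column order (sorted labels) and the absence of an intercept column are assumptions of this example rather than documented behaviour of from_anndata_yaml.

```python
from typing import List, Sequence, Tuple


def one_hot_design(batches: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (column labels, design rows) with one indicator column per batch."""
    levels = sorted(set(batches))
    column = {level: j for j, level in enumerate(levels)}
    rows = []
    for batch in batches:
        row = [0] * len(levels)
        row[column[batch]] = 1  # mark the batch this cell belongs to
        rows.append(row)
    return levels, rows


# One row per cell, one column per batch.
levels, design = one_hot_design(["batch_a", "batch_b", "batch_a"])
```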

astir.data.from_csv_dir_yaml(input_dir, marker_yaml, create_design_mat=True, random_seed=1234, dtype=torch.float64)[source]

Create an Astir object a directory containing multiple csv files

Parameters
  • input_dir (str) – Path to a directory containing multiple CSV files, each in the format expected by from_csv_yaml

  • marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.

  • design_csv – Path to design matrix as a CSV. Rows should be cells, and columns covariates. First column is cell identifier, and additional column names are covariate identifiers

  • create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.

  • random_seed (int) – The random seed to be used to initialize variables, defaults to 1234

  • dtype (dtype) – datatype of the model parameters, defaults to torch.float64

Return type

Any

astir.data.from_csv_yaml(csv_input, marker_yaml, design_csv=None, create_design_mat=True, random_seed=1234, dtype=torch.float64)[source]

Create an Astir object from an expression CSV and marker YAML

Parameters
  • csv_input (str) – Path to input CSV containing expression for cells (rows) by proteins (columns). The first column is the cell identifier, and the remaining column names are protein identifiers.

  • marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.

  • design_csv (Optional[str]) – Path to design matrix as a CSV. Rows should be cells, and columns covariates. First column is cell identifier, and additional column names are covariate identifiers.

  • create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.

  • random_seed (int) – The random seed to be used to initialize variables, defaults to 1234

  • dtype (dtype) – datatype of the model parameters, defaults to torch.float64

Return type

Any
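To make the expected file layout concrete, the sketch below writes a small expression CSV (cells as rows, first column the cell identifier) and a marker YAML with cell_type and cell_state entries. The protein names, cell type names, file names, and the exact nesting inside the YAML are assumptions made for illustration; load_with_astir is defined but not called, since it needs astir installed.

```python
import csv
import tempfile
from pathlib import Path

# Assumed marker layout: each cell type/state maps to a list of proteins.
MARKER_YAML = """\
cell_type:
  T cell:
    - CD3
    - CD45
  B cell:
    - CD20
    - CD45
cell_state:
  proliferating:
    - Ki67
"""


def write_example_inputs(directory: Path):
    """Write a toy expression CSV and marker YAML; return both paths."""
    csv_path = directory / "expression.csv"
    yaml_path = directory / "markers.yml"
    header = ["", "CD3", "CD20", "CD45", "Ki67"]  # first column: cell identifier
    rows = [
        ["cell_1", 1.2, 0.1, 3.4, 0.0],
        ["cell_2", 0.3, 2.2, 2.9, 1.1],
    ]
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    yaml_path.write_text(MARKER_YAML)
    return csv_path, yaml_path


def load_with_astir(csv_path: Path, yaml_path: Path):
    """Build an Astir object from the files above (requires astir)."""
    from astir.data import from_csv_yaml

    return from_csv_yaml(
        str(csv_path), str(yaml_path), create_design_mat=True, random_seed=1234
    )
```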

astir.data.from_loompy_yaml(loom_file, marker_yaml, protein_name_attr='protein', cell_name_attr='cell_name', batch_name_attr='batch', create_design_mat=True, random_seed=1234, dtype=torch.float64)[source]

Create an Astir object from a loom file and a marker yaml

Parameters
  • loom_file (str) – Path to a loom file, where rows correspond to proteins and columns to cells

  • marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.

  • protein_name_attr (str) – The attribute (key) in the row attributes that identifies the protein names (required to match with the marker gene information), defaults to protein

  • cell_name_attr (str) – The attribute (key) in the column attributes that identifies the name of each cell, defaults to cell_name

  • batch_name_attr (str) – The attribute (key) in the column attributes that identifies the batch. A design matrix will be built using this (if present) using a one-hot encoding to control for batch, defaults to batch

  • create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.

  • random_seed (int) – The random seed to be used to initialize variables, defaults to 1234

  • dtype (dtype) – datatype of the model parameters, defaults to torch.float64

Return type

Any

Returns

An object of class astir_bash.py.Astir using data imported from the loom file
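Because the loom file stores proteins as rows and cells as columns, a cells-by-proteins table has to be transposed before it is written. The sketch below shows that rearrangement together with attribute dictionaries keyed by the default attribute names from the signature above ('protein', 'cell_name', 'batch'). The helper to_loom_layout and the example values are hypothetical, and writing the actual loom file is left out.

```python
from typing import Dict, List, Sequence, Tuple


def to_loom_layout(
    cells_by_proteins: Sequence[Sequence[float]],
    protein_names: Sequence[str],
    cell_names: Sequence[str],
    batches: Sequence[str],
) -> Tuple[List[List[float]], Dict[str, List[str]], Dict[str, List[str]]]:
    """Transpose to proteins-by-cells and build row/column attribute dicts."""
    if len(cells_by_proteins) != len(cell_names) or len(batches) != len(cell_names):
        raise ValueError("one row of values and one batch label are needed per cell")
    # zip(*rows) turns the rows of the input into the columns of the output.
    matrix = [list(column) for column in zip(*cells_by_proteins)]
    if len(matrix) != len(protein_names):
        raise ValueError("one column of values is needed per protein")
    row_attrs = {"protein": list(protein_names)}
    col_attrs = {"cell_name": list(cell_names), "batch": list(batches)}
    return matrix, row_attrs, col_attrs


matrix, row_attrs, col_attrs = to_loom_layout(
    [[1.2, 0.1], [0.3, 2.2], [0.5, 0.7]],
    protein_names=["CD3", "CD20"],
    cell_names=["cell_1", "cell_2", "cell_3"],
    batches=["batch_a", "batch_a", "batch_b"],
)
```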