astir.data package

Module contents

Classes:

SCDataset(expr_input, marker_dict, …[, …])

Container for single-cell proteomic data in the form of a pytorch dataset

Functions:

`from_anndata_yaml`(anndata_file, marker_yaml)	Create an Astir object from an `anndata.AnnData` file and a marker YAML
`from_csv_dir_yaml`(input_dir, marker_yaml[, …])	Create an Astir object from a directory containing multiple CSV files
`from_csv_yaml`(csv_input, marker_yaml[, …])	Create an Astir object from an expression CSV and marker YAML
`from_loompy_yaml`(loom_file, marker_yaml[, …])	Create an Astir object from a loom file and a marker yaml

class astir.data.SCDataset(expr_input, marker_dict, include_other_column, design=None, dtype=torch.float64, device=device(type='cpu'))

Bases: Generic[torch.utils.data.dataset.T_co]

Container for single-cell proteomic data in the form of a pytorch dataset

Parameters

expr_input (Union[DataFrame, Tuple[Union[array, Tensor], List[str], List[str]]]) – Input expression data, given as either a pd.DataFrame or a three-element tuple. As a pd.DataFrame, its index should hold the cell names and its columns the feature names. As a tuple, it should have the form Tuple[Union[np.array, torch.Tensor], List[str], List[str]]: the first element is the expression data as a np.array or torch.Tensor, the second is the list of column (feature) names, and the third is the list of index (cell) names.
marker_dict (Dict[str, List[str]]) – Marker dictionary containing cell type and state information. It maps the name of each cell type/state to its protein features.
design (Union[DataFrame, array, None]) – A design matrix
include_other_column (bool) – Should an additional ‘other’ column be included?
dtype (dtype) – torch datatype of the model
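To make the shapes of `expr_input` and `marker_dict` concrete, here is a minimal sketch in plain Python. The protein names, cell names, cell types and the helper `build_marker_matrix` are hypothetical and not part of Astir. The binary features-by-classes layout, and the idea that an 'other' class has no markers, are assumptions made for illustration rather than a description of Astir's internals.

```python
from typing import Dict, List, Tuple

# Hypothetical three-element tuple form of expr_input:
# (values, feature names, cell names); rows are cells, columns are features.
# In real use the first element would be a np.array or torch.Tensor.
values: List[List[float]] = [
    [5.1, 0.2, 0.0],
    [0.3, 4.8, 1.1],
]
features: List[str] = ["CD45", "E-Cadherin", "Ki-67"]
cells: List[str] = ["cell_1", "cell_2"]
expr_input: Tuple[List[List[float]], List[str], List[str]] = (values, features, cells)

# Hypothetical marker dictionary: cell type name -> marker proteins.
marker_dict: Dict[str, List[str]] = {
    "Immune": ["CD45"],
    "Epithelial": ["E-Cadherin"],
}


def build_marker_matrix(
    marker_dict: Dict[str, List[str]],
    features: List[str],
    include_other_column: bool,
) -> Tuple[List[List[int]], List[str]]:
    """Return a features x classes 0/1 matrix and the class names (illustrative only)."""
    classes = list(marker_dict)
    if include_other_column:
        # Assumption: the extra 'Other' class has no marker proteins.
        classes = classes + ["Other"]
    matrix = [
        [1 if feat in marker_dict.get(cls, []) else 0 for cls in classes]
        for feat in features
    ]
    return matrix, classes


marker_mat, classes = build_marker_matrix(marker_dict, features, include_other_column=True)
for feat, row in zip(features, marker_mat):
    print(feat, row)
```

Each row corresponds to one feature and each column to one class, so a 1 marks a protein listed as a marker for that class.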

Methods:

`get_cell_names`()	Get the cell names.
`get_classes`()	Get the cell types/states.
`get_design`()	Get the design matrix.
`get_dtype`()	Get the dtype of the SCDataset.
`get_exprs`()	Return the expression data as a `torch.Tensor`.
`get_exprs_df`()	Return the expression data as a `pandas.DataFrame`.
`get_features`()	Get the features (proteins).
`get_marker_mat`()	Return the marker matrix as a `torch.Tensor`.
`get_mu`()	Get the mean expression of each protein as a `torch.Tensor`.
`get_mu_init`([n_putative_cells])	Intelligent initialization for mu parameters
`get_n_cells`()	Get the number of cells.
`get_n_classes`()	Get the number of ‘classes’: either the number of cell types or cell states.
`get_n_features`()	Get the number of features (proteins).
`get_sigma`()	Get the standard deviation of each protein
`normalize`([percentile_lower, …])	Normalize the expression data
`rescale`()	Normalize the expression data.

get_cell_names()

Get the cell names.

Returns: self._cell_names
Return type: List[str]

get_classes()

Get the cell types/states.

Returns: self._classes
Return type: List[str]

get_design()

Get the design matrix.

Returns: self._design
Return type: torch.Tensor

get_dtype()

Get the dtype of the SCDataset.

Returns: self._dtype
Return type: torch.dtype

get_exprs()

Return the expression data as a torch.Tensor.

Return type: Tensor

get_exprs_df()

Return the expression data as a pandas.DataFrame.

Return type: DataFrame

get_features()

Get the features (proteins).

Returns: self._m_features
Return type: List[str]

get_marker_mat()

Return the marker matrix as a torch.Tensor.

Return type: Tensor

get_mu()

Get the mean expression of each protein as a torch.Tensor.

Return type: Tensor

get_mu_init(n_putative_cells=10)

Intelligent initialization for mu parameters

See manuscript for details

Parameters: n_putative_cells (int) – Number of cells to guess as given cell type
Return type: ndarray

get_n_cells()

Get the number of cells.

Return type: int

get_n_classes()

Get the number of ‘classes’: either the number of cell types or cell states.

Return type: int

get_n_features()

Get the number of features (proteins).

Return type: int

get_sigma()

Get the standard deviation of each protein

Return type: Tensor
Returns: standard deviation of each protein

normalize(percentile_lower=0, percentile_upper=99.9, cofactor=5.0)

Normalize the expression data

This performs a two-step normalization:

1. A log(1+x) transformation of the data
2. Winsorization to (percentile_lower, percentile_upper)

Parameters

percentile_lower (float) – the lower bound percentile for winsorization, defaults to 0
percentile_upper – the upper bound percentile for winsorization, defaults to 99.9
cofactor (float) – a cofactor constant, defaults to 5.0

Return type

None
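The two steps described for `normalize` can be sketched in plain Python as below. This is an illustration under stated assumptions, not Astir's implementation: the helpers `percentile` and `normalize_column` are hypothetical, winsorization is assumed to be applied per feature (one column at a time), percentiles are computed with linear interpolation, and `cofactor` is left out because the description above does not say how it enters the transformation.

```python
import math
from typing import List


def percentile(sorted_vals: List[float], q: float) -> float:
    """Percentile q in [0, 100] of pre-sorted values, with linear interpolation."""
    if not sorted_vals:
        raise ValueError("cannot take a percentile of an empty list")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def normalize_column(
    column: List[float],
    percentile_lower: float = 0.0,
    percentile_upper: float = 99.9,
) -> List[float]:
    """Step 1: log(1+x) transform. Step 2: winsorize to the given percentiles."""
    logged = [math.log1p(x) for x in column]
    ordered = sorted(logged)
    q_low = percentile(ordered, percentile_lower)
    q_high = percentile(ordered, percentile_upper)
    # Winsorizing clips values to [q_low, q_high] instead of dropping them.
    return [min(max(v, q_low), q_high) for v in logged]


# One hypothetical protein measured in five cells, with one extreme value.
raw = [0.0, 1.0, 2.0, 3.0, 1000.0]
normalized = normalize_column(raw, percentile_lower=0.0, percentile_upper=75.0)
print(normalized)
```

With an upper percentile of 75 on five values, the extreme measurement is clipped down to the transformed value of the fourth-largest entry, while the number of cells is unchanged.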

rescale()

Normalize the expression data.

Return type: None

astir.data.from_anndata_yaml(anndata_file, marker_yaml, protein_name=None, cell_name=None, batch_name='batch', create_design_mat=True, random_seed=1234, dtype=torch.float64)

Create an Astir object from an anndata.AnnData file and a marker YAML

Parameters

anndata_file (str) – Path to an anndata.AnnData h5py file
marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.
protein_name (Optional[str]) – The column of adata.var containing protein names. If None, defaults to adata.var_names
cell_name (Optional[str]) – The column of adata.obs containing cell names. If None, defaults to adata.obs_names
batch_name (str) – The column of adata.obs containing batch names. A design matrix will be built using this (if present) using a one-hot encoding to control for batch, defaults to ‘batch’
create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.
random_seed (int) – The random seed to be used to initialize variables, defaults to 1234
dtype (dtype) – datatype of the model parameters, defaults to torch.float64

Return type

Any

Returns

An object of class astir_bash.py.Astir using data imported from the AnnData file
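The one-hot encoding of batch labels mentioned for `batch_name` can be illustrated with a short plain-Python sketch. The helper `one_hot_design` and the batch labels are hypothetical. Whether Astir also adds an intercept column or drops a reference batch is not stated here, so this sketch simply emits one indicator column per distinct batch.

```python
from typing import List, Tuple


def one_hot_design(batches: List[str]) -> Tuple[List[List[int]], List[str]]:
    """Build a cells x batches 0/1 matrix: one row per cell, one column per batch."""
    levels = sorted(set(batches))
    index = {level: j for j, level in enumerate(levels)}
    design = []
    for label in batches:
        row = [0] * len(levels)
        row[index[label]] = 1  # exactly one indicator is set per cell
        design.append(row)
    return design, levels


# Hypothetical batch label for each of four cells.
batch_labels = ["run_A", "run_B", "run_A", "run_C"]
design, levels = one_hot_design(batch_labels)
for label, row in zip(batch_labels, design):
    print(label, row)
```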

astir.data.from_csv_dir_yaml(input_dir, marker_yaml, create_design_mat=True, random_seed=1234, dtype=torch.float64)

Create an Astir object from a directory containing multiple CSV files

Parameters

input_dir (str) – Path to a directory containing multiple CSV files, each in the format expected by from_csv_yaml
marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.
design_csv – Path to design matrix as a CSV. Rows should be cells, and columns covariates. First column is cell identifier, and additional column names are covariate identifiers
create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.
random_seed (int) – The random seed to be used to initialize variables, defaults to 1234
dtype (dtype) – datatype of the model parameters, defaults to torch.float64

Return type

Any

astir.data.from_csv_yaml(csv_input, marker_yaml, design_csv=None, create_design_mat=True, random_seed=1234, dtype=torch.float64)

Create an Astir object from an expression CSV and marker YAML

Parameters

csv_input (str) – Path to input csv containing expression for cells (rows) by proteins (columns). First column is cell identifier, and additional column names are gene identifiers.
marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.
design_csv (Optional[str]) – Path to design matrix as a CSV. Rows should be cells, and columns covariates. First column is cell identifier, and additional column names are covariate identifiers.
create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.
random_seed (int) – The random seed to be used to initialize variables, defaults to 1234
dtype (dtype) – datatype of the model parameters, defaults to torch.float64

Return type

Any
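As a rough picture of the expression CSV layout described for `csv_input` (cells as rows, first column the cell identifier, remaining columns one per protein), the sketch below parses a small in-memory CSV with the standard library. The file contents, the `cell_id` header and the helper `read_expression_csv` are hypothetical, and this is not how Astir itself reads the file. The result has the same three-part shape (values, feature names, cell names) as the tuple form of `expr_input`.

```python
import csv
import io
from typing import List, TextIO, Tuple

# Hypothetical expression CSV: first column is the cell identifier,
# the remaining column names identify the measured proteins.
csv_text = """cell_id,CD45,E-Cadherin
cell_1,5.1,0.2
cell_2,0.3,4.8
"""


def read_expression_csv(handle: TextIO) -> Tuple[List[List[float]], List[str], List[str]]:
    """Split a cells x proteins CSV into (values, feature names, cell names)."""
    reader = csv.reader(handle)
    header = next(reader)
    features = header[1:]  # skip the cell identifier column
    values: List[List[float]] = []
    cells: List[str] = []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        cells.append(row[0])
        values.append([float(x) for x in row[1:]])
    return values, features, cells


values, features, cells = read_expression_csv(io.StringIO(csv_text))
print(features)
print(cells)
print(values)
```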

astir.data.from_loompy_yaml(loom_file, marker_yaml, protein_name_attr='protein', cell_name_attr='cell_name', batch_name_attr='batch', create_design_mat=True, random_seed=1234, dtype=torch.float64)

Create an Astir object from a loom file and a marker yaml

Parameters

loom_file (str) – Path to a loom file, where rows correspond to proteins and columns to cells
marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.
protein_name_attr (str) – The attribute (key) in the row attributes that identifies the protein names (required to match with the marker gene information), defaults to protein
cell_name_attr (str) – The attribute (key) in the column attributes that identifies the name of each cell, defaults to cell_name
batch_name_attr (str) – The attribute (key) in the column attributes that identifies the batch. A design matrix will be built using this (if present) using a one-hot encoding to control for batch, defaults to batch
create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.
random_seed (int) – The random seed to be used to initialize variables, defaults to 1234
dtype (dtype) – datatype of the model parameters, defaults to torch.float64

Return type

Any

Returns

An object of class astir_bash.py.Astir using data imported from the loom files