astir package

Module contents

class astir.Astir(input_expr=None, marker_dict=None, design=None, random_seed=1234, dtype=torch.float64)

Bases: object

Create an Astir object

Parameters

input_expr (Union[pd.DataFrame, Tuple[np.array, List[str], List[str]], Tuple[SCDataset, SCDataset]]) – the single cell protein expression dataset
marker_dict (Dict[str, Dict[str, str]], optional) – the marker dictionary which maps cell type/state to protein features, defaults to None
design (Union[pd.DataFrame, np.array], optional) – the design matrix labeling the grouping of cell, defaults to None
random_seed (int, optional) – random seed for parameter initialization, defaults to 1234
dtype (torch.dtype, optional) – dtype of data, defaults to torch.float64

Raises

NotClassifiableError – raised if the model is not trainable

Methods:

`assign_celltype_hierarchy`([depth])	Get cell type assignment at a specified higher hierarchy according to the hierarchy provided
`diagnostics_cellstate`()	Run diagnostics on cell state assignments
`diagnostics_celltype`([threshold, alpha])	Run diagnostics on cell type assignments
`fit_state`([max_epochs, learning_rate, …])	Run Variational Bayes to infer cell states
`fit_type`([max_epochs, learning_rate, …])	Run Variational Bayes to infer cell types
`get_cellstates`()	Get cell state activations.
`get_celltype_probabilities`()	Get the cell assignment probability.
`get_celltypes`([threshold, assignment_type])	Get the most likely cell type
`get_hierarchy_dict`()	Get the dictionary for cell type hierarchical structure.
`get_state_dataset`()	Get the SCDataset for cell state training.
`get_state_losses`()	Getter for losses
`get_state_model`()	Get the trained CellStateModel.
`get_state_run_info`()	Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs, delta_loss_batch) of the cell state training.
`get_type_dataset`()	Get the SCDataset for cell type training.
`get_type_losses`()	Get the final losses of the type model.
`get_type_model`()	Get the trained CellTypeModel.
`get_type_run_info`()	Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs) of the cell type training.
`load_model`(hdf5_name)	Load model from hdf5 file
`normalize`([percentile_lower, percentile_upper])	Normalize the expression data
`predict_cellstates`([dset])	Predict cell state activations for a dataset using an existing model
`predict_celltypes`([dset])	Predict the probabilities of different cell type assignments.
`save_models`([hdf5_name])	Save the summary of this model to an hdf5 file.
`state_to_csv`(output_csv)	Write the state assignments produced by training the state model to a csv file
`type_clustermap`([plot_name, threshold, …])	Save the heatmap of protein content in cells with cell types labeled.
`type_to_csv`(output_csv[, threshold, …])	Save the cell type assignment to a csv file.

assign_celltype_hierarchy(depth=1)

Get cell type assignment at a specified higher hierarchy according to the hierarchy provided in the dictionary.

Parameters: depth (int, optional) – the depth of hierarchy to assign probability to, defaults to 1
Raises: Exception – raised when the dictionary for hierarchical structure is not provided or the model hasn’t been trained.
Returns: probability assignment of cell type at a superstructure
Return type: pd.DataFrame
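
A minimal sketch of what rolling probabilities up one level of the hierarchy could look like. It assumes (the documentation above does not confirm this) that the probability of a higher-level group is the sum of the probabilities of its member cell types. The function name `roll_up_probabilities`, the example hierarchy, and the probabilities are hypothetical, and plain dictionaries stand in for the pd.DataFrame that the method returns.

```python
from typing import Dict, List


def roll_up_probabilities(
    cell_probs: Dict[str, Dict[str, float]],
    hierarchy: Dict[str, List[str]],
) -> Dict[str, Dict[str, float]]:
    """Sum per-cell-type probabilities into their parent groups.

    Assumption: a parent's probability is the sum of its children's.
    """
    rolled = {}
    for cell_id, probs in cell_probs.items():
        rolled[cell_id] = {
            parent: sum(probs.get(child, 0.0) for child in children)
            for parent, children in hierarchy.items()
        }
    return rolled


# Hypothetical hierarchy in the Dict[str, List[str]] shape of get_hierarchy_dict()
hierarchy = {"Immune": ["T cell", "B cell"], "Stromal": ["Fibroblast"]}
# Made-up per-cell probabilities over the leaf cell types
cell_probs = {
    "cell_1": {"T cell": 0.6, "B cell": 0.3, "Fibroblast": 0.1},
    "cell_2": {"T cell": 0.05, "B cell": 0.05, "Fibroblast": 0.9},
}
rolled = roll_up_probabilities(cell_probs, hierarchy)
```

Here `rolled["cell_1"]` gives "Immune" most of the probability mass, because both of its leaf types belong to that group.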

diagnostics_cellstate()

Run diagnostics on cell state assignments

This performs a basic test by comparing the correlation values between all marker genes and all non-marker genes. It detects cases where a non-marker gene has a higher correlation than the smallest correlation among the marker genes. The steps are:

1. Get correlations between all cell states and proteins.
2. For each cell state c, get the smallest correlation with a marker g.
3. For each cell state c and each of its non-markers g, find any correlation that is larger than the smallest marker correlation for c.
4. Any (c, g) pairs found in step 3 are included in the output of Astir.diagnostics_cellstate(), along with an explanation.

Return type: DataFrame
Returns: diagnostics
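
The comparison described above can be sketched in a few lines. This is an illustration of the logic only, not the library's implementation: the correlations are assumed to be already computed, the function name `flag_nonmarker_correlations` is hypothetical, and the numbers are made up.

```python
from typing import Dict, List, Tuple


def flag_nonmarker_correlations(
    corr: Dict[str, Dict[str, float]],
    markers: Dict[str, List[str]],
) -> List[Tuple[str, str, float, float]]:
    """Return (state, protein, correlation, smallest marker correlation) rows."""
    flagged = []
    for state, protein_corr in corr.items():
        # Smallest correlation between this state and its own markers
        min_marker_corr = min(protein_corr[g] for g in markers[state])
        # Non-markers that correlate more strongly than that minimum
        for protein, value in protein_corr.items():
            if protein not in markers[state] and value > min_marker_corr:
                flagged.append((state, protein, value, min_marker_corr))
    return flagged


# Made-up correlations between cell states and proteins
corr = {
    "hypoxia": {"CA9": 0.8, "HIF1A": 0.4, "Ki67": 0.5, "CD3": 0.1},
    "proliferation": {"CA9": 0.2, "HIF1A": 0.1, "Ki67": 0.9, "CD3": 0.0},
}
markers = {"hypoxia": ["CA9", "HIF1A"], "proliferation": ["Ki67"]}
flagged = flag_nonmarker_correlations(corr, markers)
```

With this toy input, only the pair ("hypoxia", "Ki67") is flagged, because 0.5 exceeds the smallest marker correlation for that state (0.4).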

diagnostics_celltype(threshold=0.5, alpha=0.01)

Run diagnostics on cell type assignments

This performs a basic test that cell types express their markers at higher levels than in other cell types. This function performs the following steps:

1. Iterate through every cell type and every marker for that cell type.
2. Given a cell type c and marker g, find the set of cell types D that don’t have g as a marker.
3. For each cell type d in D, perform a t-test between the expression of marker g in c vs d.
4. If g is not expressed significantly higher (at significance alpha), output a diagnostic explaining this for further investigation.

Parameters

threshold (float) – The threshold at which cell types are assigned (see get_celltypes)
alpha (float) – The significance threshold for t-tests for determining over-expression

Return type

DataFrame

Returns

A pd.DataFrame listing cell types whose markers aren’t expressed significantly higher.
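
As a rough sketch of the procedure described for this method, the code below enumerates the (c, g, d) comparisons and computes a t statistic for one of them. The helper names `comparisons` and `welch_t` are hypothetical, the marker sets and expression values are made up, and the use of Welch's unequal-variance statistic is an assumption: the documentation only says "t-test". Turning the statistic into a p-value to compare against alpha would additionally need the t distribution, which is left out here.

```python
from statistics import mean, variance
from typing import Dict, List, Tuple


def comparisons(markers: Dict[str, List[str]]) -> List[Tuple[str, str, str]]:
    """List (cell type c, marker g, other cell type d) triples to test."""
    triples = []
    for c, c_markers in markers.items():
        for g in c_markers:
            # D: cell types that do not list g as a marker (this excludes c)
            for d, d_markers in markers.items():
                if g not in d_markers:
                    triples.append((c, g, d))
    return triples


def welch_t(x: List[float], y: List[float]) -> float:
    """Welch's t statistic for mean(x) - mean(y); positive means x is higher."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    return (mean(x) - mean(y)) / se


# Made-up marker sets; CD45 is shared by T cells and B cells
markers = {
    "T cell": ["CD3", "CD45"],
    "B cell": ["CD20", "CD45"],
    "Epithelial": ["E-cadherin"],
}
triples = comparisons(markers)

# Made-up expression of one marker in cells assigned to c and to d
t_stat = welch_t([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
```

Because CD45 is a marker for both T cells and B cells, those two types are never compared against each other on CD45.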

fit_state(max_epochs=50, learning_rate=0.001, batch_size=128, delta_loss=0.001, n_init=5, n_init_epochs=5, delta_loss_batch=10, const=2, dropout_rate=0, batch_norm=False)

Run Variational Bayes to infer cell states

Parameters

max_epochs (int) – number of epochs, defaults to 50
learning_rate (float) – the learning rate, defaults to 1e-3
n_init (int) – the number of initial parameters to compare, defaults to 5
delta_loss (float) – stops iteration once the loss rate reaches delta_loss, defaults to 0.001
delta_loss_batch (int) – the batch size to consider delta loss, defaults to 10

Return type

None

fit_type(max_epochs=50, learning_rate=0.001, batch_size=128, delta_loss=0.001, n_init=5, n_init_epochs=5)

Run Variational Bayes to infer cell types

Parameters

max_epochs (int) – maximum number of epochs to train
learning_rate (float) – ADAM optimizer learning rate
batch_size (int) – minibatch size
delta_loss (float) – stops iteration once the loss rate reaches delta_loss, defaults to 0.001
n_init (int) – number of random initializations

Return type

None

get_cellstates()

Get cell state activations. Returns the rescaled activations, with values between 0 and 1.

Returns: state assignments
Return type: pd.DataFrame

get_celltype_probabilities()

Get the cell assignment probability.

Returns: self.assignments
Return type: pd.DataFrame

get_celltypes(threshold=0.7, assignment_type='threshold')

Get the most likely cell type

A cell is assigned to a cell type if the probability is greater than threshold. If no cell types have a probability higher than threshold, then “Unknown” is returned.

Parameters

threshold (float) – the probability threshold above which a cell is assigned to a cell type
assignment_type (str) – See astir.CellTypeModel.get_celltypes() for full documentation

Return type

DataFrame

Returns

a data frame with the most likely cell type for each cell
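
The thresholding rule described for this method can be illustrated without the library. The sketch below covers only the behaviour stated in this section (assign the most probable type if its probability is greater than threshold, otherwise "Unknown"); the function name `assign_celltype` and the probabilities are hypothetical, and other assignment_type values are not covered.

```python
from typing import Dict


def assign_celltype(probs: Dict[str, float], threshold: float = 0.7) -> str:
    """Return the most probable cell type, or "Unknown" if it is not above threshold."""
    best_type = max(probs, key=probs.get)
    return best_type if probs[best_type] > threshold else "Unknown"


# Made-up assignment probabilities for two cells
cells = {
    "cell_1": {"T cell": 0.92, "B cell": 0.05, "Stromal": 0.03},
    "cell_2": {"T cell": 0.55, "B cell": 0.40, "Stromal": 0.05},
}
assignments = {cell_id: assign_celltype(p) for cell_id, p in cells.items()}
```

In this toy input, cell_1 is assigned "T cell", while cell_2 has no type above 0.7 and is labeled "Unknown".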

get_hierarchy_dict()

Get the dictionary for cell type hierarchical structure.

Returns: self._hierarchy_dict
Return type: Dict[str, List[str]]

get_state_dataset()

Get the SCDataset for cell state training.

Return type: SCDataset
Returns: self._state_dset

get_state_losses()

Getter for losses

Returns: a numpy array of losses for each training iteration the model runs
Return type: np.array

get_state_model()

Get the trained CellStateModel.

Raises: Exception – raised when this function is called before the model is trained.
Return type: CellStateModel
Returns: self._state_ast

get_state_run_info()

Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs, delta_loss_batch) of the cell state training.

Raises: Exception – raised when this function is called before the model is trained.
Return type: Dict[str, Union[int, float]]
Returns: self._state_run_info

get_type_dataset()

Get the SCDataset for cell type training.

Return type: SCDataset
Returns: self._type_dset

get_type_losses()

Get the final losses of the type model.

Returns: self.losses
Return type: np.array

get_type_model()

Get the trained CellTypeModel.

Raises: Exception – raised when this function is called before the model is trained.
Return type: CellTypeModel
Returns: self._type_ast

get_type_run_info()

Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs) of the cell type training.

Raises: Exception – raised when this function is called before the model is trained.
Return type: Dict[str, Union[int, float]]
Returns: self._type_run_info

load_model(hdf5_name)

Load model from hdf5 file

Parameters: hdf5_name (str) – the full path to file
Return type: None

normalize(percentile_lower=1, percentile_upper=99)

Normalize the expression data

This performs a two-step normalization:

1. Applies a log(1+x) transformation to the data.
2. Winsorizes to (percentile_lower, percentile_upper).

Parameters

percentile_lower (int) – Lower percentile for winsorization
percentile_upper (int) – Upper percentile for winsorization

Return type

None
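
To make the two normalization steps concrete, here is a standard-library sketch applied to the values of a single feature. It is not the library's code: the helper names `percentile` and `normalize_feature` are hypothetical, applying the winsorization per feature is an assumption, and the linearly interpolated percentile is one of several common definitions.

```python
import math
from typing import List


def percentile(sorted_vals: List[float], q: float) -> float:
    """Linearly interpolated q-th percentile (0-100) of an ascending list."""
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def normalize_feature(
    values: List[float], lower: float = 1, upper: float = 99
) -> List[float]:
    """Apply log(1+x), then clip to the [lower, upper] percentile range."""
    logged = [math.log1p(v) for v in values]
    ordered = sorted(logged)
    lo_cut = percentile(ordered, lower)
    hi_cut = percentile(ordered, upper)
    # Winsorize: values outside the cut-offs are replaced by the cut-offs
    return [min(max(v, lo_cut), hi_cut) for v in logged]


# Made-up expression values with one extreme outlier
values = [0.0, 1.0, 2.0, 3.0, 1000.0]
# Wide percentiles are used here only so the effect shows on five points
normalized = normalize_feature(values, lower=25, upper=75)
```

The outlier 1000.0 is first compressed by the log transform and then clipped down to the upper cut-off, so it no longer dominates the feature's range.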

predict_cellstates(dset=None)

Predict cell state activations for a dataset using an existing model

Parameters: dset – the dataset to predict cell state activations for, defaults to None
Return type: DataFrame
Returns: the prediction of cell state activations

predict_celltypes(dset=None)

Predict the probabilities of different cell type assignments.

Parameters: dset (pd.DataFrame, optional) – the single cell protein expression dataset to predict, defaults to None
Raises: Exception – raised when this function is called before the type model is trained.
Returns: the probabilities of different cell type assignments
Return type: pd.DataFrame

save_models(hdf5_name='astir_summary.hdf5')

Save the summary of this model to an hdf5 file.

Parameters: hdf5_name (str) – name of the output hdf5 file, defaults to “astir_summary.hdf5”
Raises: Exception – raised when this function is called before the model is trained.
Return type: None

state_to_csv(output_csv)

Write the state assignments produced by training the state model to a csv file

Parameters: output_csv (str, required) – path to output csv
Return type: None

type_clustermap(plot_name='celltype_protein_cluster.png', threshold=0.7, figsize=(7, 5), prob_assign=None)

Save the heatmap of protein content in cells with cell types labeled.

Parameters

plot_name (str, optional) – name of the plot; an extension (e.g. .png or .jpg) is required, defaults to “celltype_protein_cluster.png”
threshold (float, optional) – the probability threshold above which a cell is assigned to a cell type, defaults to 0.7

Return type

None

type_to_csv(output_csv, threshold=0.7, assignment_type='threshold')

Save the cell type assignment to a csv file.

Parameters

output_csv (str) – name for the output .csv file
assignment_type (str) – See astir.CellTypeModel.get_celltypes() for full documentation

Return type

None