astir package

Module contents

class astir.Astir(input_expr=None, marker_dict=None, design=None, random_seed=1234, dtype=torch.float64)[source]

Bases: object

Create an Astir object

Parameters
  • input_expr (Union[pd.DataFrame, Tuple[np.array, List[str], List[str]], Tuple[SCDataset, SCDataset]]) – the single cell protein expression dataset

  • marker_dict (Dict[str, Dict[str, str]], optional) – the marker dictionary which maps cell type/state to protein features, defaults to None

  • design (Union[pd.DataFrame, np.array], optional) – the design matrix labeling the grouping of cell, defaults to None

  • random_seed (int, optional) – random seed for parameter initialization, defaults to 1234

  • dtype (torch.dtype, optional) – dtype of data, defaults to torch.float64

Raises

NotClassifiableError – raised if the model is not trainable

Methods:

assign_celltype_hierarchy([depth])

Get cell type assignment at a specified higher hierarchy according to the hierarchy provided in the dictionary.

diagnostics_cellstate()

Run diagnostics on cell state assignments

diagnostics_celltype([threshold, alpha])

Run diagnostics on cell type assignments

fit_state([max_epochs, learning_rate, …])

Run Variational Bayes to infer cell states

fit_type([max_epochs, learning_rate, …])

Run Variational Bayes to infer cell types

get_cellstates()

Get cell state activations.

get_celltype_probabilities()

Get the cell assignment probability.

get_celltypes([threshold, assignment_type])

Get the most likely cell type

get_hierarchy_dict()

Get the dictionary for cell type hierarchical structure.

get_state_dataset()

Get the SCDataset for cell state training.

get_state_losses()

Get the losses recorded while training the state model.

get_state_model()

Get the trained CellStateModel.

get_state_run_info()

Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs, delta_loss_batch) of the cell state training.

get_type_dataset()

Get the SCDataset for cell type training.

get_type_losses()

Get the final losses of the type model.

get_type_model()

Get the trained CellTypeModel.

get_type_run_info()

Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs) of the cell type training.

load_model(hdf5_name)

Load model from hdf5 file

normalize([percentile_lower, percentile_upper])

Normalize the expression data

predict_cellstates([dset])

Predict cell state activations for a dataset using an existing model

predict_celltypes([dset])

Predict the probabilities of different cell type assignments.

save_models([hdf5_name])

Save the summary of this model to an hdf5 file.

state_to_csv(output_csv)

Write the state assignments from the trained state model to a csv file

type_clustermap([plot_name, threshold, …])

Save the heatmap of protein content in cells with cell types labeled.

type_to_csv(output_csv[, threshold, …])

Save the cell type assignment to a csv file.

assign_celltype_hierarchy(depth=1)[source]

Get cell type assignment at a specified higher hierarchy according to the hierarchy provided in the dictionary.

Parameters

depth (int, optional) – the depth of hierarchy to assign probability to, defaults to 1

Raises

Exception – raised when the dictionary for hierarchical structure is not provided or the model hasn’t been trained.

Returns

probability assignment of cell type at a superstructure

Return type

pd.DataFrame
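
The following is a minimal sketch of how probabilities could be rolled up to a higher level, assuming that a parent's probability is the sum of its children's probabilities and that the hierarchy has the Dict[str, List[str]] shape returned by get_hierarchy_dict(). The function name aggregate_by_hierarchy, the cell type names, and the numbers are hypothetical; the sketch handles a single level only and is not the library's implementation.

```python
from typing import Dict, List


def aggregate_by_hierarchy(
    probabilities: List[Dict[str, float]],
    hierarchy: Dict[str, List[str]],
) -> List[Dict[str, float]]:
    """Sum per-cell-type probabilities into their parent groups (assumed rule)."""
    aggregated = []
    for cell_probs in probabilities:
        # One output row per cell: parent name -> summed child probability.
        row = {
            parent: sum(cell_probs.get(child, 0.0) for child in children)
            for parent, children in hierarchy.items()
        }
        aggregated.append(row)
    return aggregated


# Hypothetical hierarchy and per-cell probabilities, for illustration only.
hierarchy = {"Immune": ["T cell", "B cell"], "Stromal": ["Fibroblast"]}
probabilities = [
    {"T cell": 0.6, "B cell": 0.3, "Fibroblast": 0.1},
    {"T cell": 0.1, "B cell": 0.1, "Fibroblast": 0.8},
]
grouped = aggregate_by_hierarchy(probabilities, hierarchy)
```

With these inputs the first cell carries most of its probability under "Immune" and the second under "Stromal".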

diagnostics_cellstate()[source]

Run diagnostics on cell state assignments

This performs a basic test by comparing the correlation values of marker genes with those of non-marker genes. It detects cases where a non-marker gene has a higher correlation with a cell state than that state's least-correlated marker gene.

  1. Get correlations between all cell states and proteins

  2. For each cell state c, get the smallest correlation with a marker g

  3. For each cell state c and each of its non-markers g, find any correlation that is bigger than the smallest marker correlation for c

  4. Any c and g pairs found in step 3 will be included in the output of Astir.diagnostics_cellstate(), including an explanation.

Return type

DataFrame

Returns

diagnostics
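
The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: it assumes Pearson correlation, takes activations and expression as plain lists per cell, and the names flag_non_markers, marker_a, marker_b, other_a, and other_b are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def flag_non_markers(
    state_activations: Dict[str, List[float]],
    expression: Dict[str, List[float]],
    markers: Dict[str, List[str]],
) -> List[Tuple[str, str, float, float]]:
    """Return (state, protein, correlation, smallest marker correlation) rows."""
    flagged = []
    for state, activation in state_activations.items():
        # Step 1: correlation of this state with every protein.
        corr = {p: pearson(activation, values) for p, values in expression.items()}
        # Step 2: smallest correlation among the state's own markers.
        min_marker_corr = min(corr[p] for p in markers[state])
        # Step 3: non-markers that correlate more strongly than that minimum.
        for protein, value in corr.items():
            if protein not in markers[state] and value > min_marker_corr:
                flagged.append((state, protein, value, min_marker_corr))
    return flagged


# Hypothetical data: four cells, one state, two markers, two non-markers.
state_activations = {"active": [1.0, 2.0, 3.0, 4.0]}
expression = {
    "marker_a": [1.0, 2.0, 3.0, 4.0],
    "marker_b": [1.0, 3.0, 2.0, 4.0],
    "other_a": [1.0, 2.0, 3.0, 5.0],
    "other_b": [4.0, 3.0, 2.0, 1.0],
}
markers = {"active": ["marker_a", "marker_b"]}
flagged = flag_non_markers(state_activations, expression, markers)
```

Here other_a tracks the state more closely than marker_b does, so it is the only pair reported (step 4).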

diagnostics_celltype(threshold=0.5, alpha=0.01)[source]

Run diagnostics on cell type assignments

This performs a basic test that cell types express their markers at higher levels than in other cell types. This function performs the following steps:

  1. Iterates through every cell type and every marker for that cell type

  2. Given a cell type c and marker g, find the set of cell types D that don’t have g as a marker

  3. For each cell type d in D, perform a t-test between the expression of marker g in c vs d

  4. If g is not expressed significantly higher (at significance alpha), output a diagnostic explaining this for further investigation.

Parameters
  • threshold (float) – The threshold at which cell types are assigned (see get_celltypes)

  • alpha (float) – The significance threshold for t-tests for determining over-expression

Return type

DataFrame

Returns

A pd.DataFrame listing cell types whose markers aren’t expressed significantly higher.
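
The four steps can be sketched with the standard library alone. Note the substitution: to avoid extra dependencies, this sketch computes a Welch t statistic and approximates the one-sided p-value with a normal distribution, which is only reasonable for larger samples; a proper t-test (for example scipy.stats.ttest_ind) would normally be used. The names check_markers and one_sided_p_value, the input layout, and all values are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance
from typing import Dict, List, Tuple


def one_sided_p_value(high: List[float], low: List[float]) -> float:
    """Approximate p-value for the alternative mean(high) > mean(low)."""
    se = math.sqrt(variance(high) / len(high) + variance(low) / len(low))
    t_stat = (mean(high) - mean(low)) / se
    # Normal approximation to the t distribution (assumption of this sketch).
    return 1.0 - NormalDist().cdf(t_stat)


def check_markers(
    expression: Dict[str, Dict[str, List[float]]],
    markers: Dict[str, List[str]],
    alpha: float = 0.01,
) -> List[Tuple[str, str, str]]:
    """Return (cell type c, marker g, other type d) where g is not higher in c."""
    issues = []
    for cell_type, type_markers in markers.items():      # step 1
        for marker in type_markers:
            # Step 2: cell types that do not list this marker.
            others = [d for d in markers if marker not in markers[d]]
            for other in others:                         # step 3
                p = one_sided_p_value(
                    expression[cell_type][marker], expression[other][marker]
                )
                if p >= alpha:                           # step 4
                    issues.append((cell_type, marker, other))
    return issues


# Hypothetical expression of each marker in the cells assigned to each type.
markers = {"T cell": ["CD3"], "B cell": ["CD20"]}
expression = {
    "T cell": {"CD3": [5.1, 4.9, 5.3, 5.0, 5.2], "CD20": [1.0, 1.2, 0.9, 1.1, 1.0]},
    "B cell": {"CD3": [1.1, 0.9, 1.0, 1.2, 0.8], "CD20": [1.1, 0.9, 1.0, 1.2, 1.0]},
}
issues = check_markers(expression, markers, alpha=0.01)
```

In this toy data CD3 clearly separates the two groups, while CD20 is no higher in the B cell group than in the T cell group, so only that pair is reported.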

fit_state(max_epochs=50, learning_rate=0.001, batch_size=128, delta_loss=0.001, n_init=5, n_init_epochs=5, delta_loss_batch=10, const=2, dropout_rate=0, batch_norm=False)[source]

Run Variational Bayes to infer cell states

Parameters
  • max_epochs (int) – number of epochs, defaults to 50

  • learning_rate (float) – the learning rate, defaults to 0.001

  • n_init (int) – the number of initial parameters to compare, defaults to 5

  • delta_loss (float) – stops iteration once the loss rate reaches delta_loss, defaults to 0.001

  • delta_loss_batch (int) – the batch size to consider delta loss, defaults to 10

Return type

None
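
The descriptions of delta_loss and delta_loss_batch leave the exact stopping rule open. The sketch below shows one plausible reading, stated here as an assumption rather than the library's behaviour: training stops when the mean loss over the last delta_loss_batch epochs differs from the mean over the preceding delta_loss_batch epochs by a relative amount smaller than delta_loss. The function name should_stop is hypothetical.

```python
from statistics import mean
from typing import List


def should_stop(
    losses: List[float], delta_loss: float = 0.001, delta_loss_batch: int = 10
) -> bool:
    """Assumed rule: stop when the windowed relative loss change is below delta_loss."""
    if len(losses) < 2 * delta_loss_batch:
        # Not enough history to compare two full windows yet.
        return False
    recent = mean(losses[-delta_loss_batch:])
    previous = mean(losses[-2 * delta_loss_batch:-delta_loss_batch])
    # Assumes the previous window's mean loss is non-zero.
    relative_change = abs(previous - recent) / abs(previous)
    return relative_change < delta_loss


flat = should_stop([100.0] * 20)                           # loss has plateaued
still_improving = should_stop([100.0 - i for i in range(20)])
too_short = should_stop([1.0] * 5)
```

Averaging over a window rather than comparing single epochs makes the check less sensitive to noise in minibatch losses.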

fit_type(max_epochs=50, learning_rate=0.001, batch_size=128, delta_loss=0.001, n_init=5, n_init_epochs=5)[source]

Run Variational Bayes to infer cell types

Parameters
  • max_epochs (int) – maximum number of epochs to train

  • learning_rate (float) – ADAM optimizer learning rate

  • batch_size (int) – minibatch size

  • delta_loss (float) – stops iteration once the loss rate reaches delta_loss, defaults to 0.001

  • n_init (int) – number of random initializations

Return type

None

get_cellstates()[source]

Get cell state activations. Returns the rescaled activations, with values between 0 and 1.

Returns

state assignments

Return type

pd.DataFrame

get_celltype_probabilities()[source]

Get the cell assignment probability.

Returns

self.assignments

Return type

pd.DataFrame

get_celltypes(threshold=0.7, assignment_type='threshold')[source]

Get the most likely cell type

A cell is assigned to a cell type if the probability is greater than threshold. If no cell types have a probability higher than threshold, then “Unknown” is returned

Parameters
  • threshold (float) – the probability threshold above which a cell is assigned to a cell type

  • assignment_type (str) – See astir.CellTypeModel.get_celltypes() for full documentation

Return type

DataFrame

Returns

a data frame with most likely cell types for each cell
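
The thresholding rule described above can be sketched as follows. This is a minimal illustration of the 'threshold' behaviour only, assuming per-cell probabilities are available as plain dictionaries; the function name assign_cell_types and the example values are hypothetical, and other assignment_type modes are not covered.

```python
from typing import Dict, List


def assign_cell_types(
    probabilities: List[Dict[str, float]], threshold: float = 0.7
) -> List[str]:
    """Label each cell with its most probable type, or "Unknown"."""
    labels = []
    for cell_probs in probabilities:
        best_type, best_prob = max(cell_probs.items(), key=lambda item: item[1])
        # A type is assigned only if its probability exceeds the threshold.
        labels.append(best_type if best_prob > threshold else "Unknown")
    return labels


# Hypothetical probabilities for two cells.
labels = assign_cell_types(
    [{"T cell": 0.9, "B cell": 0.1}, {"T cell": 0.5, "B cell": 0.5}],
    threshold=0.7,
)
```

The first cell is labelled "T cell"; the second has no type above 0.7 and is labelled "Unknown".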

get_hierarchy_dict()[source]

Get the dictionary for cell type hierarchical structure.

Returns

self._hierarchy_dict

Return type

Dict[str, List[str]]

get_state_dataset()[source]

Get the SCDataset for cell state training.

Return type

SCDataset

Returns

self._state_dset

get_state_losses()[source]

Get the losses recorded while training the state model.

Returns

a numpy array of losses for each training iteration the model runs

Return type

np.array

get_state_model()[source]

Get the trained CellStateModel.

Raises

Exception – raised when this function is called before the model is trained.

Return type

CellStateModel

Returns

self._state_ast

get_state_run_info()[source]

Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs, delta_loss_batch) of the cell state training.

Raises

Exception – raised when this function is called before the model is trained.

Return type

Dict[str, Union[int, float]]

Returns

self._state_run_info

get_type_dataset()[source]

Get the SCDataset for cell type training.

Return type

SCDataset

Returns

self._type_dset

get_type_losses()[source]

Get the final losses of the type model.

Returns

self.losses

Return type

np.array

get_type_model()[source]

Get the trained CellTypeModel.

Raises

Exception – raised when this function is called before the model is trained.

Return type

CellTypeModel

Returns

self._type_ast

get_type_run_info()[source]

Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs) of the cell type training.

Raises

Exception – raised when this function is called before the model is trained.

Return type

Dict[str, Union[int, float]]

Returns

self._type_run_info

load_model(hdf5_name)[source]

Load model from hdf5 file

Parameters

hdf5_name (str) – the full path to file

Return type

None

normalize(percentile_lower=1, percentile_upper=99)[source]

Normalize the expression data

This performs a two-step normalization:

  1. Applies a log(1+x) transformation to the data

  2. Winsorizes to (percentile_lower, percentile_upper)

Parameters
  • percentile_lower (int) – Lower percentile for winsorization

  • percentile_upper (int) – Upper percentile for winsorization

Return type

None
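
The two steps can be sketched for a single feature as below. This is an illustration under assumptions, not the library's code: it assumes winsorization is applied to the log-transformed values of one feature at a time, and that percentiles use linear interpolation between order statistics. The helper names percentile and log_winsorize are hypothetical, and wide percentiles are used in the example so the clipping is visible on five values.

```python
import math
from typing import List


def percentile(values: List[float], q: float) -> float:
    """q-th percentile using linear interpolation between order statistics."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def log_winsorize(
    values: List[float], percentile_lower: float = 1, percentile_upper: float = 99
) -> List[float]:
    """Apply log(1+x), then clip to the given percentile range."""
    logged = [math.log1p(v) for v in values]              # step 1
    low = percentile(logged, percentile_lower)
    high = percentile(logged, percentile_upper)
    return [min(max(v, low), high) for v in logged]       # step 2


# Hypothetical expression values for one protein, with one extreme cell.
normalized = log_winsorize([0.0, 1.0, 2.0, 3.0, 1000.0], 25, 75)
```

The extreme value 1000 is pulled down to the 75th percentile of the logged data, and the zero is raised to the 25th percentile.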

predict_cellstates(dset=None)[source]

Predict cell state activations for a dataset using an existing model

Parameters

dset – the dataset to predict cell state activations for, defaults to None

Return type

DataFrame

Returns

the prediction of cell state activations

predict_celltypes(dset=None)[source]

Predict the probabilities of different cell type assignments.

Parameters

dset (pd.DataFrame, optional) – the single cell protein expression dataset to predict, defaults to None

Raises

Exception – raised when this function is called before the type model is trained.

Returns

the probabilities of different cell type assignments

Return type

pd.DataFrame

save_models(hdf5_name='astir_summary.hdf5')[source]

Save the summary of this model to an hdf5 file.

Parameters

hdf5_name (str) – name of the output hdf5 file, defaults to “astir_summary.hdf5”

Raises

Exception – raised when this function is called before the model is trained.

Return type

None

state_to_csv(output_csv)[source]

Write the state assignments from the trained state model to a csv file

Parameters

output_csv (str, required) – path to output csv

Return type

None

type_clustermap(plot_name='celltype_protein_cluster.png', threshold=0.7, figsize=(7, 5), prob_assign=None)[source]

Save the heatmap of protein content in cells with cell types labeled.

Parameters
  • plot_name (str, optional) – name of the plot; an extension (e.g. .png or .jpg) is required, defaults to “celltype_protein_cluster.png”

  • threshold (float, optional) – the probability threshold above which a cell is assigned to a cell type, defaults to 0.7

Return type

None

type_to_csv(output_csv, threshold=0.7, assignment_type='threshold')[source]

Save the cell type assignment to a csv file.

Parameters
  • output_csv (str) – name for the output .csv file

  • assignment_type (str) – See astir.CellTypeModel.get_celltypes() for full documentation

Return type

None