astir package
Module contents
- class astir.Astir(input_expr=None, marker_dict=None, design=None, random_seed=1234, dtype=torch.float64)
Bases: object
Create an Astir object
- Parameters
input_expr (Union[pd.DataFrame, Tuple[np.array, List[str], List[str]], Tuple[SCDataset, SCDataset]]) – the single cell protein expression dataset
marker_dict (Dict[str, Dict[str, str]], optional) – the marker dictionary which maps cell type/state to protein features, defaults to None
design (Union[pd.DataFrame, np.array], optional) – the design matrix labeling the grouping of cell, defaults to None
random_seed (int, optional) – random seed for parameter initialization, defaults to 1234
dtype (torch.dtype, optional) – dtype of data, defaults to torch.float64
- Raises
NotClassifiableError – raised if the model is not trainable
Methods:
assign_celltype_hierarchy([depth]) – Get cell type assignment at a specified higher hierarchy according to the hierarchy provided in the dictionary.
diagnostics_cellstate() – Run diagnostics on cell state assignments
diagnostics_celltype([threshold, alpha]) – Run diagnostics on cell type assignments
fit_state([max_epochs, learning_rate, …]) – Run Variational Bayes to infer cell states
fit_type([max_epochs, learning_rate, …]) – Run Variational Bayes to infer cell types
get_cellstates() – Get cell state activations.
get_celltype_probabilities() – Get the cell assignment probability.
get_celltypes([threshold, assignment_type]) – Get the most likely cell type
get_hierarchy_dict() – Get the dictionary for cell type hierarchical structure.
get_state_dataset() – Get the SCDataset for cell state training.
get_state_losses() – Getter for losses
get_state_model() – Get the trained CellStateModel.
get_state_run_info() – Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs, delta_loss_batch) of the cell state training.
get_type_dataset() – Get the SCDataset for cell type training.
get_type_losses() – Get the final losses of the type model.
get_type_model() – Get the trained CellTypeModel.
get_type_run_info() – Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs) of the cell type training.
load_model(hdf5_name) – Load model from hdf5 file
normalize([percentile_lower, percentile_upper]) – Normalize the expression data
predict_cellstates([dset]) – Get the predicted cell state activations for a dataset using an existing model
predict_celltypes([dset]) – Predict the probabilities of different cell type assignments.
save_models([hdf5_name]) – Save the summary of this model to an hdf5 file.
state_to_csv(output_csv) – Write the state assignments from the trained state model to a csv file
type_clustermap([plot_name, threshold, …]) – Save the heatmap of protein content in cells with cell types labeled.
type_to_csv(output_csv[, threshold, …]) – Save the cell type assignment to a csv file.
- assign_celltype_hierarchy(depth=1)
Get cell type assignment at a specified higher hierarchy according to the hierarchy provided in the dictionary.
- Parameters
depth (int, optional) – the depth of hierarchy to assign probability to, defaults to 1
- Raises
Exception – raised when the dictionary for hierarchical structure is not provided or the model hasn’t been trained.
- Returns
probability assignment of cell type at a superstructure
- Return type
pd.DataFrame
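As a rough illustration of what assigning probability at a higher level of the hierarchy can mean, the sketch below sums the probabilities of child cell types into their parent group. Summation is an assumption made for this example and is not confirmed by the documentation; `aggregate_by_hierarchy`, the hierarchy, and the probabilities are all hypothetical and use plain dictionaries instead of a pd.DataFrame.

```python
def aggregate_by_hierarchy(probabilities, hierarchy):
    """Sum per-cell-type probabilities into parent groups (assumed behaviour).

    probabilities: list of dicts mapping cell type -> probability, one per cell
    hierarchy:     dict mapping parent name -> list of child cell types
    """
    aggregated = []
    for cell_probs in probabilities:
        row = {
            parent: sum(cell_probs.get(child, 0.0) for child in children)
            for parent, children in hierarchy.items()
        }
        aggregated.append(row)
    return aggregated


# Hypothetical hierarchy and probabilities, for illustration only.
hierarchy = {"Immune": ["T cell", "B cell"], "Stromal": ["Fibroblast"]}
probabilities = [
    {"T cell": 0.5, "B cell": 0.25, "Fibroblast": 0.25},
    {"T cell": 0.125, "B cell": 0.125, "Fibroblast": 0.75},
]
parent_probs = aggregate_by_hierarchy(probabilities, hierarchy)
```

The hierarchy here has the same shape as the Dict[str, List[str]] returned by get_hierarchy_dict().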
- diagnostics_cellstate()
Run diagnostics on cell state assignments
This performs a basic test by comparing the correlation values of all marker genes with those of all non-marker genes. It detects cases where a non-marker gene has a higher correlation than the smallest correlation among the marker genes. This function performs the following steps:
1. Get correlations between all cell states and proteins.
2. For each cell state c, get the smallest correlation with any of its markers g.
3. For each cell state c and each of its non-markers g, find any correlation that is larger than the smallest marker correlation for c.
4. Any c and g pairs found in step 3 are included in the output of Astir.diagnostics_cellstate(), along with an explanation.
- Return type
DataFrame
- Returns
diagnostics
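The correlation check described above can be sketched in plain Python. This is a minimal illustration of the idea and not astir's implementation: `pearson`, `diagnose_states`, and the example data are hypothetical, and the inputs are dictionaries of lists instead of the DataFrames the library works with.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def diagnose_states(activations, expression, markers):
    """Flag (state, protein) pairs where a non-marker correlates more strongly
    with a state than that state's weakest marker.

    activations: dict state -> list of per-cell activations
    expression:  dict protein -> list of per-cell expression values
    markers:     dict state -> list of marker proteins
    """
    problems = []
    for state, state_values in activations.items():
        # Correlation between this state and every protein.
        corr = {p: pearson(state_values, v) for p, v in expression.items()}
        # Smallest correlation among the state's own markers.
        min_marker_corr = min(corr[m] for m in markers[state])
        # Non-markers whose correlation exceeds that minimum.
        for protein, value in corr.items():
            if protein not in markers[state] and value > min_marker_corr:
                problems.append((state, protein, value, min_marker_corr))
    return problems


# Hypothetical data: four cells, one state, two markers, two non-markers.
activations = {"hypoxia": [0.1, 0.4, 0.6, 0.9]}
expression = {
    "CA9": [1.0, 2.0, 3.0, 4.0],    # marker, strongly correlated
    "VEGF": [2.0, 1.0, 4.0, 3.0],   # marker, weakly correlated
    "Ki67": [1.0, 2.0, 3.0, 5.0],   # non-marker, strongly correlated
    "CD45": [4.0, 3.0, 2.0, 1.0],   # non-marker, anti-correlated
}
markers = {"hypoxia": ["CA9", "VEGF"]}
problems = diagnose_states(activations, expression, markers)
```

In this example only Ki67 is flagged, because its correlation with the state exceeds that of the weakest marker, VEGF.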
- diagnostics_celltype(threshold=0.5, alpha=0.01)
Run diagnostics on cell type assignments
This performs a basic test that cell types express their markers at higher levels than in other cell types. This function performs the following steps:
1. Iterates through every cell type and every marker for that cell type.
2. Given a cell type c and marker g, find the set of cell types D that don’t have g as a marker.
3. For each cell type d in D, perform a t-test between the expression of marker g in c vs d.
4. If g is not expressed significantly higher (at significance alpha), output a diagnostic explaining this for further investigation.
- Parameters
threshold (float) – The threshold at which cell types are assigned (see get_celltypes)
alpha (float) – The significance threshold for t-tests for determining over-expression
- Return type
DataFrame
- Returns
A pd.DataFrame listing cell types whose markers aren’t expressed significantly higher.
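The marker test can be sketched as follows. To stay dependency-free, this example swaps in Welch's t statistic with a normal approximation for the p-value. That is a simplification chosen for illustration and is only reasonable for larger samples; a proper t-test (for example scipy.stats.ttest_ind) would normally be used. `one_sided_p_value`, `diagnose_types`, and the data are hypothetical and are not astir's implementation.

```python
from statistics import NormalDist, mean, variance


def one_sided_p_value(high, low):
    """Approximate p-value for the alternative mean(high) > mean(low).

    Welch's t statistic, with a normal approximation to the t distribution.
    """
    standard_error = (variance(high) / len(high) + variance(low) / len(low)) ** 0.5
    t_stat = (mean(high) - mean(low)) / standard_error
    return 1.0 - NormalDist().cdf(t_stat)


def diagnose_types(expression_by_type, markers, alpha=0.01):
    """Report markers that are not significantly higher in their own cell type.

    expression_by_type: dict cell type -> dict protein -> expression values
                        of the cells assigned to that type
    markers:            dict cell type -> list of marker proteins
    """
    issues = []
    for cell_type, type_markers in markers.items():
        for marker in type_markers:
            # Cell types that do not list this protein as a marker.
            others = [d for d in markers if marker not in markers[d]]
            for other in others:
                p = one_sided_p_value(
                    expression_by_type[cell_type][marker],
                    expression_by_type[other][marker],
                )
                if p >= alpha:
                    issues.append((cell_type, marker, other, p))
    return issues


# Hypothetical data: CD3 separates T cells well, CD20 does not separate B cells.
markers = {"T cell": ["CD3"], "B cell": ["CD20"]}
expression_by_type = {
    "T cell": {"CD3": [5.0, 5.2, 4.9, 5.1, 5.3], "CD20": [1.0, 1.1, 0.9, 1.2, 1.0]},
    "B cell": {"CD3": [1.0, 1.2, 0.8, 1.1, 0.9], "CD20": [1.1, 0.9, 1.0, 1.2, 0.8]},
}
issues = diagnose_types(expression_by_type, markers)
```

Here the only reported issue is CD20 in B cells compared with T cells, since its expression is not higher in the cell type it is supposed to mark.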
- fit_state(max_epochs=50, learning_rate=0.001, batch_size=128, delta_loss=0.001, n_init=5, n_init_epochs=5, delta_loss_batch=10, const=2, dropout_rate=0, batch_norm=False)
Run Variational Bayes to infer cell states
- Parameters
max_epochs (int) – number of epochs, defaults to 50
learning_rate (float) – the learning rate, defaults to 0.001
n_init (int) – the number of initial parameters to compare, defaults to 5
delta_loss (float) – stops iteration once the loss rate reaches delta_loss, defaults to 0.001
delta_loss_batch (int) – the batch size to consider delta loss, defaults to 10
- Return type
None
- fit_type(max_epochs=50, learning_rate=0.001, batch_size=128, delta_loss=0.001, n_init=5, n_init_epochs=5)
Run Variational Bayes to infer cell types
- Parameters
max_epochs (int) – maximum number of epochs to train
learning_rate (float) – ADAM optimizer learning rate
batch_size (int) – minibatch size
delta_loss (float) – stops iteration once the loss rate reaches delta_loss, defaults to 0.001
n_init (int) – number of random initializations
- Return type
None
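The documentation does not say how the loss rate compared against delta_loss is computed. One plausible reading, assumed here purely for illustration, is the relative change in loss between consecutive epochs. The sketch below shows how max_epochs and delta_loss could interact under that assumption; `train_until_converged` is a hypothetical helper and not part of astir.

```python
def train_until_converged(losses_per_epoch, max_epochs=50, delta_loss=0.001):
    """Consume per-epoch losses and stop once the relative change in loss
    falls below delta_loss or max_epochs is reached (assumed criterion)."""
    history = []
    for epoch, loss in enumerate(losses_per_epoch):
        if epoch >= max_epochs:
            break
        history.append(loss)
        if len(history) > 1:
            previous = history[-2]
            relative_change = abs(previous - loss) / abs(previous)
            if relative_change < delta_loss:
                break
    return history


# Hypothetical loss curve: training stops after the third epoch.
history = train_until_converged([100.0, 50.0, 49.99, 49.0])
# With a tight epoch budget, max_epochs is what ends training.
capped = train_until_converged([100.0, 50.0, 49.99, 49.0], max_epochs=2)
```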
- get_cellstates()
Get cell state activations. It returns the rescaled activations, values between 0 and 1
- Returns
state assignments
- Return type
pd.DataFrame
- get_celltype_probabilities()
Get the cell assignment probability.
- Returns
self.assignments
- Return type
pd.DataFrame
- get_celltypes(threshold=0.7, assignment_type='threshold')
Get the most likely cell type
A cell is assigned to a cell type if the probability is greater than threshold. If no cell types have a probability higher than threshold, then “Unknown” is returned.
- Parameters
threshold (float) – the probability threshold above which a cell is assigned to a cell type
assignment_type (str) – See astir.CellTypeModel.get_celltypes() for full documentation
- Return type
DataFrame
- Returns
a data frame with the most likely cell type for each cell
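The thresholding rule described for this method can be sketched in a few lines. This is a minimal illustration of the rule and not the library's code: `assign_cell_types` and the probabilities are hypothetical, and only the threshold behaviour is shown, not the other options of assignment_type.

```python
def assign_cell_types(probabilities, threshold=0.7):
    """Return the most likely cell type per cell, or "Unknown" when no
    probability is greater than the threshold.

    probabilities: list of dicts mapping cell type -> probability, one per cell
    """
    assigned = []
    for cell_probs in probabilities:
        best_type = max(cell_probs, key=cell_probs.get)
        if cell_probs[best_type] > threshold:
            assigned.append(best_type)
        else:
            assigned.append("Unknown")
    return assigned


# Hypothetical probabilities for two cells.
labels = assign_cell_types(
    [
        {"T cell": 0.9, "B cell": 0.1},
        {"T cell": 0.5, "B cell": 0.5},
    ]
)
```

The second cell is labeled "Unknown" because neither probability exceeds 0.7.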
- get_hierarchy_dict()
Get the dictionary for cell type hierarchical structure.
- Returns
self._hierarchy_dict
- Return type
Dict[str, List[str]]
- get_state_dataset()
Get the SCDataset for cell state training.
- Return type
SCDataset
- Returns
self._state_dset
- get_state_losses()
Getter for losses
- Returns
a numpy array of losses for each training iteration the model runs
- Return type
np.array
- get_state_model()
Get the trained CellStateModel.
- Raises
Exception – raised when this function is called before the model is trained.
- Return type
CellStateModel
- Returns
self._state_ast
- get_state_run_info()
Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs, delta_loss_batch) of the cell state training.
- Raises
Exception – raised when this function is called before the model is trained.
- Return type
Dict[str, Union[int, float]]
- Returns
self._state_run_info
- get_type_dataset()
Get the SCDataset for cell type training.
- Return type
SCDataset
- Returns
self._type_dset
- get_type_losses()
Get the final losses of the type model.
- Returns
self.losses
- Return type
np.array
- get_type_model()
Get the trained CellTypeModel.
- Raises
Exception – raised when this function is called before the model is trained.
- Return type
CellTypeModel
- Returns
self._type_ast
- get_type_run_info()
Get the run information (i.e. max_epochs, learning_rate, batch_size, delta_loss, n_init, n_init_epochs) of the cell type training.
- Raises
Exception – raised when this function is called before the model is trained.
- Return type
Dict[str, Union[int, float]]
- Returns
self._type_run_info
- load_model(hdf5_name)
Load model from hdf5 file
- Parameters
hdf5_name (str) – the full path to file
- Return type
None
- normalize(percentile_lower=1, percentile_upper=99)
Normalize the expression data
This performs a two-step normalization:
1. A log(1+x) transformation to the data
2. Winsorizes to (percentile_lower, percentile_upper)
- Parameters
percentile_lower (int) – Lower percentile for winsorization
percentile_upper (int) – Upper percentile for winsorization
- Return type
None
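The two normalization steps can be sketched for a single feature in plain Python. This is an illustration under stated assumptions and not astir's implementation: the helper functions are hypothetical, percentiles are computed with linear interpolation, and whether the library winsorizes each feature separately is not stated in the documentation.

```python
import math


def percentile(sorted_values, q):
    """Percentile q (0-100) of pre-sorted values, using linear interpolation."""
    position = (len(sorted_values) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def normalize(values, percentile_lower=1, percentile_upper=99):
    """Apply log(1+x), then clip to the given lower and upper percentiles."""
    logged = [math.log1p(v) for v in values]
    ordered = sorted(logged)
    low = percentile(ordered, percentile_lower)
    high = percentile(ordered, percentile_upper)
    # Winsorize: values outside [low, high] are pulled to the nearest bound.
    return [min(max(v, low), high) for v in logged]


# Hypothetical expression values with one extreme outlier; wide percentiles
# are used so the clipping is visible on five data points.
normalized = normalize(
    [0.0, 1.0, 2.0, 3.0, 1000.0], percentile_lower=25, percentile_upper=75
)
```

Winsorizing clips outliers to the percentile bounds and keeps every cell, so the output has the same length as the input.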
- predict_cellstates(dset=None)
Get the predicted cell state activations for a dataset using an existing model
- Parameters
dset – the dataset to predict cell state activations for, defaults to None
- Return type
DataFrame
- Returns
the prediction of cell state activations
- predict_celltypes(dset=None)
Predict the probabilities of different cell type assignments.
- Parameters
dset (pd.DataFrame, optional) – the single cell protein expression dataset to predict, defaults to None
- Raises
Exception – when the type model is not trained when this function is called
- Returns
the probabilities of different cell type assignments
- Return type
pd.DataFrame
- save_models(hdf5_name='astir_summary.hdf5')
Save the summary of this model to an hdf5 file.
- Parameters
hdf5_name (str) – name of the output hdf5 file, defaults to “astir_summary.hdf5”
- Raises
Exception – raised when this function is called before the model is trained.
- Return type
None
- state_to_csv(output_csv)
Write the state assignments from the trained state model to a csv file
- Parameters
output_csv (str, required) – path to output csv
- Return type
None
- type_clustermap(plot_name='celltype_protein_cluster.png', threshold=0.7, figsize=(7, 5), prob_assign=None)
Save the heatmap of protein content in cells with cell types labeled.
- Parameters
plot_name (str, optional) – name of the plot, including a file extension (e.g. .png or .jpg), defaults to “celltype_protein_cluster.png”
threshold (float, optional) – the probability threshold above which a cell is assigned to a cell type, defaults to 0.7
- Return type
None