astir.data package
Module contents
Classes:
SCDataset | Container for single-cell proteomic data in the form of a PyTorch dataset
Functions:
from_anndata_yaml | Create an Astir object from an anndata.Anndata file and a marker YAML
from_csv_dir_yaml | Create an Astir object from a directory containing multiple CSV files
from_csv_yaml | Create an Astir object from an expression CSV and marker YAML
from_loompy_yaml | Create an Astir object from a loom file and a marker YAML
- class astir.data.SCDataset(expr_input, marker_dict, include_other_column, design=None, dtype=torch.float64, device=device(type='cpu'))
Bases: Generic[torch.utils.data.dataset.T_co]
Container for single-cell proteomic data in the form of a PyTorch dataset
- Parameters
expr_input (Union[DataFrame, Tuple[Union[array, Tensor], List[str], List[str]]]) – Input expression data, given as either a pd.DataFrame or a three-element tuple. When it is a pd.DataFrame, its index and columns should hold the cell names and feature names of the dataset. When it is a three-element tuple, it should have the form Tuple[Union[np.array, torch.Tensor], List[str], List[str]]: the first element is the data itself as either an np.array or a torch.Tensor, the second is a list of the column (feature) names, and the third is a list of the index (cell) names.
marker_dict (Dict[str, List[str]]) – Marker dictionary containing cell type and state information. The dictionary maps the name of each cell type/state to its protein features.
design (Union[DataFrame, array, None]) – A design matrix
include_other_column (bool) – Should an additional ‘other’ column be included?
dtype (dtype) – torch datatype of the model
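As a rough illustration of the two accepted forms of expr_input, the sketch below builds the same toy dataset as a DataFrame and as a three-element tuple, along with a matching marker dictionary. The cell names, protein names, cell types, and values are all made up for the example, and the constructor call is left as a comment because it assumes astir is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 cells (rows) by 4 protein features (columns).
cell_names = ["cell_1", "cell_2", "cell_3"]
feature_names = ["CD3", "CD19", "CD45", "Ki67"]
values = np.array(
    [
        [5.1, 0.2, 3.3, 0.1],
        [0.3, 4.8, 2.9, 1.7],
        [4.7, 0.1, 3.1, 2.2],
    ]
)

# Form 1: a DataFrame whose index holds cell names and whose columns hold feature names.
expr_df = pd.DataFrame(values, index=cell_names, columns=feature_names)

# Form 2: a three-element tuple of (data, feature names, cell names).
expr_tuple = (values, feature_names, cell_names)

# Marker dictionary mapping each (hypothetical) cell type name to its protein features.
marker_dict = {
    "T cell": ["CD3", "CD45"],
    "B cell": ["CD19", "CD45"],
}

# With astir installed, either form could then be passed to the constructor, e.g.
#   SCDataset(expr_input=expr_df, marker_dict=marker_dict, include_other_column=True)
```

Note the ordering in the tuple form: feature names come second and cell names third, which is the reverse of the (rows, columns) order some readers may expect.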
Methods:
Get the cell names.
get_classes() | Get the cell types/states.
Get the design matrix.
Get the dtype of the SCDataset.
Return the expression data as a torch.Tensor.
Return the expression data as a pandas.DataFrame.
get_features() | Get the features (proteins).
Return the marker matrix as a torch.Tensor.
get_mu() | Get the mean expression of each protein as a torch.Tensor.
get_mu_init([n_putative_cells]) | Intelligent initialization for mu parameters
get_n_cells() | Get the number of cells.
get_n_classes() | Get the number of ‘classes’: either the number of cell types or cell states.
Get the number of features (proteins).
get_sigma() | Get the standard deviation of each protein
normalize([percentile_lower, …]) | Normalize the expression data
rescale() | Normalize the expression data.
- get_classes()
Get the cell types/states.
- Returns
The cell type/state names (self._classes)
- Return type
List[str]
- get_features()
Get the features (proteins).
- Returns
The feature (protein) names (self._m_features)
- Return type
List[str]
- get_mu_init(n_putative_cells=10)
Intelligent initialization for mu parameters. See the manuscript for details.
- Parameters
n_putative_cells (int) – Number of cells to guess as given cell type
- Return type
ndarray
- get_n_cells()
Get the number of cells.
- Return type
int
- get_n_classes()
Get the number of ‘classes’: either the number of cell types or cell states.
- Return type
int
- get_sigma()
Get the standard deviation of each protein
- Return type
Tensor
- Returns
standard deviation of each protein
- normalize(percentile_lower=0, percentile_upper=99.9, cofactor=5.0)
Normalize the expression data
This performs a two-step normalization:
1. A log(1+x) transformation of the data
2. Winsorization to (percentile_lower, percentile_upper)
- Parameters
percentile_lower (float) – the lower bound percentile for winsorization, defaults to 0
percentile_upper (float) – the upper bound percentile for winsorization, defaults to 99.9
cofactor (float) – a cofactor constant, defaults to 5.0
- Return type
None
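The two steps described for normalize can be sketched with NumPy as follows. This is a minimal illustration, not the library's implementation: the function name normalize_sketch is hypothetical, winsorizing each feature (column) separately is an assumption, and the cofactor argument is left out because this section does not say how it enters the transformation.

```python
import numpy as np


def normalize_sketch(exprs, percentile_lower=0.0, percentile_upper=99.9):
    """Illustrative log(1+x) transform followed by per-feature winsorization."""
    # Step 1: log(1 + x) transformation.
    transformed = np.log1p(np.asarray(exprs, dtype=float))
    # Step 2: clip each feature (column) to its lower and upper percentiles.
    # Per-column percentiles are an assumption of this sketch.
    q_low = np.percentile(transformed, percentile_lower, axis=0)
    q_high = np.percentile(transformed, percentile_upper, axis=0)
    return np.clip(transformed, q_low, q_high)


# Toy matrix of 4 cells by 2 features, with one extreme value in the first feature.
exprs = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [1000.0, 4.0]])
normalized = normalize_sketch(exprs, percentile_lower=0, percentile_upper=75)
```

With an upper percentile below 100, the extreme value in the first column is pulled down to that column's percentile bound, which is the purpose of winsorization.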
- astir.data.from_anndata_yaml(anndata_file, marker_yaml, protein_name=None, cell_name=None, batch_name='batch', create_design_mat=True, random_seed=1234, dtype=torch.float64)
Create an Astir object from an anndata.Anndata file and a marker yaml
- Parameters
anndata_file (str) – Path to an anndata.Anndata h5py file
marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.
protein_name (Optional[str]) – The column of adata.var containing protein names. If this is None, defaults to adata.var_names
cell_name (Optional[str]) – The column of adata.obs containing cell names. If this is None, defaults to adata.obs_names
batch_name (str) – The column of adata.obs containing batch names. If present, it is used to build a design matrix with a one-hot encoding to control for batch, defaults to ‘batch’
create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.
random_seed (int) – The random seed to be used to initialize variables, defaults to 1234
dtype (dtype) – datatype of the model parameters, defaults to torch.float64
- Return type
Any
- Returns
An object of class astir_bash.py.Astir using data imported from the anndata file
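The batch_name description mentions a design matrix built from a one-hot encoding of batch labels. The sketch below shows what such an encoding looks like in general. The helper one_hot_design and the batch labels are hypothetical, and the library's own design matrix may differ, for example in column order or in whether an intercept column is included.

```python
import numpy as np


def one_hot_design(batch_labels):
    """Return (design matrix, batch names) with one indicator column per batch."""
    # Sort the distinct labels so the column order is deterministic.
    batches = sorted(set(batch_labels))
    column_of = {batch: j for j, batch in enumerate(batches)}
    design = np.zeros((len(batch_labels), len(batches)))
    for i, batch in enumerate(batch_labels):
        # Mark the column that corresponds to this cell's batch.
        design[i, column_of[batch]] = 1.0
    return design, batches


# Four cells drawn from three hypothetical batches.
design, batches = one_hot_design(["b1", "b2", "b1", "b3"])
```

Each row corresponds to one cell and contains a single 1 in the column of its batch.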
- astir.data.from_csv_dir_yaml(input_dir, marker_yaml, create_design_mat=True, random_seed=1234, dtype=torch.float64)
Create an Astir object from a directory containing multiple CSV files
- Parameters
input_dir (str) – Path to a directory containing multiple CSV files, each in the format expected by from_csv_yaml
marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.
design_csv – Path to design matrix as a CSV. Rows should be cells, and columns covariates. First column is cell identifier, and additional column names are covariate identifiers.
create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.
random_seed (int) – The random seed to be used to initialize variables, defaults to 1234
dtype (dtype) – datatype of the model parameters, defaults to torch.float64
- Return type
Any
- astir.data.from_csv_yaml(csv_input, marker_yaml, design_csv=None, create_design_mat=True, random_seed=1234, dtype=torch.float64)
Create an Astir object from an expression CSV and marker YAML
- Parameters
csv_input (str) – Path to input CSV containing expression for cells (rows) by proteins (columns). First column is cell identifier, and additional column names are gene identifiers.
marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.
design_csv (Optional[str]) – Path to design matrix as a CSV. Rows should be cells, and columns covariates. First column is cell identifier, and additional column names are covariate identifiers.
create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.
random_seed (int) – The random seed to be used to initialize variables, defaults to 1234
dtype (dtype) – datatype of the model parameters, defaults to torch.float64
- Return type
Any
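To make the expected CSV layouts concrete, the sketch below parses a small expression CSV and a matching design CSV with the standard library. The file contents, the empty header cell over the identifier column, and the helper read_matrix are all assumptions made for illustration. Only the rule that the first column holds the cell identifier comes from the parameter descriptions above.

```python
import csv
import io

# Hypothetical expression CSV: the first column is the cell identifier and the
# remaining column names are protein identifiers.
expr_csv = """,CD3,CD19,CD45
cell_1,5.1,0.2,3.3
cell_2,0.3,4.8,2.9
"""

# Hypothetical design CSV: rows are cells and columns are covariates.
design_csv = """,batch_a,batch_b
cell_1,1,0
cell_2,0,1
"""


def read_matrix(text):
    """Split CSV text into row names, column names, and numeric values."""
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    column_names = header[1:]  # skip the header cell above the identifiers
    row_names = [row[0] for row in body]  # first column: cell identifier
    values = [[float(v) for v in row[1:]] for row in body]
    return row_names, column_names, values


cells, proteins, exprs = read_matrix(expr_csv)
design_cells, covariates, design = read_matrix(design_csv)
```

The cell identifiers in the design CSV match those in the expression CSV here, so each design row can be paired with its expression row.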
- astir.data.from_loompy_yaml(loom_file, marker_yaml, protein_name_attr='protein', cell_name_attr='cell_name', batch_name_attr='batch', create_design_mat=True, random_seed=1234, dtype=torch.float64)
Create an Astir object from a loom file and a marker yaml
- Parameters
loom_file (str) – Path to a loom file, where rows correspond to proteins and columns to cells
marker_yaml (str) – Path to input YAML file containing marker gene information. Should include cell_type and cell_state entries. See documentation.
protein_name_attr (str) – The attribute (key) in the row attributes that identifies the protein names (required to match with the marker gene information), defaults to protein
cell_name_attr (str) – The attribute (key) in the column attributes that identifies the name of each cell, defaults to cell_name
batch_name_attr (str) – The attribute (key) in the column attributes that identifies the batch. If present, it is used to build a design matrix with a one-hot encoding to control for batch, defaults to batch
create_design_mat (bool) – Determines whether a design matrix is created. Defaults to True.
random_seed (int) – The random seed to be used to initialize variables, defaults to 1234
dtype (dtype) – datatype of the model parameters, defaults to torch.float64
- Return type
Any
- Returns
An object of class astir_bash.py.Astir using data imported from the loom files