Operations¶
We do not recommend interacting with these functions directly. The core Pycytominer API uses these operations internally.
pycytominer.operations.correlation_threshold module¶
Returns list of features such that no two features have a correlation greater than a specified threshold
- pycytominer.operations.correlation_threshold.correlation_threshold(population_df, features='infer', samples='all', threshold=0.9, method='pearson')¶
Exclude features that have correlations above a certain threshold
- Parameters:
population_df (pandas.core.frame.DataFrame) – DataFrame that includes metadata and observation features.
features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.
samples (str, default "all") – Samples to use for the calculation, given as a pd.DataFrame.query() string, e.g. “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples.
threshold (float, default 0.9) – Exclude features whose pairwise correlation exceeds this value. Must be between 0 and 1.
method (str, default "pearson") – Correlation metric used to test the cutoff.
- Returns:
excluded_features – List of features to exclude from the population_df.
- Return type:
list of str
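The behavior documented above can be sketched with plain pandas. This is a minimal illustration, not pycytominer's actual implementation; the toy DataFrame and its column names are invented for the example:

```python
import pandas as pd

# Toy profile: "Cells_b" is almost perfectly correlated with "Cells_a"
df = pd.DataFrame({
    "Metadata_treatment": ["control", "control", "drug", "drug"],
    "Cells_a": [1.0, 2.0, 3.0, 4.0],
    "Cells_b": [1.1, 2.0, 3.1, 4.0],
    "Nuclei_c": [4.0, 1.0, 3.0, 2.0],
})

# Infer CellProfiler features by prefix, as documented
features = [c for c in df.columns if c.startswith(("Cells", "Nuclei", "Cytoplasm"))]
corr = df[features].corr(method="pearson").abs()

# Walk each feature pair; flag one member of any pair above the threshold,
# dropping the feature with the higher total correlation to all others
threshold = 0.9
excluded = set()
for i, a in enumerate(features):
    for b in features[i + 1:]:
        if a in excluded or b in excluded:
            continue
        if corr.loc[a, b] > threshold:
            excluded.add(a if corr[a].sum() >= corr[b].sum() else b)

excluded_features = sorted(excluded)
```

Here only `Cells_a`/`Cells_b` exceed the threshold, so one of the pair is flagged for exclusion while `Nuclei_c` is kept.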
- pycytominer.operations.correlation_threshold.determine_high_cor_pair(correlation_row, sorted_correlation_pairs)¶
Select highest correlated variable given a correlation row with columns: [“pair_a”, “pair_b”, “correlation”]. For use in a pandas.apply().
- Parameters:
correlation_row (pandas.core.series.Series) – Pandas series for a feature pair in the pairwise_df, with fields “pair_a”, “pair_b”, and “correlation”
sorted_correlation_pairs (pandas.DataFrame.index) – An index of features sorted by total correlation sum with all other features
- Return type:
The feature that has a lower total correlation sum with all other features
pycytominer.operations.get_na_columns module¶
Remove variables exceeding a specified threshold of NA values. Note: this was called drop_na_columns in cytominer for R.
- pycytominer.operations.get_na_columns.get_na_columns(population_df, features='infer', samples='all', cutoff=0.05)¶
Get features whose proportion of NA values exceeds the defined cutoff
- Parameters:
population_df (pandas.core.frame.DataFrame) – DataFrame that includes metadata and observation features.
features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.
samples (str, default "all") – Samples to use for the calculation, given as a pd.DataFrame.query() string, e.g. “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples.
cutoff (float, default 0.05) – Exclude features whose proportion of NA values exceeds this cutoff.
- Returns:
excluded_features – List of features to exclude from the population_df.
- Return type:
list of str
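The selection described above reduces to a per-column missingness check, which can be sketched with pandas alone (a minimal illustration with an invented toy DataFrame, not pycytominer's actual code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Metadata_plate": ["p1", "p1", "p2", "p2"],
    "Cells_a": [1.0, np.nan, np.nan, 4.0],  # 50% missing
    "Nuclei_b": [1.0, 2.0, 3.0, np.nan],    # 25% missing
    "Cytoplasm_c": [1.0, 2.0, 3.0, 4.0],    # complete
})

# Infer CellProfiler features by prefix, then flag columns whose
# fraction of NA values exceeds the cutoff
features = [c for c in df.columns if c.startswith(("Cells", "Nuclei", "Cytoplasm"))]
cutoff = 0.3
na_fraction = df[features].isna().mean()
excluded_features = na_fraction[na_fraction > cutoff].index.tolist()
```

With a cutoff of 0.3, only the column with 50% missingness is flagged.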
pycytominer.operations.transform module¶
Transform observation variables by specified groups.
- class pycytominer.operations.transform.RobustMAD(epsilon=1e-18)¶
Bases: BaseEstimator, TransformerMixin
Class to perform a “robust” normalization with respect to the median and MAD (median absolute deviation):
scaled = (x - median) / mad
- epsilon¶
fudge factor parameter
- Type:
float
- fit(X, y=None)¶
Compute the median and mad to be used for later scaling.
- Parameters:
X (pandas.core.frame.DataFrame) – dataframe to fit RobustMAD transform
- Returns:
With computed median and mad attributes
- Return type:
self
- set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') → RobustMAD¶
Request metadata passed to the transform method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to transform.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note: This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
- Parameters:
copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for the copy parameter in transform.
- Returns:
self – The updated object.
- Return type:
object
- transform(X, copy=None)¶
Apply the RobustMAD calculation
- Parameters:
X (pandas.core.frame.DataFrame) – dataframe to apply the RobustMAD transform to
- Returns:
RobustMAD transformed dataframe
- Return type:
pandas.core.frame.DataFrame
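The scaled = (x - median) / mad calculation above can be sketched directly in pandas. Note this is an illustration with an invented toy DataFrame: it uses the raw MAD, whereas the library's implementation may apply a normal-consistency scaling to the MAD, and the epsilon fudge factor guards against division by zero:

```python
import pandas as pd

df = pd.DataFrame({
    "Cells_a": [1.0, 2.0, 3.0, 4.0, 100.0],   # contains one large outlier
    "Nuclei_b": [10.0, 10.0, 12.0, 14.0, 14.0],
})

epsilon = 1e-18

# Per-column median and raw median absolute deviation
median = df.median()
mad = (df - median).abs().median()

# Robust scaling: outliers barely influence the location/scale estimates
scaled = (df - median) / (mad + epsilon)
```

Unlike mean/std standardization, the outlier 100.0 does not distort the scaling of the inlier values, which is the point of the "robust" variant.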
- class pycytominer.operations.transform.Spherize(epsilon=1e-06, center=True, method='ZCA', return_numpy=False)¶
Bases: BaseEstimator, TransformerMixin
Class to apply a sphering transform (aka whitening) to data using the base sklearn transform API. Note, this implementation is modified from or inspired by the following sources:
1. A custom function written by Juan C. Caicedo
2. A custom ZCA function at https://github.com/mwv/zca
3. Notes from Niranj Chandrasekaran (https://github.com/cytomining/pycytominer/issues/90)
4. The R package “whitening” written by Strimmer et al. (http://strimmerlab.org/software/whitening/)
5. Kessy et al. 2016, “Optimal Whitening and Decorrelation” [1]
- epsilon¶
fudge factor parameter
- Type:
float
- center¶
option to center the input X matrix
- Type:
bool
- method¶
a string indicating which class of sphering to perform
- Type:
str
- fit(X, y=None)¶
Identify the sphering transform given the input data X
- Parameters:
X (pandas.core.frame.DataFrame) – dataframe to fit sphering transform
- Returns:
With computed weights attribute
- Return type:
self
- transform(X, y=None)¶
Perform the sphering transform
- Parameters:
X (pandas.core.frame.DataFrame) – Profile dataframe to be transformed using the precomputed weights
y (None) – Has no effect; only used for consistency in sklearn transform API
- Returns:
Spherized dataframe
- Return type:
pandas.core.frame.DataFrame
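The ZCA variant of the sphering transform referenced above can be sketched with numpy. This is a hedged illustration of the general technique (eigendecomposition of the covariance plus an epsilon fudge factor), not a reproduction of the class's exact internals, and the data is randomly generated for the example:

```python
import numpy as np

# Correlated toy data: 500 samples, 3 features mixed by a fixed matrix
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.7]])
X = rng.normal(size=(500, 3)) @ mixing

# Center, then build the ZCA whitening matrix W = V diag(1/sqrt(s + eps)) V^T
# from the eigendecomposition of the covariance
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
s, V = np.linalg.eigh(cov)
eps = 1e-6
W = V @ np.diag(1.0 / np.sqrt(s + eps)) @ V.T

# Spherized data: features become (approximately) uncorrelated with unit variance
Z = Xc @ W
```

Among whitening variants, ZCA is often preferred for profiling data because its symmetric transform keeps the output as close as possible to the original feature space.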
pycytominer.operations.variance_threshold module¶
Remove variables with near-zero variance. Modified from caret::nearZeroVar()
- pycytominer.operations.variance_threshold.calculate_frequency(feature_column, freq_cut)¶
Calculate the frequency ratio of the second most common feature value to the most common value. Used in pandas.apply()
- Parameters:
feature_column (pandas.core.series.Series) – Pandas series of the specific feature in the population_df
freq_cut (float, default 0.05) – Ratio of the second most common feature value to the most common value. Must be between 0 and 1. Features whose ratio falls below freq_cut are removed; a low freq_cut removes only features heavily dominated by a single value (e.g. [1, 1, 1, 1, 0.01, 0.01, …]).
- Return type:
Feature name if it passes threshold, “NA” otherwise
- pycytominer.operations.variance_threshold.variance_threshold(population_df, features='infer', samples='all', freq_cut=0.05, unique_cut=0.01)¶
Exclude features that have low variance (low information content)
- Parameters:
population_df (pandas.core.frame.DataFrame) – DataFrame that includes metadata and observation features.
features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.
samples (str, default "all") – Samples to use for the calculation, given as a pd.DataFrame.query() string, e.g. “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples.
freq_cut (float, default 0.05) – Ratio of the second most common feature value to the most common value. Must be between 0 and 1. Features whose ratio falls below freq_cut are removed; a low freq_cut removes only features heavily dominated by a single value (e.g. [1, 1, 1, 1, 0.01, 0.01, …]).
unique_cut (float, default 0.01) – Ratio of the number of unique feature values to the number of samples. Must be between 0 and 1. Features whose ratio falls below unique_cut are removed; a low unique_cut removes features with very few distinct measurements relative to the number of samples.
- Returns:
excluded_features – List of features to exclude from the population_df.
- Return type:
list of str
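Both cutoffs above can be sketched as per-column checks in pandas (a minimal illustration with an invented toy DataFrame, not pycytominer's actual implementation; constant columns are excluded outright here since their frequency ratio is undefined):

```python
import pandas as pd

n = 25
df = pd.DataFrame({
    "Cells_a": [1.0] * 24 + [0.01],                # dominated by one value: ratio 1/24
    "Cells_const": [5.0] * n,                      # zero variance
    "Nuclei_b": [1.0, 2.0] * 12 + [1.0],           # balanced: ratio 12/13
    "Cytoplasm_c": [float(i) for i in range(n)],   # all values unique
})

freq_cut, unique_cut = 0.05, 0.01
excluded_features = []
for col in df.columns:
    counts = df[col].value_counts()
    if len(counts) == 1:          # constant column carries no information
        excluded_features.append(col)
        continue
    # Frequency ratio: 2nd most common value count / most common value count
    freq_ratio = counts.iloc[1] / counts.iloc[0]
    # Unique ratio: number of distinct values / number of samples
    unique_ratio = df[col].nunique() / len(df)
    if freq_ratio < freq_cut or unique_ratio < unique_cut:
        excluded_features.append(col)
```

With the default cutoffs, the single-value-dominated column and the constant column are flagged, while the balanced and all-unique columns pass.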