Operations

We do not recommend interacting with these functions directly. The core Pycytominer API uses these operations internally.

pycytominer.operations.correlation_threshold module

Returns list of features such that no two features have a correlation greater than a specified threshold

pycytominer.operations.correlation_threshold.correlation_threshold(population_df, features='infer', samples='all', threshold=0.9, method='pearson')

Exclude features that have correlations above a certain threshold

Parameters:
  • population_df (pandas.core.frame.DataFrame) – DataFrame that includes metadata and observation features.

  • features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.

  • samples (str, default "all") – A query string used to subset the samples before performing the operation. The function uses pd.DataFrame.query(), so you should structure the string in this fashion. An example is “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples in the calculation.

  • threshold (float, default 0.9) – Must be between (0, 1). Features with a pairwise correlation greater than this threshold are excluded.

  • method (str, default "pearson") – A string indicating which correlation metric to use to test the cutoff.

Returns:

excluded_features – List of features to exclude from the population_df.

Return type:

list of str
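For reference, the core of the operation can be sketched with plain pandas: compute the absolute pairwise correlation matrix, then flag one member of each pair whose correlation exceeds the threshold. This is a simplified illustration of the idea, not pycytominer’s exact exclusion logic, and the toy column names are invented:

```python
import numpy as np
import pandas as pd

# Toy profile: Cells_a and Cells_b are nearly identical, Cells_c is independent
df = pd.DataFrame({
    "Cells_a": [1.0, 2.0, 3.0, 4.0],
    "Cells_b": [1.1, 2.0, 3.1, 4.0],   # highly correlated with Cells_a
    "Cells_c": [4.0, 1.0, 3.0, 2.0],
})

corr = df.corr(method="pearson").abs()
# Keep the upper triangle only, so each feature pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
excluded = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(excluded)  # ["Cells_b"]: one member of the highly correlated pair
```

In the library, the choice of which member of a pair to drop is made by determine_high_cor_pair (documented below) using each feature’s total correlation with all other features.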

pycytominer.operations.correlation_threshold.determine_high_cor_pair(correlation_row, sorted_correlation_pairs)

Select highest correlated variable given a correlation row with columns: [“pair_a”, “pair_b”, “correlation”]. For use in a pandas.apply().

Parameters:
  • correlation_row (pandas.core.series.Series) – Pandas series for a specific feature pair in the pairwise_df, with entries “pair_a”, “pair_b”, and “correlation”

  • sorted_correlation_pairs (pandas.DataFrame.index) – A feature index sorted by total correlation sum with all other features

Returns:

The feature that has a lower total correlation sum with all other features

Return type:

str

pycytominer.operations.get_na_columns module

Remove variables that exceed a specified threshold of NA values. Note: this operation was called drop_na_columns in cytominer for R.

pycytominer.operations.get_na_columns.get_na_columns(population_df, features='infer', samples='all', cutoff=0.05)

Get features that have a higher proportion of NA values than the defined cutoff

Parameters:
  • population_df (pandas.core.frame.DataFrame) – DataFrame that includes metadata and observation features.

  • features (list, default "infer") – A list of strings corresponding to feature measurement column names in the profiles DataFrame. All features listed must be found in profiles. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.

  • samples (str, default "all") – A query string used to subset the samples before performing the operation. The function uses pd.DataFrame.query(), so you should structure the string in this fashion. An example is “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples in the calculation.

  • cutoff (float, default 0.05) – Exclude features whose proportion of missing values exceeds this cutoff

Returns:

excluded_features – List of features to exclude from the population_df.

Return type:

list of str
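The underlying check is a per-column missingness proportion compared against the cutoff. A minimal pandas sketch of that idea (illustrative column names, not the library’s internal code):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Cells_area": [1.0, np.nan, 3.0, np.nan],     # 50% missing
    "Nuclei_size": [2.0, 2.5, np.nan, 3.0],       # 25% missing
    "Cytoplasm_intensity": [0.1, 0.2, 0.3, 0.4],  # complete
})

cutoff = 0.3
na_frac = df.isna().mean()  # proportion of NA values per feature
excluded = na_frac[na_frac > cutoff].index.tolist()
print(excluded)  # ["Cells_area"]
```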

pycytominer.operations.transform module

Transform observation variables by specified groups.


class pycytominer.operations.transform.RobustMAD(epsilon=1e-18)

Bases: BaseEstimator, TransformerMixin

Class to perform a “robust” normalization with respect to the median and MAD (median absolute deviation)

scaled = (x - median) / mad

epsilon

fudge factor parameter

Type:

float

fit(X, y=None)

Compute the median and mad to be used for later scaling.

Parameters:

X (pandas.core.frame.DataFrame) – dataframe used to fit the RobustMAD transform

Returns:

With computed median and mad attributes

Return type:

self

set_transform_request(*, copy: bool | None | str = '$UNCHANGED$') RobustMAD

Request metadata passed to the transform method.

Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Note

This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.

Parameters:

copy (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for copy parameter in transform.

Returns:

self – The updated object.

Return type:

object

transform(X, copy=None)

Apply the RobustMAD calculation

Parameters:

X (pandas.core.frame.DataFrame) – dataframe to transform using the fitted median and MAD

Returns:

RobustMAD transformed dataframe

Return type:

pandas.core.frame.DataFrame
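The scaled = (x - median) / mad formula can be reproduced directly in pandas to see its effect on an outlier-heavy column. This is a sketch of the documented formula only, assuming epsilon simply guards against division by a zero MAD; the real class may differ in details such as NaN handling:

```python
import pandas as pd

X = pd.DataFrame({"feat": [1.0, 2.0, 3.0, 4.0, 100.0]})

epsilon = 1e-18                          # fudge factor, per the documented default
median = X.median()
mad = (X - median).abs().median()        # raw median absolute deviation
scaled = (X - median) / (mad + epsilon)  # scaled = (x - median) / mad
print(scaled["feat"].tolist())           # bulk of the data lands near 0; the outlier stays extreme
```

Because the median and MAD ignore the magnitude of extreme values, the outlier (100.0) does not distort the scale of the well-behaved points, unlike mean/standard-deviation scaling.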

class pycytominer.operations.transform.Spherize(epsilon=1e-06, center=True, method='ZCA', return_numpy=False)

Bases: BaseEstimator, TransformerMixin

Class to apply a sphering transform (aka whitening) to data using the base sklearn transform API. Note, this implementation is modified from / inspired by the following sources:

  • A custom function written by Juan C. Caicedo

  • A custom ZCA function at https://github.com/mwv/zca

  • Notes from Niranj Chandrasekaran (https://github.com/cytomining/pycytominer/issues/90)

  • The R package “whitening” written by Strimmer et al. (http://strimmerlab.org/software/whitening/)

  • Kessy et al. 2016, “Optimal Whitening and Decorrelation” [1]

epsilon

fudge factor parameter

Type:

float

center

option to center the input X matrix

Type:

bool

method

a string indicating which class of sphering to perform

Type:

str

fit(X, y=None)

Identify the sphering transform given the input X

Parameters:

X (pandas.core.frame.DataFrame) – dataframe to fit sphering transform

Returns:

With computed weights attribute

Return type:

self

transform(X, y=None)

Perform the sphering transform

Parameters:
  • X (pd.core.frame.DataFrame) – Profile dataframe to be transformed using the precomputed weights

  • y (None) – Has no effect; only used for consistency in sklearn transform API

Returns:

Spherized dataframe

Return type:

pandas.core.frame.DataFrame
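As an illustration of what the ZCA method computes, sphering can be written in a few lines of numpy: center the data, eigendecompose the covariance, and rescale along the eigenvectors so the output covariance becomes the identity. This is a conceptual sketch with made-up data, not the class’s internals:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2-D data: standard normal samples mixed through a fixed matrix
X = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
Xc = X - X.mean(axis=0)  # center (the class's center=True option)

epsilon = 1e-6  # fudge factor, per the documented default
cov = np.cov(Xc, rowvar=False)
vals, vecs = np.linalg.eigh(cov)
# ZCA whitening matrix: W = U diag(1/sqrt(lambda + eps)) U^T
W = vecs @ np.diag(1.0 / np.sqrt(vals + epsilon)) @ vecs.T
Z = Xc @ W
# After sphering, the sample covariance is (approximately) the identity
print(np.round(np.cov(Z, rowvar=False), 2))
```

Among whitening variants, ZCA is the one whose output stays closest to the original data, which is why it is a common default for profiling data.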

pycytominer.operations.variance_threshold module

Remove variables with near-zero variance. Modified from caret::nearZeroVar()

pycytominer.operations.variance_threshold.calculate_frequency(feature_column, freq_cut)

Calculate the ratio of the count of the second most common feature value to that of the most common value. Used in pandas.apply()

Parameters:
  • feature_column (pandas.core.series.Series) – Pandas series of the specific feature in the population_df

  • freq_cut (float, default 0.05) – Ratio (count of 2nd most common feature value / count of most common). Must range between 0 and 1. Features with a ratio lower than freq_cut are removed. A low freq_cut removes only features in which the most common value is far more frequent than the second most common value. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, …])

Return type:

Feature name if it passes threshold, “NA” otherwise

pycytominer.operations.variance_threshold.variance_threshold(population_df, features='infer', samples='all', freq_cut=0.05, unique_cut=0.01)

Exclude features that have low variance (low information content)

Parameters:
  • population_df (pandas.core.frame.DataFrame) – DataFrame that includes metadata and observation features.

  • features (list, default "infer") – A list of strings corresponding to feature measurement column names in the population_df DataFrame. All features listed must be found in population_df. Defaults to “infer”. If “infer”, then assume CellProfiler features are those prefixed with “Cells”, “Nuclei”, or “Cytoplasm”.

  • samples (str, default "all") – A query string used to subset the samples before performing the operation. The function uses pd.DataFrame.query(), so you should structure the string in this fashion. An example is “Metadata_treatment == ‘control’” (include all quotes). If “all”, use all samples in the calculation.

  • freq_cut (float, default 0.05) – Ratio (count of 2nd most common feature value / count of most common). Must range between 0 and 1. Features with a ratio lower than freq_cut are removed. A low freq_cut removes only features in which the most common value is far more frequent than the second most common value. (e.g. this will remove a feature: [1, 1, 1, 1, 0.01, 0.01, …])

  • unique_cut (float, default 0.01) – Ratio (number of unique feature values / number of samples). Must range between 0 and 1. Features with a ratio lower than unique_cut are removed. A low unique_cut removes features that have very few distinct measurements relative to the number of samples.

Returns:

excluded_features – List of features to exclude from the population_df.

Return type:

list of str
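Both cutoffs are simple ratios that can be checked by hand. The sketch below applies the documented freq_cut and unique_cut rules with plain pandas (illustrative data and column names; pycytominer’s implementation differs in structure):

```python
import pandas as pd

# "Cells_flat" is dominated by a single value; "Cells_varied" is informative
df = pd.DataFrame({
    "Cells_flat": [1.0] * 99 + [0.01],
    "Cells_varied": [float(i) for i in range(100)],
})

freq_cut, unique_cut = 0.05, 0.01
excluded = []
for col in df.columns:
    counts = df[col].value_counts()
    # Ratio of the 2nd most common value's count to the most common value's count
    freq_ratio = counts.iloc[1] / counts.iloc[0] if len(counts) > 1 else 0.0
    unique_ratio = df[col].nunique() / len(df)
    if freq_ratio < freq_cut or unique_ratio < unique_cut:
        excluded.append(col)
print(excluded)  # ["Cells_flat"]: 2nd-most/most = 1/99 ≈ 0.01 < 0.05
```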
