slideflow.DatasetFeatures¶
- class DatasetFeatures(model: str | tf.keras.models.Model | torch.nn.Module, dataset: sf.Dataset, *, labels: Dict[str, str] | Dict[str, int] | Dict[str, List[float]] | None = None, cache: str | None = None, annotations: Dict[str, str] | Dict[str, int] | Dict[str, List[float]] | None = None, **kwargs: Any)¶
Loads annotations, saved layer activations / features, and prepares output saving directories. Will also read/write processed features to a PKL cache file to save time in future iterations.
Note
Storing predictions along with layer features is optional, which lets the user reduce the memory footprint. For example, saving predictions for a 10,000-slide dataset with 1000 categorical outcomes, at roughly 3000 tiles per slide, would require:
4 bytes/float32-logit * 1000 predictions/tile * 3000 tiles/slide * 10000 slides = 1.2e11 bytes ~= 112 GiB (120 GB)
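The estimate above is plain arithmetic and can be checked directly. The sketch below is not part of slideflow: the function name `prediction_memory_bytes` is hypothetical, and the 3000 tiles per slide is simply the figure assumed in the note.

```python
def prediction_memory_bytes(n_slides, tiles_per_slide, n_outcomes, bytes_per_value=4):
    """Bytes needed to hold one float32 prediction per outcome for every tile."""
    return bytes_per_value * n_outcomes * tiles_per_slide * n_slides


total = prediction_memory_bytes(n_slides=10_000, tiles_per_slide=3_000, n_outcomes=1_000)
print(f"{total / 1e9:.0f} GB")     # decimal gigabytes
print(f"{total / 2**30:.1f} GiB")  # binary gibibytes
```

The two printed figures differ only in the unit convention (powers of 10 versus powers of 2).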
Calculate features / layer activations from model, storing to internal parameters self.activations, self.predictions, and self.locations: dictionaries mapping slides to arrays of activations, predictions, and locations for each slide’s constituent tiles.
- Parameters:
model (str, tf.keras.models.Model, or torch.nn.Module) – Model, or path to a saved model, from which to calculate activations.
dataset (slideflow.Dataset) – Dataset from which to generate activations.
labels (dict, optional) – Dict mapping slide names to outcome categories.
cache (str, optional) – File for PKL cache.
- Keyword Arguments:
augment (bool, str, optional) – Whether to use data augmentation during feature extraction. If True, will use default augmentation. If str, will use augmentation specified by the string. Defaults to None.
batch_size (int) – Batch size for activations calculations. Defaults to 32.
device (str, optional) – Device to use for feature extraction. Only used for PyTorch feature extractors. Defaults to None.
include_preds (bool) – Calculate and store predictions. Defaults to True.
include_uncertainty (bool, optional) – Whether to include model uncertainty in the output. Only used if the feature generator is a UQ-enabled model. Defaults to True.
layers (str, list(str)) – Layers to extract features from. May be the name of a single layer (str) or a list of layers (list). Only used if model is a str. Defaults to ‘postconv’.
normalizer ((str or slideflow.norm.StainNormalizer), optional) – Stain normalization strategy to use on image tiles prior to feature extraction. This argument is invalid if model is a feature extractor built from a trained model, as stain normalization will be specified by the model configuration. Defaults to None.
normalizer_source (str, optional) – Stain normalization preset or path to a source image. Valid presets include ‘v1’, ‘v2’, and ‘v3’. If None, will use the default preset (‘v3’). This argument is invalid if model is a feature extractor built from a trained model. Defaults to None.
num_workers (int, optional) – Number of workers to use for feature extraction. Only used for PyTorch feature extractors. Defaults to None.
pool_sort (bool) – Use multiprocessing pools to perform final sorting. Defaults to True.
progress (bool) – Show a progress bar during feature calculation. Defaults to True.
transform (Callable, optional) – Custom transform to apply to images. Applied before standardization. If the feature extractor is a PyTorch model, the transform should be a torchvision transform.
verbose (bool) – Show verbose logging output. Defaults to True.
- Examples
Calculate features using a feature extractor.
    import slideflow as sf

    # Create a feature extractor
    retccl = sf.build_feature_extractor('retccl', resize=True)

    # Load a dataset
    P = sf.load_project(...)
    dataset = P.dataset(...)

    # Calculate features
    dts_ftrs = sf.DatasetFeatures(retccl, dataset)
Calculate features using a trained model (preferred).
    import slideflow as sf

    # Create a feature extractor from the saved model.
    extractor = sf.build_feature_extractor(
        '/path/to/trained_model.zip',
        layers=['postconv']
    )

    # Calculate features across the dataset
    dts_ftrs = sf.DatasetFeatures(extractor, dataset)
Calculate features using a trained model (legacy).
    # This method is deprecated, and will be removed in a
    # future release. Please use the method above instead.
    dts_ftrs = sf.DatasetFeatures(
        '/path/to/trained_model.zip',
        dataset=dataset,
        layers=['postconv']
    )
Calculate features from a loaded model.
    import tensorflow as tf
    import slideflow as sf

    # Load a model
    model = tf.keras.models.load_model('/path/to/model.h5')

    # Calculate features (positional arguments must precede keyword arguments)
    dts_ftrs = sf.DatasetFeatures(
        model,
        dataset,
        layers=['postconv']
    )
Methods¶
- activations_by_category(self, idx: int) Dict[str | int | List[float], ndarray] ¶
For each outcome category, calculates activations of a given feature across all tiles in the category. Requires annotations to have been provided.
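As a rough illustration of the grouping this method describes, the sketch below collects the values of one feature across all tiles, keyed by outcome category. It uses plain dictionaries and lists as stand-ins for the internal storage. The names `activations`, `labels`, and `group_feature_by_category` are hypothetical, and the code makes no claim about how slideflow implements the method.

```python
from collections import defaultdict


def group_feature_by_category(activations, labels, idx):
    """Collect the values of feature `idx` for every tile, grouped by slide label.

    activations: dict mapping slide name -> list of per-tile feature vectors
    labels:      dict mapping slide name -> outcome category
    """
    grouped = defaultdict(list)
    for slide, tile_vectors in activations.items():
        if slide not in labels:
            continue  # slides without a label cannot be assigned to a category
        for vector in tile_vectors:
            grouped[labels[slide]].append(vector[idx])
    return dict(grouped)


# Toy data: three slides, two features per tile.
activations = {
    "slide_a": [[0.1, 0.9], [0.2, 0.8]],
    "slide_b": [[0.7, 0.3]],
    "slide_c": [[0.4, 0.6], [0.5, 0.5]],
}
labels = {"slide_a": "tumor", "slide_b": "normal", "slide_c": "tumor"}

by_category = group_feature_by_category(activations, labels, idx=0)
```

Here `by_category` holds one list of feature-0 values per category, pooled across every slide carrying that label.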
- box_plots(self, features: List[int], outdir: str) None ¶
Generates plots comparing node activations at slide- and tile-level.
- concat(args: Iterable[DatasetFeatures]) DatasetFeatures ¶
Concatenate activations from multiple DatasetFeatures together.
For example, if df1 is a DatasetFeatures object with 2048 features and df2 is a DatasetFeatures object with 1024 features, then sf.DatasetFeatures.concat([df1, df2]) would return an object with 3072 features.
Vectors from DatasetFeatures objects are concatenated in the given order. During concatenation, predictions and uncertainty are dropped.
Tiles that do not have calculated features in every DatasetFeatures object being concatenated are dropped.
- Parameters:
args (Iterable[DatasetFeatures]) – DatasetFeatures objects to concatenate.
- Returns:
DatasetFeatures object with concatenated features.
- Return type:
DatasetFeatures
- Examples
Concatenate two DatasetFeatures objects.
    >>> df1 = DatasetFeatures(model, dataset, layers='postconv')
    >>> df2 = DatasetFeatures(model, dataset, layers='sepconv_3')
    >>> df = DatasetFeatures.concat([df1, df2])
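To make the ordering and tile-dropping behavior described for concat concrete, here is a minimal sketch over toy data. It assumes a simplified representation (slide name mapped to a dict of tile location to feature vector) that is hypothetical and chosen only for illustration. `concat_features` is not a slideflow function.

```python
def concat_features(feature_sets):
    """Concatenate per-tile vectors, keeping only tiles present in every set.

    Each element of `feature_sets` maps slide name -> {tile location -> vector}.
    """
    first, *rest = feature_sets
    merged = {}
    for slide, tiles in first.items():
        if not all(slide in other for other in rest):
            continue  # slide missing from at least one input
        merged_tiles = {}
        for loc, vector in tiles.items():
            if all(loc in other[slide] for other in rest):
                # Vectors are joined in the order the inputs were given.
                merged_tiles[loc] = list(vector) + [
                    value for other in rest for value in other[slide][loc]
                ]
        merged[slide] = merged_tiles
    return merged


df1 = {"slide_a": {(0, 0): [1.0, 2.0], (0, 1): [3.0, 4.0]}}
df2 = {"slide_a": {(0, 0): [9.0]}}  # tile (0, 1) has no features here

combined = concat_features([df1, df2])
```

In this toy case the tile at (0, 1) is dropped because only one input has features for it, and the surviving vector has 2 + 1 = 3 entries.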
- from_df(df: DataFrame) DatasetFeatures ¶
Load DataFrame of features, as exported by DatasetFeatures.to_df()
- Parameters:
df (pandas.DataFrame) – DataFrame of features, as exported by DatasetFeatures.to_df()
- Returns:
DatasetFeatures object
- Return type:
DatasetFeatures
- Examples
Recreate DatasetFeatures after export to a DataFrame.
    >>> df = features.to_df()
    >>> new_features = DatasetFeatures.from_df(df)
- load_cache(self, path: str)¶
Load cached activations from PKL.
- Parameters:
path (str) – Path to pkl cache.
- map_activations(self, **kwargs) SlideMap ¶
Map activations with UMAP.
- Keyword Arguments:
... –
- Returns:
sf.SlideMap
- map_predictions(self, x: int = 0, y: int = 0, **kwargs) SlideMap ¶
Map tile predictions onto x/y coordinate space.
- Parameters:
- Keyword Arguments:
cache (str, optional) – Path to parquet file to cache coordinates. Defaults to None (caching disabled).
- Returns:
sf.SlideMap
- merge(self, df: DatasetFeatures) None ¶
Merges with another DatasetFeatures.
- Parameters:
df (slideflow.DatasetFeatures) – Target DatasetFeatures to merge with.
- Returns:
None
- save_cache(self, path: str)¶
Cache calculated activations to file.
- Parameters:
path (str) – Path to pkl.
- save_example_tiles(self, features: List[int], outdir: str, slides: List[str] | None = None, tiles_per_feature: int = 100) None ¶
For a set of activation features, saves image tiles named according to their corresponding activations.
Duplicate image tiles will be saved for each feature, organized into subfolders named according to feature.
- Parameters:
outdir (str) – Path to folder in which to save example tiles.
slides (list, optional) – List of slide names. If provided, will only include tiles from these slides. Defaults to None.
tiles_per_feature (int, optional) – Number of tiles to include as examples for each feature. Defaults to 100. Will evenly sample this many tiles across the activation gradient.
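One plausible reading of "evenly sample this many tiles across the activation gradient" is to sort tiles by activation and pick evenly spaced positions from lowest to highest. The sketch below implements that reading only. It is an assumption about the sampling scheme, not slideflow's code, and `sample_across_gradient` is a hypothetical helper.

```python
def sample_across_gradient(values, n):
    """Return indices of up to `n` items spread evenly over the sorted values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    if n >= len(order):
        return order
    if n == 1:
        return [order[len(order) // 2]]
    # Evenly spaced positions from the lowest to the highest activation.
    step = (len(order) - 1) / (n - 1)
    return [order[round(k * step)] for k in range(n)]


tile_activations = [0.9, 0.1, 0.5, 0.3, 0.7]
picked = sample_across_gradient(tile_activations, n=3)
picked_values = [tile_activations[i] for i in picked]
```

With five tiles and n=3, the picked tiles are the lowest, the middle, and the highest by activation.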
- softmax_mean(self) Dict[str, ndarray] ¶
Calculates the mean prediction vector (post-softmax) across all tiles in each slide.
- Returns:
Dictionary mapping slides to the mean (post-softmax) prediction array across all tiles in each slide.
- Return type:
dict
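The per-slide averaging described for softmax_mean can be sketched with plain lists. The input layout (slide name mapped to a list of per-tile prediction vectors) and the helper name `mean_prediction_per_slide` are assumptions made for this illustration.

```python
def mean_prediction_per_slide(predictions):
    """Average the per-tile prediction vectors of each slide, element-wise."""
    means = {}
    for slide, tile_preds in predictions.items():
        n_tiles = len(tile_preds)
        # zip(*tile_preds) iterates over categories rather than tiles.
        means[slide] = [sum(column) / n_tiles for column in zip(*tile_preds)]
    return means


predictions = {
    "slide_a": [[0.75, 0.25], [0.25, 0.75]],
    "slide_b": [[1.0, 0.0]],
}
slide_means = mean_prediction_per_slide(predictions)
```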
- softmax_percent(self, prediction_filter: List[int] | None = None) Dict[str, ndarray] ¶
Returns dictionary mapping slides to a vector of length num_classes with the percent of tiles in each slide predicted to be each outcome.
- Parameters:
prediction_filter – (optional) List of int. If provided, will restrict predictions to only these categories, with final prediction being based on highest logit among these categories.
- Returns:
Dictionary mapping slides to an array of percentages, one per outcome category, of length num_classes.
- Return type:
dict
- softmax_predict(self, prediction_filter: List[int] | None = None) Dict[str, int] ¶
Returns slide-level predictions, assuming the model is predicting a categorical outcome, by generating a prediction for each individual tile, and making a slide-level prediction by finding the most frequently predicted outcome among its constituent tiles.
- Parameters:
prediction_filter – (optional) List of int. If provided, will restrict predictions to only these categories, with final prediction based on highest logit among these categories.
- Returns:
Dictionary mapping slide names to slide-level predictions.
- Return type:
dict
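The tile-voting logic described for softmax_percent and softmax_predict, including the effect of prediction_filter, can be sketched for a single slide as follows. The helper names are hypothetical, the sketch returns fractions rather than percentages, and tie-breaking between equally frequent categories is left to `Counter.most_common`, which may differ from slideflow's behavior.

```python
from collections import Counter


def tile_calls(tile_preds, prediction_filter=None):
    """Predicted category for each tile: the highest-scoring allowed category."""
    n_classes = len(tile_preds[0])
    allowed = list(range(n_classes)) if prediction_filter is None else prediction_filter
    return [max(allowed, key=lambda c: pred[c]) for pred in tile_preds]


def fraction_per_category(tile_preds, prediction_filter=None):
    """Fraction of tiles called as each category (vector of length num_classes)."""
    calls = Counter(tile_calls(tile_preds, prediction_filter))
    return [calls[c] / len(tile_preds) for c in range(len(tile_preds[0]))]


def slide_call(tile_preds, prediction_filter=None):
    """Slide-level prediction: the most frequently called category."""
    return Counter(tile_calls(tile_preds, prediction_filter)).most_common(1)[0][0]


# Four tiles from one slide, three outcome categories.
tiles = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]

unfiltered = fraction_per_category(tiles)
filtered = fraction_per_category(tiles, prediction_filter=[1, 2])
```

Restricting the filter to categories 1 and 2 reassigns the two tiles that were originally called as category 0, which changes both the per-category fractions and the slide-level call.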
- stats(self, outdir: str | None = None, method: str = 'mean', threshold: float = 0.5) Tuple[Dict[int, Dict[str, float]], Dict[int, Dict[str, float]], List[ndarray]] ¶
Calculates activation averages across categories, as well as tile-level and patient-level statistics, using ANOVA, exporting to CSV if desired.
- Parameters:
outdir (str, optional) – Path to directory in which CSV file will be saved. Defaults to None.
method (str, optional) – Method of aggregating tile-level data into slide-level data. Either ‘mean’ or ‘threshold’. If ‘mean’, slide-level feature data is calculated by averaging feature activations across all tiles. If ‘threshold’, slide-level feature data is calculated by counting the number of tiles with feature activations > threshold and dividing by the total number of tiles. Defaults to ‘mean’.
threshold (float, optional) – Activation threshold to use with the ‘threshold’ method. Defaults to 0.5.
- Returns:
A tuple containing
dict: Dict mapping slides to dict of slide-level features;
dict: Dict mapping features to tile-level statistics (‘p’, ‘f’);
dict: Dict mapping features to slide-level statistics (‘p’, ‘f’);
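The two aggregation methods described for the method parameter can be illustrated for a single feature on a single slide. This is a sketch of the documented definitions only: `slide_level_feature` is a hypothetical helper, and the ANOVA step is not shown.

```python
def slide_level_feature(tile_values, method="mean", threshold=0.5):
    """Summarize one feature's tile-level activations for a single slide."""
    if method == "mean":
        return sum(tile_values) / len(tile_values)
    if method == "threshold":
        # Fraction of tiles whose activation strictly exceeds the threshold.
        return sum(v > threshold for v in tile_values) / len(tile_values)
    raise ValueError(f"Unknown method: {method!r}")


tile_values = [0.25, 0.25, 0.5, 1.0]
mean_value = slide_level_feature(tile_values, method="mean")
frac_above = slide_level_feature(tile_values, method="threshold", threshold=0.5)
```

Note that the comparison is strict, so the tile with activation exactly 0.5 is not counted under the ‘threshold’ method.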
- to_csv(self, filename: str, level: str = 'tile', method: str = 'mean', slides: List[str] | None = None)¶
Exports calculated activations to csv.
- Parameters:
filename (str) – Path to CSV file for export.
level (str) – ‘tile’ or ‘slide’. Indicates whether tile or slide-level activations are saved. Defaults to ‘tile’.
method (str) – Method of summarizing slide-level results. Either ‘mean’ or ‘median’. Defaults to ‘mean’.
slides (list(str)) – Slides to export. If None, exports all slides. Defaults to None.
- to_df(self) DataFrame ¶
Export activations, predictions, uncertainty, and locations to a pandas DataFrame.
- Returns:
DataFrame with columns ‘activations’, ‘predictions’, ‘uncertainty’, and ‘locations’.
- Return type:
pd.core.frame.DataFrame