slideflow.DatasetFeatures

class DatasetFeatures(model: str | tf.keras.models.Model | torch.nn.Module, dataset: sf.Dataset, *, labels: Dict[str, str] | Dict[str, int] | Dict[str, List[float]] | None = None, cache: str | None = None, annotations: Dict[str, str] | Dict[str, int] | Dict[str, List[float]] | None = None, **kwargs: Any)

Loads annotations, saved layer activations / features, and prepares output saving directories. Will also read/write processed features to a PKL cache file to save time in future iterations.

Note

Storing predictions along with layer features is optional, so that users can reduce the memory footprint. For example, saving predictions for a 10,000 slide dataset with 1000 categorical outcomes would require:

4 bytes/float32-logit * 1000 predictions/tile * 3000 tiles/slide * 10000 slides ~= 112 GiB (120 GB)
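
The estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, and the figure of 3000 tiles per slide is the assumption used in the note, not a property of any particular dataset.

```python
# Back-of-the-envelope memory estimate for storing tile-level predictions.
BYTES_PER_FLOAT32 = 4
predictions_per_tile = 1000   # one float32 logit per outcome category
tiles_per_slide = 3000        # assumed average, as in the note above
num_slides = 10_000

total_bytes = (
    BYTES_PER_FLOAT32 * predictions_per_tile * tiles_per_slide * num_slides
)
total_gb = total_bytes / 1e9       # decimal gigabytes
total_gib = total_bytes / 2**30    # binary gibibytes

print(f"{total_bytes:,} bytes = {total_gb:.0f} GB = {total_gib:.1f} GiB")
```

The total is 120 GB in decimal units, or roughly 112 GiB in binary units.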

Calculates features / layer activations from the model, storing them in the internal attributes self.activations, self.predictions, and self.locations. These are dictionaries mapping slides to arrays of activations, predictions, and locations for each slide’s constituent tiles.

Parameters:
  • model (str) – Path to model from which to calculate activations.

  • dataset (slideflow.Dataset) – Dataset from which to generate activations.

  • labels (dict, optional) – Dict mapping slide names to outcome categories.

  • cache (str, optional) – File for PKL cache.

Keyword Arguments:
  • augment (bool, str, optional) – Whether to use data augmentation during feature extraction. If True, will use default augmentation. If str, will use augmentation specified by the string. Defaults to None.

  • batch_size (int) – Batch size for activations calculations. Defaults to 32.

  • device (str, optional) – Device to use for feature extraction. Only used for PyTorch feature extractors. Defaults to None.

  • include_preds (bool) – Calculate and store predictions. Defaults to True.

  • include_uncertainty (bool, optional) – Whether to include model uncertainty in the output. Only used if the feature generator is a UQ-enabled model. Defaults to True.

  • layers (str, list(str)) – Layers to extract features from. May be the name of a single layer (str) or a list of layers (list). Only used if model is a str. Defaults to ‘postconv’.

  • normalizer ((str or slideflow.norm.StainNormalizer), optional) – Stain normalization strategy to use on image tiles prior to feature extraction. This argument is invalid if model is a feature extractor built from a trained model, as stain normalization will be specified by the model configuration. Defaults to None.

  • normalizer_source (str, optional) – Stain normalization preset or path to a source image. Valid presets include ‘v1’, ‘v2’, and ‘v3’. If None, will use the default preset (‘v3’). This argument is invalid if model is a feature extractor built from a trained model. Defaults to None.

  • num_workers (int, optional) – Number of workers to use for feature extraction. Only used for PyTorch feature extractors. Defaults to None.

  • pool_sort (bool) – Use multiprocessing pools to perform final sorting. Defaults to True.

  • progress (bool) – Show a progress bar during feature calculation. Defaults to True.

  • verbose (bool) – Show verbose logging output. Defaults to True.

Examples

Calculate features using a feature extractor.

import slideflow as sf
from slideflow.model import build_feature_extractor

# Create a feature extractor
retccl = build_feature_extractor('retccl', tile_px=299)

# Load a dataset
P = sf.load_project(...)
dataset = P.dataset(...)

# Calculate features
dts_ftrs = sf.DatasetFeatures(retccl, dataset)

Calculate features using a trained model (preferred).

import slideflow as sf
from slideflow.model import build_feature_extractor

# Create a feature extractor from the saved model.
extractor = build_feature_extractor(
    '/path/to/trained_model.zip',
    layers=['postconv']
)

# Calculate features across the dataset
dts_ftrs = sf.DatasetFeatures(extractor, dataset)

Calculate features using a trained model (legacy).

# This method is deprecated, and will be removed in a
# future release. Please use the method above instead.
dts_ftrs = sf.DatasetFeatures(
    '/path/to/trained_model.zip',
    dataset=dataset,
    layers=['postconv']
)

Calculate features from a loaded model.

import tensorflow as tf
import slideflow as sf

# Load a model
model = tf.keras.models.load_model('/path/to/model.h5')

# Calculate features
# Positional arguments (model, dataset) must come before keyword arguments
dts_ftrs = sf.DatasetFeatures(
    model,
    dataset,
    layers=['postconv']
)

Methods

activations_by_category(self, idx: int) → Dict[str | int | List[float], ndarray]

For each outcome category, calculates activations of a given feature across all tiles in the category. Requires annotations to have been provided.

Parameters:

idx (int) – Index of activations layer to return, stratified by outcome category.

Returns:

Dict mapping categories to feature activations for all tiles in the category.

Return type:

dict
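
The grouping described above can be illustrated with plain Python containers. This is a minimal sketch, not the library's implementation: `group_feature_by_category`, the nested-list layout of `activations`, and the example labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def group_feature_by_category(
    activations: Dict[str, List[List[float]]],
    labels: Dict[str, str],
    idx: int,
) -> Dict[str, List[float]]:
    """Collect the value of feature `idx` for every tile, keyed by category.

    `activations` maps slide name -> list of per-tile feature vectors.
    `labels` maps slide name -> outcome category.
    Slides without a label are skipped.
    """
    grouped: Dict[str, List[float]] = defaultdict(list)
    for slide, tiles in activations.items():
        if slide not in labels:
            continue
        grouped[labels[slide]].extend(tile[idx] for tile in tiles)
    return dict(grouped)


activations = {
    "slide1": [[0.1, 0.9], [0.2, 0.8]],
    "slide2": [[0.3, 0.7]],
    "slide3": [[0.4, 0.6]],
}
labels = {"slide1": "tumor", "slide2": "normal", "slide3": "tumor"}
by_category = group_feature_by_category(activations, labels, idx=1)
print(by_category)
```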

box_plots(self, features: List[int], outdir: str) → None

Generates plots comparing node activations at slide- and tile-level.

Parameters:
  • features (list(int)) – List of feature indices for which to generate box plots.

  • outdir (str) – Path to directory in which to save box plots.

concat(args: Iterable[DatasetFeatures]) → DatasetFeatures

Concatenate activations from multiple DatasetFeatures together.

For example, if df1 is a DatasetFeatures object with 2048 features and df2 is a DatasetFeatures object with 1024 features, then sf.DatasetFeatures.concat([df1, df2]) would return an object with 3072 features.

Vectors from DatasetFeatures objects are concatenated in the given order. During concatenation, predictions and uncertainty are dropped.

Any tiles that do not have calculated features in every DatasetFeatures object being concatenated will be dropped.

Parameters:

args (Iterable[DatasetFeatures]) – DatasetFeatures objects to concatenate.

Returns:

DatasetFeatures object with concatenated features.

Return type:

DatasetFeatures

Examples

Concatenate two DatasetFeatures objects.

>>> df1 = DatasetFeatures(model, dataset, layers='postconv')
>>> df2 = DatasetFeatures(model, dataset, layers='sepconv_3')
>>> df = DatasetFeatures.concat([df1, df2])
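
The concatenation rules for concat (ordered joining of vectors, dropping tiles missing from any input) can be sketched with dictionaries. This is an illustration under stated assumptions, not the library's code: `concat_features` and the `(slide, x, y)` tile key are hypothetical stand-ins for the internal representation.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for per-tile features: (slide, x, y) -> feature vector.
TileKey = Tuple[str, int, int]
Features = Dict[TileKey, List[float]]


def concat_features(feature_sets: List[Features]) -> Features:
    """Concatenate feature vectors tile by tile, in the order given.

    Only tiles present in every input are kept; all others are dropped.
    """
    if not feature_sets:
        return {}
    shared = set(feature_sets[0])
    for fs in feature_sets[1:]:
        shared &= set(fs)
    return {
        key: [value for fs in feature_sets for value in fs[key]]
        for key in sorted(shared)
    }


a = {("slide1", 0, 0): [0.1, 0.2], ("slide1", 0, 299): [0.3, 0.4]}
b = {("slide1", 0, 0): [0.5], ("slide2", 0, 0): [0.9]}
merged = concat_features([a, b])
print(merged)
```

Only the tile at `("slide1", 0, 0)` appears in both inputs, so it is the only tile in the result, with a vector of length 2 + 1 = 3.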

from_df(df: DataFrame) → DatasetFeatures

Load DataFrame of features, as exported by DatasetFeatures.to_df()

Parameters:

df (pandas.DataFrame) – DataFrame of features, as exported by DatasetFeatures.to_df()

Returns:

DatasetFeatures object

Return type:

DatasetFeatures

Examples

Recreate DatasetFeatures after export to a DataFrame.

>>> df = features.to_df()
>>> new_features = DatasetFeatures.from_df(df)

load_cache(self, path: str)

Load cached activations from PKL.

Parameters:

path (str) – Path to pkl cache.

map_activations(self, **kwargs) → SlideMap

Map activations with UMAP.

Keyword Arguments:

...

Returns:

sf.SlideMap

map_predictions(self, x: int = 0, y: int = 0, **kwargs) → SlideMap

Map tile predictions onto x/y coordinate space.

Parameters:
  • x (int, optional) – Outcome category id for which predictions will be mapped to the X-axis. Defaults to 0.

  • y (int, optional) – Outcome category id for which predictions will be mapped to the Y-axis. Defaults to 0.

Keyword Arguments:

cache (str, optional) – Path to parquet file to cache coordinates. Defaults to None (caching disabled).

Returns:

sf.SlideMap
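
The idea of mapping predictions onto a coordinate space can be shown with a small sketch: each tile becomes a point whose X and Y values are its predictions for two chosen outcome categories. `prediction_coordinates` and the data layout are hypothetical, and the sketch returns plain tuples rather than an sf.SlideMap.

```python
from typing import Dict, List, Tuple


def prediction_coordinates(
    predictions: Dict[str, List[List[float]]],
    x: int = 0,
    y: int = 0,
) -> List[Tuple[str, float, float]]:
    """Return one (slide, x_value, y_value) point per tile.

    `predictions` maps slide name -> list of per-tile prediction vectors.
    `x` and `y` are the outcome category indices plotted on each axis.
    """
    return [
        (slide, tile[x], tile[y])
        for slide, tiles in predictions.items()
        for tile in tiles
    ]


predictions = {
    "slide1": [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
    "slide2": [[0.3, 0.3, 0.4]],
}
points = prediction_coordinates(predictions, x=0, y=2)
print(points)
```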

merge(self, df: DatasetFeatures) → None

Merges with another DatasetFeatures.

Parameters:

df (slideflow.DatasetFeatures) – Target DatasetFeatures to merge with.

Returns:

None

remove_slide(self, slide: str) → None

Removes slide from calculated features.

save_cache(self, path: str)

Cache calculated activations to file.

Parameters:

path (str) – Path to pkl.
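
A PKL cache is a pickled Python object written to disk. The round trip below is a generic sketch using the standard library; the contents of `features` are hypothetical and do not reflect the layout of the cache file written by save_cache.

```python
import os
import pickle
import tempfile

# Hypothetical cached payload: slide name -> per-tile feature vectors.
features = {"slide1": [[0.1, 0.2], [0.3, 0.4]], "slide2": [[0.5, 0.6]]}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "features.pkl")

    # Write the cache.
    with open(path, "wb") as f:
        pickle.dump(features, f)

    # Read it back in a later session instead of recalculating.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == features)
```

As with any pickle file, a cache should only be loaded from a trusted source, since unpickling can execute arbitrary code.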

save_example_tiles(self, features: List[int], outdir: str, slides: List[str] | None = None, tiles_per_feature: int = 100) → None

For a set of activation features, saves image tiles named according to their corresponding activations.

Duplicate image tiles will be saved for each feature, organized into subfolders named according to feature.

Parameters:
  • features (list(int)) – Features to evaluate.

  • outdir (str) – Path to folder in which to save examples tiles.

  • slides (list, optional) – List of slide names. If provided, will only include tiles from these slides. Defaults to None.

  • tiles_per_feature (int, optional) – Number of tiles to include as examples for each feature. Defaults to 100. Will evenly sample this many tiles across the activation gradient.
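
One way to sample tiles evenly across an activation gradient is to sort tiles by activation and pick evenly spaced positions in the sorted order. The sketch below shows that approach; `sample_across_gradient` is a hypothetical helper, and the library may select tiles differently.

```python
from typing import List, Sequence


def sample_across_gradient(values: Sequence[float], n: int) -> List[int]:
    """Return indices of up to `n` items spread evenly across sorted `values`.

    Indices are returned in order of increasing value.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    if n <= 0:
        return []
    if n >= len(order):
        return order
    if n == 1:
        return [order[len(order) // 2]]
    # Evenly spaced positions from the lowest to the highest value.
    step = (len(order) - 1) / (n - 1)
    return [order[round(i * step)] for i in range(n)]


tile_activations = [0.9, 0.1, 0.5, 0.3, 0.7]
chosen = sample_across_gradient(tile_activations, 3)
print(chosen, [tile_activations[i] for i in chosen])
```

Here the three chosen tiles are the lowest, the median, and the highest activation (0.1, 0.5, 0.9).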

softmax_mean(self) → Dict[str, ndarray]

Calculates the mean prediction vector (post-softmax) across all tiles in each slide.

Returns:

Dictionary mapping slides to the mean prediction vector across all tiles in each slide.

Return type:

dict
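
The per-slide mean can be sketched as a column-wise average over tile prediction vectors. This is an illustration with plain lists; `mean_prediction_per_slide` and the data layout are hypothetical.

```python
from typing import Dict, List


def mean_prediction_per_slide(
    predictions: Dict[str, List[List[float]]],
) -> Dict[str, List[float]]:
    """Average the prediction vectors of all tiles in each slide.

    `predictions` maps slide name -> list of per-tile prediction vectors.
    Slides with no tiles are omitted.
    """
    return {
        slide: [sum(column) / len(tiles) for column in zip(*tiles)]
        for slide, tiles in predictions.items()
        if tiles
    }


predictions = {
    "slide1": [[0.75, 0.25], [0.25, 0.75]],
    "slide2": [[1.0, 0.0]],
    "slide3": [],
}
slide_means = mean_prediction_per_slide(predictions)
print(slide_means)
```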

softmax_percent(self, prediction_filter: List[int] | None = None) → Dict[str, ndarray]

Returns dictionary mapping slides to a vector of length num_classes with the percent of tiles in each slide predicted to be each outcome.

Parameters:

prediction_filter – (optional) List of int. If provided, will restrict predictions to only these categories, with the final prediction based on the highest logit among these categories.

Returns:

Dictionary mapping slides to an array of length num_classes containing the percent of tiles predicted as each outcome.

Return type:

dict
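
The calculation can be sketched as a per-tile argmax, optionally restricted by a filter, followed by a count per class. This is a minimal sketch with hypothetical helper names. It reports fractions between 0 and 1, which is an assumption; the library's scaling may differ.

```python
from typing import Dict, List, Optional


def predicted_class(
    tile: List[float], prediction_filter: Optional[List[int]] = None
) -> int:
    """Index of the highest value, considering only the filtered categories."""
    candidates = prediction_filter if prediction_filter else range(len(tile))
    return max(candidates, key=lambda c: tile[c])


def fraction_predicted(
    predictions: Dict[str, List[List[float]]],
    prediction_filter: Optional[List[int]] = None,
) -> Dict[str, List[float]]:
    """Fraction of tiles in each slide predicted as each class."""
    result = {}
    for slide, tiles in predictions.items():
        if not tiles:
            continue
        counts = [0] * len(tiles[0])
        for tile in tiles:
            counts[predicted_class(tile, prediction_filter)] += 1
        result[slide] = [count / len(tiles) for count in counts]
    return result


predictions = {
    "slide1": [
        [0.7, 0.2, 0.1],
        [0.1, 0.8, 0.1],
        [0.6, 0.3, 0.1],
        [0.2, 0.1, 0.7],
    ]
}
unfiltered = fraction_predicted(predictions)
filtered = fraction_predicted(predictions, prediction_filter=[1, 2])
print(unfiltered, filtered)
```

With the filter `[1, 2]`, category 0 can never be chosen, so tiles that would have been assigned to it are reassigned to whichever of categories 1 and 2 scores higher.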

softmax_predict(self, prediction_filter: List[int] | None = None) → Dict[str, int]

Returns slide-level predictions, assuming the model is predicting a categorical outcome, by generating a prediction for each individual tile, and making a slide-level prediction by finding the most frequently predicted outcome among its constituent tiles.

Parameters:

prediction_filter – (optional) List of int. If provided, will restrict predictions to only these categories, with the final prediction based on the highest logit among these categories.

Returns:

Dictionary mapping slide names to slide-level predictions.

Return type:

dict
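
The majority-vote step can be sketched as follows: each tile votes for its highest-scoring category, and the slide takes the most common vote. The function names and data layout are hypothetical, and the tie-breaking rule (first category encountered) is a property of this sketch, not a documented behavior of the library.

```python
from collections import Counter
from typing import Dict, List, Optional


def predicted_class(
    tile: List[float], prediction_filter: Optional[List[int]] = None
) -> int:
    """Index of the highest value, considering only the filtered categories."""
    candidates = prediction_filter if prediction_filter else range(len(tile))
    return max(candidates, key=lambda c: tile[c])


def slide_level_predictions(
    predictions: Dict[str, List[List[float]]],
    prediction_filter: Optional[List[int]] = None,
) -> Dict[str, int]:
    """Most frequently predicted category among the tiles of each slide."""
    result = {}
    for slide, tiles in predictions.items():
        if not tiles:
            continue
        votes = Counter(predicted_class(t, prediction_filter) for t in tiles)
        # most_common breaks ties by first occurrence (an artifact of this sketch).
        result[slide] = votes.most_common(1)[0][0]
    return result


predictions = {
    "slide1": [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.6, 0.3, 0.1]],
    "slide2": [[0.2, 0.1, 0.7]],
}
slide_preds = slide_level_predictions(predictions)
slide_preds_filtered = slide_level_predictions(predictions, [1, 2])
print(slide_preds, slide_preds_filtered)
```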

stats(self, outdir: str | None = None, method: str = 'mean', threshold: float = 0.5) → Tuple[Dict[int, Dict[str, float]], Dict[int, Dict[str, float]], List[ndarray]]

Calculates activation averages across categories, as well as tile-level and patient-level statistics, using ANOVA, exporting to CSV if desired.

Parameters:
  • outdir (str, optional) – Path to directory in which CSV file will be saved. Defaults to None.

  • method (str, optional) – Indicates method of aggregating tile-level data into slide-level data. Either ‘mean’ (default) or ‘threshold’. If mean, slide-level feature data is calculated by averaging feature activations across all tiles. If threshold, slide-level feature data is calculated by counting the number of tiles with feature activations > threshold and dividing by the total number of tiles. Defaults to ‘mean’.

  • threshold (float, optional) – Activation threshold used when method is ‘threshold’. Defaults to 0.5.

Returns:

A tuple containing

dict: Dict mapping slides to dict of slide-level features;

dict: Dict mapping features to tile-level statistics (‘p’, ‘f’);

dict: Dict mapping features to slide-level statistics (‘p’, ‘f’);
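
The two aggregation options described for the method parameter can be sketched for a single feature in a single slide. `aggregate_slide_feature` is a hypothetical helper that covers only the tile-to-slide aggregation step, not the ANOVA statistics.

```python
from typing import Sequence


def aggregate_slide_feature(
    tile_values: Sequence[float],
    method: str = "mean",
    threshold: float = 0.5,
) -> float:
    """Reduce per-tile activations of one feature to a slide-level value.

    'mean': average activation across all tiles.
    'threshold': fraction of tiles with activation > threshold.
    """
    if not tile_values:
        raise ValueError("tile_values must not be empty")
    if method == "mean":
        return sum(tile_values) / len(tile_values)
    if method == "threshold":
        above = sum(1 for value in tile_values if value > threshold)
        return above / len(tile_values)
    raise ValueError(f"Unknown method: {method!r}")


tile_values = [0.0, 0.5, 1.0, 1.5]
by_mean = aggregate_slide_feature(tile_values, method="mean")
by_threshold = aggregate_slide_feature(tile_values, method="threshold", threshold=0.5)
print(by_mean, by_threshold)
```

The comparison is strict, so a tile with an activation exactly equal to the threshold is not counted.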

to_csv(self, filename: str, level: str = 'tile', method: str = 'mean', slides: List[str] | None = None)

Exports calculated activations to csv.

Parameters:
  • filename (str) – Path to CSV file for export.

  • level (str) – ‘tile’ or ‘slide’. Indicates whether tile or slide-level activations are saved. Defaults to ‘tile’.

  • method (str) – Method of summarizing slide-level results. Either ‘mean’ or ‘median’. Defaults to ‘mean’.

  • slides (list(str)) – Slides to export. If None, exports all slides. Defaults to None.

to_df(self) → DataFrame

Export activations, predictions, uncertainty, and locations to a pandas DataFrame.

Returns:

Dataframe with columns ‘activations’, ‘predictions’, ‘uncertainty’, and ‘locations’.

Return type:

pd.core.frame.DataFrame

to_torch(self, outdir: str, slides: List[str] | None = None, verbose: bool = True) → None

Export activations in torch format to .pt files in the directory.

Used for training MIL models.

Parameters:
  • outdir (str) – Path to directory in which to save .pt files.

  • slides (list(str), optional) – Slides to export. If None, exports all slides. Defaults to None.

  • verbose (bool) – Verbose logging output. Defaults to True.