slideflow.Dataset
- class Dataset(config: str | Dict[str, Dict[str, str]] | None = None, sources: str | List[str] | None = None, tile_px: int | None = None, tile_um: str | int | None = None, *, tfrecords: str | None = None, tiles: str | None = None, roi: str | None = None, slides: str | None = None, filters: Dict | None = None, filter_blank: str | List[str] | None = None, annotations: str | DataFrame | None = None, min_tiles: int = 0)
Supervises organization and processing of slides, tfrecords, and tiles.
Datasets can be comprised of one or more sources, where a source is a combination of slides and any associated regions of interest (ROI) and extracted image tiles (stored as TFRecords or loose images).
Datasets can be created in two ways: either by loading one dataset source, or by loading a dataset configuration that contains information about multiple dataset sources.
For the first approach, the dataset source configuration is provided via keyword arguments (tiles, tfrecords, slides, and roi). Each is a path to a directory containing the respective data.
For the second approach, the first argument, config, is either a nested dictionary containing the configuration for multiple dataset sources, or a path to a JSON file with this information. The second argument, sources, is a list of dataset sources to load (keys from the config dictionary).
With either approach, slide/patient-level annotations are provided through the annotations keyword argument, which can be either a path to a CSV file or a pandas DataFrame, and which must contain at minimum the column ‘patient’.
Initialize a Dataset to organize processed images.
- Examples
Load a dataset via keyword arguments.
from slideflow import Dataset

dataset = Dataset(
    tfrecords='../path',
    slides='../path',
    annotations='../file.csv'
)
Load a dataset configuration file and specify dataset source(s).
dataset = Dataset(
    config='../path/to/config.json',
    sources=['Lung_Adeno', 'Lung_Squam'],
    annotations='../file.csv'
)
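The nested configuration format accepted by config can be sketched as follows. This is a minimal illustration, not taken from a real project: the directory paths are placeholders, and only the keys ‘tiles’, ‘tfrecords’, ‘roi’, and ‘slides’ and the two source names come from this page.

```python
import json
import os
import tempfile

# Placeholder directories for two dataset sources (illustrative only).
config = {
    "Lung_Adeno": {
        "slides": "/data/lung_adeno/slides",
        "roi": "/data/lung_adeno/roi",
        "tiles": "/data/lung_adeno/tiles",
        "tfrecords": "/data/lung_adeno/tfrecords",
    },
    "Lung_Squam": {
        "slides": "/data/lung_squam/slides",
        "roi": "/data/lung_squam/roi",
        "tiles": "/data/lung_squam/tiles",
        "tfrecords": "/data/lung_squam/tfrecords",
    },
}

# Save the dictionary as JSON; the resulting path is the kind of value
# that could be passed as the `config` argument.
config_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Reading the file back yields the same nested dictionary.
with open(config_path) as f:
    loaded = json.load(f)
```

Either the dictionary itself or the path to the JSON file should be usable as config, with the dictionary keys serving as the names passed to sources.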
- Parameters:
config (str, dict) – Either a dictionary or a path to a JSON file. If a dictionary, keys should be dataset source names, and values should be dictionaries containing the keys ‘tiles’, ‘tfrecords’, ‘roi’, and/or ‘slides’, specifying directories for each dataset source. If config is a str, it should be a path to a JSON file containing a dictionary with the same formatting. If None, tiles, tfrecords, roi and/or slides should be manually provided via keyword arguments. Defaults to None.
sources (List[str]) – List of dataset sources to include from configuration. If not provided, will use all sources in the provided configuration. Defaults to None.
tile_px (int) – Tile size in pixels.
tile_um (int or str) – Tile size in microns (int) or magnification (str, e.g. “20x”).
- Keyword Arguments:
filters (dict, optional) – Dataset filters to use for selecting slides. See slideflow.Dataset.filter() for more information. Defaults to None.
filter_blank (list(str) or str, optional) – Skip slides that have blank values in these patient annotation columns. Defaults to None.
min_tiles (int, optional) – Only include slides with this many tiles at minimum. Defaults to 0.
annotations (str or pd.DataFrame, optional) – Path to annotations file or pandas DataFrame with slide-level annotations. Defaults to None.
- Raises:
errors.SourceNotFoundError – If provided source does not exist in the dataset config.
Attributes
Pandas DataFrame of all loaded clinical annotations.
Returns the active filters, if any.
Returns the active filter_blank filter, if any.
Pandas DataFrame of clinical annotations, after filtering.
Format of images stored in TFRecords (jpg/png).
Returns the active min_tiles filter, if any (defaults to 0).
Number of tiles in tfrecords after filtering/clipping.
Methods
- balance(self, headers: str | List[str] | None = None, strategy: str | None = 'category', *, force: bool = False) -> Dataset
Return a dataset with mini-batch balancing configured.
Mini-batch balancing can be configured at tile, slide, patient, or category levels.
Balancing information is saved to the attribute prob_weights, which is used by the interleaving dataloaders when sampling from tfrecords to create a batch.
Tile level balancing will create prob_weights reflective of the number of tiles per slide, thus causing the batch sampling to mirror random sampling from the entire population of tiles (rather than randomly sampling from slides).
Slide level balancing is the default behavior, where batches are assembled by randomly sampling from each slide/tfrecord with equal probability. This balancing behavior would be the same as no balancing.
Patient level balancing is used to randomly sample from individual patients with equal probability. This is distinct from slide level balancing, as some patients may have multiple slides.
Category level balancing takes a list of annotation header(s) and generates prob_weights such that each category is sampled equally. This requires categorical outcomes.
- Parameters:
headers (list of str, optional) – List of annotation headers if balancing by category. Defaults to None.
strategy (str, optional) – ‘tile’, ‘slide’, ‘patient’ or ‘category’. Create prob_weights used to balance dataset batches to evenly distribute slides, patients, or categories in a given batch. Tile-level balancing generates prob_weights reflective of the total number of tiles in a slide. Defaults to ‘category’.
force (bool, optional) – If using category-level balancing, interpret all headers as categorical variables, even if the header appears to be a float.
- Returns:
Balanced slideflow.Dataset object.
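The four balancing strategies can be illustrated with a small sketch of how per-slide sampling weights might be derived. This is not Slideflow's implementation; the function name and the input dictionaries are hypothetical, and the weights are simply normalized to sum to 1.

```python
from collections import Counter

def sketch_prob_weights(tiles_per_slide, strategy, groups=None):
    """Illustrative per-slide sampling weights (normalized to sum to 1).

    tiles_per_slide: dict mapping slide name -> number of tiles.
    groups: dict mapping slide name -> patient (for 'patient')
            or category (for 'category').
    """
    slides = list(tiles_per_slide)
    if strategy == "tile":
        # Weight by tile count: mirrors sampling from the pool of all tiles.
        raw = {s: float(tiles_per_slide[s]) for s in slides}
    elif strategy == "slide":
        # Every slide is equally likely.
        raw = {s: 1.0 for s in slides}
    elif strategy in ("patient", "category"):
        # Every group is equally likely; slides share their group's weight.
        counts = Counter(groups[s] for s in slides)
        raw = {s: 1.0 / counts[groups[s]] for s in slides}
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    total = sum(raw.values())
    return {s: w / total for s, w in raw.items()}

tiles = {"a": 100, "b": 300, "c": 600}
categories = {"a": "x", "b": "x", "c": "y"}

tile_weights = sketch_prob_weights(tiles, "tile")
slide_weights = sketch_prob_weights(tiles, "slide")
category_weights = sketch_prob_weights(tiles, "category", groups=categories)
```

In this toy example, category balancing gives slide "c" half of the sampling probability because it is the only slide in category "y", while "a" and "b" split the other half.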
- build_index(self, force: bool = True, *, num_workers: int | None = None) -> None
Build index files for TFRecords.
- cell_segmentation(self, diam_um: float, dest: str, *, model: cellpose.models.Cellpose | str = 'cyto2', window_size: int = 256, diam_mean: int | None = None, qc: str | None = None, qc_kwargs: dict | None = None, buffer: str | None = None, q_size: int = 2, force: bool = False, save_centroid: bool = True, save_flow: bool = False, **kwargs) -> None
Perform cell segmentation on slides, saving segmentation masks.
- Parameters:
- Keyword Arguments:
batch_size (int) – Batch size for cell segmentation. Defaults to 8.
cp_thresh (float) – Cell probability threshold. All pixels with value above threshold kept for masks, decrease to find more and larger masks. Defaults to 0.
diam_mean (int, optional) – Cell diameter to detect, in pixels (without image resizing). If None, uses Cellpose defaults (17 for the ‘nuclei’ model, 30 for all others).
downscale (float) – Factor by which to downscale generated masks after calculation. Defaults to None (keep masks at original size).
flow_threshold (float) – Flow error threshold (all cells with errors below threshold are kept). Defaults to 0.4.
gpus (int, list(int)) – GPUs to use for cell segmentation. Defaults to 0 (first GPU).
interp (bool) – Interpolate during 2D dynamics. Defaults to True.
qc (str) – Slide-level quality control method to use before performing cell segmentation. Defaults to “Otsu”.
model (str, cellpose.models.Cellpose) – Cellpose model to use for cell segmentation. May be any valid cellpose model. Defaults to ‘cyto2’.
mpp (float) – Microns-per-pixel at which cells should be segmented. Defaults to 0.5.
num_workers (int, optional) – Number of workers. Defaults to 2 * num_gpus.
save_centroid (bool) – Save mask centroids. Increases memory utilization slightly. Defaults to True.
save_flow (bool) – Save flow values for the whole-slide image. Increases memory utilization. Defaults to False.
sources (List[str]) – List of dataset sources to include from configuration file.
tile (bool) – Tiles image to decrease GPU/CPU memory usage. Defaults to True.
verbose (bool) – Verbose log output at the INFO level. Defaults to True.
window_size (int) – Window size at which to segment cells across a whole-slide image. Defaults to 256.
- Returns:
None
- check_duplicates(self, dataset: Dataset | None = None, px: int = 64, mse_thresh: int = 100) -> List[Tuple[str, str]]
Check for duplicate slides by comparing slide thumbnails.
- Parameters:
- Returns:
List of path pairs of potential duplicates.
- Return type:
List[Tuple[str, str]]
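Thumbnail comparison against an MSE threshold can be sketched as below. This is a simplified illustration under stated assumptions: thumbnails are represented as flat lists of pixel values, the helper names are hypothetical, and a pair is flagged when its mean squared error falls below mse_thresh. The actual comparison inside check_duplicates() may differ.

```python
from itertools import combinations

def mse(a, b):
    """Mean squared error between two equal-length sequences of pixel values."""
    if len(a) != len(b):
        raise ValueError("Thumbnails must have the same number of pixels.")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def find_potential_duplicates(thumbnails, mse_thresh=100):
    """Return pairs of slide names whose thumbnails have MSE below the threshold."""
    return [
        (name1, name2)
        for (name1, t1), (name2, t2) in combinations(sorted(thumbnails.items()), 2)
        if mse(t1, t2) < mse_thresh
    ]

# Toy 4-pixel "thumbnails" (real thumbnails would be px-by-px images).
thumbs = {
    "slide_a.svs": [10, 200, 30, 40],
    "slide_b.svs": [12, 198, 31, 41],   # nearly identical to slide_a
    "slide_c.svs": [250, 5, 180, 90],
}
pairs = find_potential_duplicates(thumbs)
```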
- clear_filters(self) -> Dataset
Return a dataset with all filters cleared.
- Returns:
slideflow.Dataset object.
- clip(self, max_tiles: int = 0, strategy: str | None = None, headers: List[str] | None = None) -> Dataset
Return a dataset with TFRecords clipped to a max number of tiles.
Clip the number of tiles per tfrecord to a given maximum value and/or to the min number of tiles per patient or category.
- Parameters:
max_tiles (int, optional) – Clip the maximum number of tiles per tfrecord to this number. Defaults to 0 (do not perform tfrecord-level clipping).
strategy (str, optional) – ‘slide’, ‘patient’, or ‘category’. Clip the maximum number of tiles to the minimum tiles seen across slides, patients, or categories. If ‘category’, headers must be provided. Defaults to None (do not perform group-level clipping).
headers (list of str, optional) – List of annotation headers to use if clipping by minimum category count (strategy=’category’). Defaults to None.
- Returns:
Clipped slideflow.Dataset object.
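The effect of clipping on per-tfrecord tile counts can be sketched as follows. This is an illustrative approximation only: the function name is hypothetical, it operates on a plain dictionary of tile counts rather than on TFRecords, and it covers only max_tiles and the ‘slide’ strategy (patient- and category-level clipping are omitted).

```python
def sketch_clip(tile_counts, max_tiles=0, strategy=None):
    """Return clipped tile counts for each slide/tfrecord.

    tile_counts: dict mapping slide name -> number of tiles.
    max_tiles: per-tfrecord cap; 0 disables tfrecord-level clipping.
    strategy: None (no group-level clipping) or 'slide'.
    """
    clipped = dict(tile_counts)
    if strategy == "slide":
        # Clip every slide to the smallest tile count seen across slides.
        floor = min(clipped.values())
        clipped = {s: min(n, floor) for s, n in clipped.items()}
    elif strategy is not None:
        raise NotImplementedError("Only 'slide' is covered in this sketch.")
    if max_tiles:
        clipped = {s: min(n, max_tiles) for s, n in clipped.items()}
    return clipped

counts = {"a": 120, "b": 500, "c": 80}
capped = sketch_clip(counts, max_tiles=100)
equalized = sketch_clip(counts, strategy="slide")
```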
- convert_xml_rois(self)
Convert ImageScope XML ROI files to QuPath format CSV ROI files.
- extract_cells(self, masks_path: str, **kwargs) -> Dict[str, SlideReport]
Extract cell images from slides, with a tile at each cell centroid.
Requires that cells have already been segmented with Dataset.cell_segmentation().
- Parameters:
masks_path (str) – Location of saved segmentation masks.
- Keyword Arguments:
apply_masks (bool) – Apply cell segmentation masks to the extracted tiles. Defaults to True.
**kwargs – All other keyword arguments for Dataset.extract_tiles().
- Returns:
Dictionary mapping slide paths to each slide’s SlideReport (slideflow.slide.report.SlideReport).
- extract_tiles(self, *, save_tiles: bool = False, save_tfrecords: bool = True, source: str | None = None, stride_div: int = 1, enable_downsample: bool = True, roi_method: str = 'auto', roi_filter_method: str | float = 'center', skip_extracted: bool = True, tma: bool = False, randomize_origin: bool = False, buffer: str | None = None, q_size: int = 2, qc: str | Callable | List[Callable] | None = None, report: bool = True, use_edge_tiles: bool = False, artifact_labels: str | List[str] | None = [], mpp_override: float | None = None, **kwargs: Any) -> Dict[str, SlideReport]
Extract tiles from a group of slides.
Extracted tiles are saved either in TFRecord format (save_tfrecords=True, default) or as loose *.jpg / *.png images (save_tiles=True). TFRecords or image tiles are saved in the tfrecord and tile directories configured by slideflow.Dataset.
- Keyword Arguments:
save_tiles (bool, optional) – Save tile images in loose format. Defaults to False.
save_tfrecords (bool) – Save compressed image data from extracted tiles into TFRecords in the corresponding TFRecord directory. Defaults to True.
source (str, optional) – Name of dataset source from which to select slides for extraction. Defaults to None. If not provided, will default to all sources in project.
stride_div (int) – Stride divisor for tile extraction. A stride_div of 1 will extract non-overlapping tiles. A stride_div of 2 will extract overlapping tiles, with a stride equal to 50% of the tile width. Defaults to 1.
enable_downsample (bool) – Enable downsampling for slides. This may result in corrupted image tiles if downsampled slide layers are corrupted or incomplete. Defaults to True.
roi_method (str) – Either ‘inside’, ‘outside’, ‘auto’, or ‘ignore’. Determines how ROIs are used to extract tiles. If ‘inside’ or ‘outside’, will extract tiles in/out of an ROI, and skip the slide if an ROI is not available. If ‘auto’, will extract tiles inside an ROI if available, and across the whole-slide if no ROI is found. If ‘ignore’, will extract tiles across the whole-slide regardless of whether an ROI is available. Defaults to ‘auto’.
roi_filter_method (str or float) – Method of filtering tiles with ROIs. Either ‘center’ or float (0-1). If ‘center’, tiles are filtered with ROIs based on the center of the tile. If float, tiles are filtered based on the proportion of the tile inside the ROI, and roi_filter_method is interpreted as a threshold. If the proportion of a tile inside the ROI is greater than this number, the tile is included. For example, if roi_filter_method=0.7, a tile that is 80% inside of an ROI will be included, and a tile that is 50% inside of an ROI will be excluded. Defaults to ‘center’.
skip_extracted (bool) – Skip slides that have already been extracted. Defaults to True.
tma (bool) – Reads slides as Tumor Micro-Arrays (TMAs). Deprecated argument; all slides are now read as standard WSIs.
randomize_origin (bool) – Randomize pixel starting position during extraction. Defaults to False.
buffer (str, optional) – Slides will be copied to this directory before extraction. Defaults to None. Using an SSD or ramdisk buffer vastly improves tile extraction speed.
q_size (int) – Size of queue when using a buffer. Defaults to 2.
qc (str, optional) – ‘otsu’, ‘blur’, ‘both’, or None. Perform blur detection quality control - discarding tiles with detected out-of-focus regions or artifact - and/or otsu’s method. Increases tile extraction time. Defaults to None.
report (bool) – Save a PDF report of tile extraction. Defaults to True.
normalizer (str, optional) – Normalization strategy. Defaults to None.
normalizer_source (str, optional) – Stain normalization preset or path to a source image. Valid presets include ‘v1’, ‘v2’, and ‘v3’. If None, will use the default preset (‘v3’). Defaults to None.
whitespace_fraction (float, optional) – Range 0-1. Discard tiles with this fraction of whitespace. If 1, will not perform whitespace filtering. Defaults to 1.
whitespace_threshold (int, optional) – Range 0-255. Defaults to 230. Threshold above which a pixel (RGB average) is whitespace.
grayspace_fraction (float, optional) – Range 0-1. Defaults to 0.6. Discard tiles with this fraction of grayspace. If 1, will not perform grayspace filtering.
grayspace_threshold (float, optional) – Range 0-1. Defaults to 0.05. Pixels in HSV format with saturation below this threshold are considered grayspace.
img_format (str, optional) – ‘png’ or ‘jpg’. Defaults to ‘jpg’. Image format to use in tfrecords. PNG (lossless) for fidelity, JPG (lossy) for efficiency.
shuffle (bool, optional) – Shuffle tiles prior to storage in tfrecords. Defaults to True.
num_threads (int, optional) – Number of worker processes for each tile extractor. When using cuCIM slide reading backend, defaults to the total number of available CPU cores, using the ‘fork’ multiprocessing method. With Libvips, this defaults to the total number of available CPU cores or 32, whichever is lower, using ‘spawn’ multiprocessing.
qc_blur_radius (int, optional) – Quality control blur radius for out-of-focus area detection. Used if qc=True. Defaults to 3.
qc_blur_threshold (float, optional) – Quality control blur threshold for detecting out-of-focus areas. Only used if qc=True. Defaults to 0.1.
qc_filter_threshold (float, optional) – Float between 0-1. Tiles with more than this proportion of blur will be discarded. Only used if qc=True. Defaults to 0.6.
qc_mpp (float, optional) – Microns-per-pixel indicating image magnification level at which quality control is performed. Defaults to mpp=4 (effective magnification 2.5x).
dry_run (bool, optional) – Determine tiles that would be extracted, but do not export any images. Defaults to None.
max_tiles (int, optional) – Only extract this many tiles per slide. Defaults to None.
use_edge_tiles (bool) – Use edge tiles in extraction. Areas outside the slide will be padded white. Defaults to False.
artifact_labels (list(str) or str, optional) – List of ROI issue labels to treat as artifacts. Whenever this is not None, all ROIs with the listed labels will be inverted with ROI.invert(). Defaults to an empty list.
mpp_override (float, optional) – Override the microns-per-pixel for each slide. If None, will auto-detect microns-per-pixel for all slides and raise an error if MPP is not found. Defaults to None.
- Returns:
Dictionary mapping slide paths to each slide’s SlideReport (slideflow.slide.report.SlideReport).
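The whitespace_fraction and whitespace_threshold arguments can be illustrated with a short sketch. This is not the library's filtering code; the helper names are hypothetical, tiles are represented as lists of (R, G, B) tuples, and a tile is assumed to be discarded once its whitespace fraction reaches the configured value.

```python
def whitespace_fraction(pixels, whitespace_threshold=230):
    """Fraction of pixels whose RGB average is above the threshold."""
    white = sum(1 for (r, g, b) in pixels if (r + g + b) / 3 > whitespace_threshold)
    return white / len(pixels)

def keep_tile(pixels, max_fraction=1.0, whitespace_threshold=230):
    """Return True if the tile passes whitespace filtering."""
    if max_fraction >= 1:
        # A value of 1 disables whitespace filtering entirely.
        return True
    return whitespace_fraction(pixels, whitespace_threshold) < max_fraction

# A toy 4-pixel tile: three near-white pixels and one tissue-colored pixel.
tile = [(250, 250, 250), (240, 245, 238), (255, 255, 255), (180, 90, 140)]
fraction = whitespace_fraction(tile)
```

Grayspace filtering follows the same pattern, except that pixels are counted by HSV saturation falling below grayspace_threshold rather than by RGB average.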
- extract_tiles_from_tfrecords(self, dest: str) -> None
Extract tiles from a set of TFRecords.
- Parameters:
dest (str) – Path to directory in which to save tile images. If None, uses dataset default. Defaults to None.
- filter(self, *args: Any, **kwargs: Any) -> Dataset
Return a filtered dataset.
This method can either accept a single argument (filters) or any combination of keyword arguments (filters, filter_blank, or min_tiles).
- Keyword Arguments:
filters (dict, optional) – Dictionary used for filtering the dataset. Dictionary keys should be column headers in the patient annotations, and the values should be the variable states to be included in the dataset. For example, filters={'HPV_status': ['negative', 'positive']} would filter the dataset by the column HPV_status and only include slides with values of either 'negative' or 'positive' in this column. See Filtering for further discussion. Defaults to None.
filter_blank (list(str) or str, optional) – Skip slides that have blank values in these patient annotation columns. Defaults to None.
min_tiles (int) – Filter out tfrecords that have fewer than this minimum number of tiles. Defaults to 0.
- Returns:
Dataset with filter added.
- Return type:
slideflow.Dataset
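The semantics of filters and filter_blank can be sketched on plain annotation rows. This is an illustration only, assuming annotations are represented as a list of dictionaries; the function name, the ‘slide’ and ‘site’ columns, and the example values are hypothetical.

```python
def apply_filters(rows, filters=None, filter_blank=None):
    """Return annotation rows that satisfy `filters` and have no blank
    values in the `filter_blank` columns."""
    filters = filters or {}
    if isinstance(filter_blank, str):
        filter_blank = [filter_blank]
    blank_cols = filter_blank or []
    kept = []
    for row in rows:
        # Each filtered column must hold one of the permitted values.
        matches = all(
            row.get(col) in (vals if isinstance(vals, (list, tuple, set)) else [vals])
            for col, vals in filters.items()
        )
        # Each filter_blank column must be present and non-empty.
        has_values = all(row.get(col) not in (None, "") for col in blank_cols)
        if matches and has_values:
            kept.append(row)
    return kept

rows = [
    {"patient": "P1", "slide": "S1", "HPV_status": "negative", "site": "A"},
    {"patient": "P2", "slide": "S2", "HPV_status": "positive", "site": ""},
    {"patient": "P3", "slide": "S3", "HPV_status": "unknown", "site": "B"},
]

by_status = apply_filters(rows, filters={"HPV_status": ["negative", "positive"]})
by_status_and_site = apply_filters(
    rows,
    filters={"HPV_status": ["negative", "positive"]},
    filter_blank="site",
)
```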
- find_slide(self, *, slide: str | None = None, patient: str | None = None) -> str | None
Find a slide path from a given slide or patient.
- find_tfrecord(self, *, slide: str | None = None, patient: str | None = None) -> str | None
Find a TFRecord path from a given slide or patient.
- generate_feature_bags(self, model: str | BaseFeatureExtractor, outdir: str, *, force_regenerate: bool = False, batch_size: int = 32, slide_batch_size: int = 16, num_gpus: int = 0, **kwargs: Any) -> None
Generate bags of tile-level features for slides for use with MIL models.
- Parameters:
- Keyword Arguments:
layers (list) – Which model layer(s) generate activations. If model is a saved model, this defaults to ‘postconv’. Not used if model is a pretrained feature extractor. Defaults to None.
force_regenerate (bool) – Forcibly regenerate activations for all slides even if .pt file exists. Defaults to False.
batch_size (int) – Batch size during feature calculation. Defaults to 32.
slide_batch_size (int) – Interleave feature calculation across this many slides. Higher values may improve performance but require more memory. Defaults to 16.
num_gpus (int) – Number of GPUs to use for feature extraction. Defaults to 0.
**kwargs – Additional keyword arguments are passed to slideflow.DatasetFeatures.
- get_tfrecord_locations(self, slide: str) -> List[Tuple[int, int]]
Return a list of locations stored in an associated TFRecord.
- Parameters:
slide (str) – Slide name.
- Returns:
List of tuples of (x, y) coordinates.
- get_tile_dataframe(self, roi_method: str = 'auto', stride_div: int = 1) -> DataFrame
Generate a pandas dataframe with tile-level ROI labels.
- Returns:
Pandas dataframe of all tiles, with the following columns:
loc_x: X-coordinate of tile center
loc_y: Y-coordinate of tile center
grid_x: X grid index of the tile
grid_y: Y grid index of the tile
roi_name: Name of the ROI if tile is in an ROI, else None
roi_desc: Description of the ROI if tile is in an ROI, else None
label: ROI label, if present.
- harmonize_labels(self, *args: Dataset, header: str | None = None) -> Dict[str, int]
Harmonize labels with another dataset.
Returns categorical label assignments converted to int, harmonized with another dataset to ensure label consistency between datasets.
- Parameters:
*args (slideflow.Dataset) – Any number of Datasets.
header (str) – Categorical annotation header.
- Returns:
Dict mapping slide names to categories.
- is_float(self, header: str) -> bool
Check if labels in the given header can all be converted to float.
- kfold_split(self, k: int, *, labels: Dict | str | None = None, preserved_site: bool = False, site_labels: str | Dict[str, str] | None = 'site', splits: str | None = None, read_only: bool = False) -> Tuple[Tuple[Dataset, Dataset], ...]
Split the dataset into k cross-folds.
- Parameters:
k (int) – Number of cross-folds.
- Keyword Arguments:
labels (dict or str, optional) – Either a dictionary mapping slides to labels, or an outcome label (str). Used for balancing outcome labels in training and validation cohorts. If None, will not balance k-fold splits by outcome labels. Defaults to None.
preserved_site (bool) – Split with site-preserved cross-validation. Defaults to False.
site_labels (dict, optional) – Dict mapping patients to site labels, or an outcome column with site labels. Only used for site preserved cross validation. Defaults to ‘site’.
splits (str, optional) – Path to JSON file containing validation splits. Defaults to None.
read_only (bool) – Prevents writing validation splits to file. Defaults to False.
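The idea of a label-balanced k-fold split can be sketched in plain Python. This is a simplified illustration, not the method used by kfold_split(): it splits slide names directly (ignoring patient grouping and site preservation), deals slides round-robin within each label, and returns lists of names instead of Dataset objects. The function name is hypothetical.

```python
from collections import defaultdict

def sketch_kfold(labels, k):
    """Return k (train, val) pairs of slide-name lists, balanced by label.

    labels: dict mapping slide name -> outcome label.
    """
    by_label = defaultdict(list)
    for slide in sorted(labels):
        by_label[labels[slide]].append(slide)

    # Deal slides into folds one label at a time so that each fold
    # receives a similar number of slides from every label.
    folds = [[] for _ in range(k)]
    i = 0
    for label in sorted(by_label):
        for slide in by_label[label]:
            folds[i % k].append(slide)
            i += 1

    splits = []
    for v in range(k):
        val = folds[v]
        train = [s for j, fold in enumerate(folds) if j != v for s in fold]
        splits.append((train, val))
    return splits

slide_labels = {"s1": "a", "s2": "a", "s3": "a", "s4": "b", "s5": "b", "s6": "b"}
splits = sketch_kfold(slide_labels, k=3)
```

Each of the three validation folds here contains one slide from each label, and every slide appears in exactly one validation fold.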
- labels(self, headers: str | List[str], use_float: bool | Dict | str = False, assign: Dict[str, Dict[str, int]] | None = None, format: str = 'index') -> Tuple[Dict[str, str] | Dict[str, int] | Dict[str, List[float]], Dict[str, List[str] | List[float]] | List[str] | List[float]]
Return a dict of slide names mapped to patient id and label(s).
- Parameters:
headers (list(str) or str) – Annotation header(s). May be a list or string.
use_float (bool, optional) – If true, convert data into float; if unable, raise TypeError. If false, interpret all data as categorical. If a dict(bool), look up each header to determine type. If ‘auto’, will try to convert all data into float. For each header in which this fails, will interpret as categorical.
assign (dict, optional) – Dictionary mapping label ids to label names. If not provided, will map ids to names by sorting alphabetically.
format (str, optional) – Either ‘index’ or ‘name’. Indicates which format should be used for categorical outcomes when returning the label dictionary. If ‘name’, uses the string label name. If ‘index’, returns an int (index corresponding with the returned list of unique outcomes as str). Defaults to ‘index’.
- Returns:
A tuple containing
dict: Dictionary mapping slides to outcome labels in numerical format (float for continuous outcomes, int of outcome label id for categorical outcomes).
list: List of unique labels. For categorical outcomes, this will be a list of str; indices correspond with the outcome label id.
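The relationship between the returned label dictionary and the list of unique labels for a single categorical header can be sketched as follows. This is an illustration under stated assumptions rather than the method's actual code: the function name is hypothetical, the input is a plain dictionary of slide names to outcome strings, and ids are assigned by sorting label names alphabetically, as described for the case where assign is not provided.

```python
def sketch_categorical_labels(slide_outcomes, fmt="index"):
    """Return (labels, unique_labels) for one categorical outcome.

    slide_outcomes: dict mapping slide name -> outcome name (str).
    fmt: 'index' to return integer ids, 'name' to return outcome names.
    """
    unique = sorted(set(slide_outcomes.values()))
    if fmt == "name":
        return dict(slide_outcomes), unique
    if fmt != "index":
        raise ValueError("fmt must be 'index' or 'name'")
    # The id of each outcome is its position in the sorted list of names.
    index = {name: i for i, name in enumerate(unique)}
    return {slide: index[name] for slide, name in slide_outcomes.items()}, unique

outcomes = {"s1": "positive", "s2": "negative", "s3": "positive"}
labels, unique_labels = sketch_categorical_labels(outcomes)
named, _ = sketch_categorical_labels(outcomes, fmt="name")
```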
- load_annotations(self, annotations: str | DataFrame) -> None
Load annotations.
- Parameters:
annotations (Union[str, pd.DataFrame]) – Either path to annotations in CSV format, or a pandas DataFrame.
- Raises:
errors.AnnotationsError – If annotations are incorrectly formatted.
- manifest(self, key: str = 'path', filter: bool = True) -> Dict[str, Dict[str, int]]
Generate a manifest of all tfrecords.
- manifest_histogram(self, by: str | None = None, binrange: Tuple[int, int] | None = None) -> None
Plot histograms of tiles-per-slide.
- Example
Create histograms of tiles-per-slide, stratified by site.
import matplotlib.pyplot as plt

dataset.manifest_histogram(by='site')
plt.show()
- get_bags(self, path, warn_missing=True)
Return list of all *.pt files with slide names in this dataset.
May return more than one *.pt file for each slide.
- read_tfrecord_by_location(self, slide: str, loc: Tuple[int, int], decode: bool | None = None) -> Any
Read a record from a TFRecord, indexed by location.
Finds the associated TFRecord for a slide, and returns the record inside which corresponds to a given tile location.
- Parameters:
- Returns:
Unprocessed raw TFRecord bytes if decode=False, otherwise a tuple containing (slide, image), where image is a uint8 Tensor.
- remove_filter(self, **kwargs: Any) -> Dataset
Remove a specific filter from the active filters.
- Keyword Arguments:
- Returns:
Dataset with filter removed.
- Return type:
slideflow.Dataset
- rebuild_index(self) -> None
Rebuild index files for TFRecords.
Equivalent to Dataset.build_index(force=True).
- Returns:
None
- resize_tfrecords(self, tile_px: int) -> None
Resize images in a set of TFRecords to a given pixel size.
- Parameters:
tile_px (int) – Target pixel size for resizing TFRecord images.
- slide_manifest(self, roi_method: str = 'auto', stride_div: int = 1, tma: bool = False, source: str | None = None, low_memory: bool = False) -> Dict[str, int]
Return a dictionary of slide names and estimated number of tiles.
Uses Otsu thresholding for background filtering, and the ROI strategy.
- Parameters:
roi_method (str) – Either ‘inside’, ‘outside’, ‘auto’, or ‘ignore’. Determines how ROIs are used to extract tiles. If ‘inside’ or ‘outside’, will extract tiles in/out of an ROI, and skip a slide if an ROI is not available. If ‘auto’, will extract tiles inside an ROI if available, and across the whole-slide if no ROI is found. If ‘ignore’, will extract tiles across the whole-slide regardless of whether an ROI is available. Defaults to ‘auto’.
stride_div (int) – Stride divisor for tile extraction. A stride_div of 1 will extract non-overlapping tiles. A stride_div of 2 will extract overlapping tiles, with a stride equal to 50% of the tile width. Defaults to 1.
tma (bool) – Deprecated argument. Tumor micro-arrays are read as standard slides. Defaults to False.
source (str, optional) – Dataset source name. Defaults to None (using all sources).
low_memory (bool) – Operate in low-memory mode at the cost of worse performance.
- Returns:
Dictionary mapping slide names to number of estimated non-background tiles in the slide.
- Return type:
Dict[str, int]
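The effect of stride_div on the number of candidate tiles can be sketched with a simple grid calculation. This is a rough illustration, not the estimator used by slide_manifest(): the function name is hypothetical, dimensions are in pixels at the extraction resolution, and ROI and background filtering are ignored.

```python
def tile_origins(slide_width, slide_height, tile_px, stride_div=1):
    """Return top-left (x, y) origins of tiles on a regular grid.

    The stride between neighboring tiles is tile_px / stride_div, so
    stride_div=1 gives non-overlapping tiles and stride_div=2 gives
    tiles that overlap by 50% of the tile width.
    """
    stride = tile_px // stride_div
    xs = range(0, slide_width - tile_px + 1, stride)
    ys = range(0, slide_height - tile_px + 1, stride)
    return [(x, y) for y in ys for x in xs]

non_overlapping = tile_origins(1000, 1000, tile_px=250, stride_div=1)
overlapping = tile_origins(1000, 1000, tile_px=250, stride_div=2)
```

For this 1000 x 1000 pixel region, halving the stride raises the grid from 4 x 4 to 7 x 7 tile positions.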
- slide_paths(self, source: str | None = None, apply_filters: bool = True) -> List[str]
Return a list of paths to slides.
Either returns a list of paths to all slides, or slides only matching dataset filters.
- split(self, model_type: str | None = None, labels: Dict | str | None = None, val_strategy: str = 'fixed', splits: str | None = None, val_fraction: float | None = None, val_k_fold: int | None = None, k_fold_iter: int | None = None, site_labels: str | Dict[str, str] | None = 'site', read_only: bool = False, from_wsi: bool = False) -> Tuple[Dataset, Dataset]
Split this dataset into a training and validation dataset.
If a validation split has already been prepared (e.g. K-fold iterations were already determined), the previously generated split will be used. Otherwise, create a new split and log the result in the TFRecord directory so future models may use the same split for consistency.
- Parameters:
model_type (str) – Either ‘classification’ or ‘regression’. Defaults to ‘classification’ if labels is provided.
labels (dict or str) – Either a dictionary mapping slides to labels, or an outcome label (str). Used for balancing outcome labels in training and validation cohorts. Defaults to None.
val_strategy (str) – Either ‘k-fold’, ‘k-fold-preserved-site’, ‘bootstrap’, or ‘fixed’. Defaults to ‘fixed’.
splits (str, optional) – Path to JSON file containing validation splits. Defaults to None.
outcome_key (str, optional) – Key indicating outcome label in slide_labels_dict. Defaults to ‘outcome_label’.
val_fraction (float, optional) – Proportion of data for validation. Not used if strategy is k-fold. Defaults to None.
val_k_fold (int) – K, required if using K-fold validation. Defaults to None.
k_fold_iter (int, optional) – Which K-fold iteration to generate, starting at 1. Required if using K-fold validation. Defaults to None.
site_labels (dict, optional) – Dict mapping patients to site labels, or an outcome column with site labels. Only used for site preserved cross validation. Defaults to ‘site’.
read_only (bool) – Prevents writing validation splits to file. Defaults to False.
- Returns:
A tuple containing:
slideflow.Dataset: Training dataset.
slideflow.Dataset: Validation dataset.
- split_tfrecords_by_roi(self, destination: str, roi_filter_method: str | float = 'center') -> None
Split dataset tfrecords into separate tfrecords according to ROI.
Will generate two sets of tfrecords, with identical names: one with tiles inside the ROIs, one with tiles outside the ROIs. Will skip any tfrecords that are missing ROIs. Requires slides to be available.
- Parameters:
destination (str) – Destination path.
roi_filter_method (str or float) – Method of filtering tiles with ROIs. Either ‘center’ or float (0-1). If ‘center’, tiles are filtered with ROIs based on the center of the tile. If float, tiles are filtered based on the proportion of the tile inside the ROI, and roi_filter_method is interpreted as a threshold. If the proportion of a tile inside the ROI is greater than this number, the tile is included. For example, if roi_filter_method=0.7, a tile that is 80% inside of an ROI will be included, and a tile that is 50% inside of an ROI will be excluded. Defaults to ‘center’.
- Returns:
None
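The two roi_filter_method modes can be sketched with axis-aligned rectangles. This is a simplified illustration: real ROIs are typically polygons, and the helper names and the (x0, y0, x1, y1) box representation are assumptions made for this example.

```python
def fraction_inside(tile, roi):
    """Proportion of the tile's area that overlaps the ROI.

    Both arguments are (x0, y0, x1, y1) axis-aligned boxes.
    """
    tx0, ty0, tx1, ty1 = tile
    rx0, ry0, rx1, ry1 = roi
    overlap_w = max(0, min(tx1, rx1) - max(tx0, rx0))
    overlap_h = max(0, min(ty1, ry1) - max(ty0, ry0))
    return (overlap_w * overlap_h) / ((tx1 - tx0) * (ty1 - ty0))

def include_tile(tile, roi, roi_filter_method="center"):
    """Decide whether a tile counts as inside the ROI."""
    if roi_filter_method == "center":
        # Include the tile if its center point falls within the ROI.
        cx = (tile[0] + tile[2]) / 2
        cy = (tile[1] + tile[3]) / 2
        return roi[0] <= cx <= roi[2] and roi[1] <= cy <= roi[3]
    # Otherwise treat the value as a threshold on the overlap proportion.
    return fraction_inside(tile, roi) > float(roi_filter_method)

roi = (0, 0, 100, 100)
tile_80 = (20, 0, 120, 100)   # 80% of this tile lies inside the ROI
tile_40 = (60, 0, 160, 100)   # 40% of this tile lies inside the ROI
```

With a threshold of 0.7, tile_80 is included and tile_40 is excluded; with ‘center’, the outcome depends only on where each tile's midpoint falls.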
- tensorflow(self, labels: Dict[str, str] | Dict[str, int] | Dict[str, List[float]] = None, batch_size: int | None = None, from_wsi: bool = False, **kwargs: Any) -> tf.data.Dataset
Return a Tensorflow Dataset object that interleaves tfrecords.
The returned dataset yields a batch of (image, label) for each tile. Labels may be specified either via a dict mapping slide names to outcomes, or a parsing function which accepts an image and slide name, returning a dict {‘image_raw’: image (tensor)} and label (int or float).
- Parameters:
labels (dict or str, optional) – Dict or function. If dict, must map slide names to outcome labels. If function, function must accept an image (tensor) and slide name (str), and return a dict {‘image_raw’: image (tensor)} and label (int or float). If not provided, all labels will be None.
batch_size (int) – Batch size.
- Keyword Arguments:
Image augmentations to perform. Augmentations include:
'x': Random horizontal flip
'y': Random vertical flip
'r': Random 90-degree rotation
'j': Random JPEG compression (50% chance to compress with quality between 50-100)
'b': Random Gaussian blur (10% chance to blur with sigma between 0.5-2.0)
'n': Random Stain Augmentation (requires stain normalizer)
Combine letters to define augmentations, such as 'xyrjn'. A value of True will use 'xyrjb'.
deterministic (bool, optional) – When num_parallel_calls is specified, if this boolean is specified, it controls the order in which the transformation produces elements. If set to False, the transformation is allowed to yield elements out of order to trade determinism for performance. Defaults to False.
drop_last (bool, optional) – Drop the last non-full batch. Defaults to False.
from_wsi (bool) – Generate predictions from tiles dynamically extracted from whole-slide images, rather than TFRecords. Defaults to False (use TFRecords).
incl_loc (str, optional) – ‘coord’, ‘grid’, or None. Return (x,y) origin coordinates (‘coord’) for each tile center along with tile images, or the (x,y) grid coordinates for each tile. Defaults to ‘coord’.
incl_slidenames (bool, optional) – Include slidenames as third returned variable. Defaults to False.
infinite (bool, optional) – Create an infinite dataset. WARNING: If infinite is False and balancing is used, some tiles will be skipped. Defaults to True.
img_size (int) – Image width in pixels.
normalizer (slideflow.norm.StainNormalizer, optional) – Normalizer to use on images. Defaults to None.
num_parallel_reads (int, optional) – Number of parallel reads for each TFRecordDataset. Defaults to 4.
num_shards (int, optional) – Shard the tfrecord datasets, used for multiprocessing datasets. Defaults to None.
pool (multiprocessing.Pool) – Shared multiprocessing pool. Useful if from_wsi=True, for sharing a unified processing pool between dataloaders. Defaults to None.
rois (list(str), optional) – List of ROI paths. Only used if from_wsi=True. Defaults to None.
roi_method (str, optional) – Method for extracting ROIs. Only used if from_wsi=True. Defaults to ‘auto’.
shard_idx (int, optional) – Index of the tfrecord shard to use. Defaults to None.
standardize (bool, optional) – Standardize images to (0,1). Defaults to True.
tile_um (int, optional) – Size of tiles to extract from WSI, in microns. Only used if from_wsi=True. Defaults to None.
tfrecord_parser (Callable, optional) – Custom parser for TFRecords. Defaults to None.
transform (Callable, optional) – Arbitrary transform function. Performs transformation after augmentations but before standardization. Defaults to None.
**decode_kwargs (dict) – Keyword arguments to pass to slideflow.io.tensorflow.decode_image().
- Returns:
tf.data.Dataset
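The augmentation letters described above can be read as a small lookup table. The following is a minimal sketch of that reading, not slideflow's own parsing code: the helper name `describe_augment` and the `AUGMENTATIONS` table are hypothetical, and treating False or None as "no augmentation" is an assumption.

```python
# Illustrative helper (not part of slideflow): decode an `augment` value
# into the augmentations listed in the parameter description.
AUGMENTATIONS = {
    "x": "random horizontal flip",
    "y": "random vertical flip",
    "r": "random 90-degree rotation",
    "j": "random JPEG compression",
    "b": "random Gaussian blur",
    "n": "random stain augmentation",
}


def describe_augment(augment):
    """Return the augmentation names selected by `augment`, in order."""
    if augment is True:
        augment = "xyrjb"  # documented meaning of True
    if not augment:
        return []  # assumption: False, None, or '' disables augmentation
    unknown = sorted(set(augment) - set(AUGMENTATIONS))
    if unknown:
        raise ValueError(f"Unknown augmentation letters: {unknown}")
    return [AUGMENTATIONS[letter] for letter in augment]


print(describe_augment("xyrjn"))
```

Note that 'n' is only meaningful when a stain normalizer is also supplied, as stated in the list above.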
- tfrecord_report(self, dest: str, normalizer: StainNormalizer | None = None) None ¶
Create a PDF report of TFRecords.
Reports include 10 example tiles per TFRecord. Report is saved in the target destination.
- Parameters:
dest (str) – Directory in which to save the PDF report.
normalizer (slideflow.norm.StainNormalizer, optional) – Normalizer to use on image tiles. Defaults to None.
- tfrecord_heatmap(self, tfrecord: str | List[str], tile_dict: Dict[int, float], filename: str, **kwargs) None ¶
Create a tfrecord-based WSI heatmap.
Uses a dictionary of tile values for heatmap display, and saves to the specified directory.
- tfrecords(self, source: str | None = None) List[str] ¶
Return a list of all tfrecords.
- Parameters:
source (str, optional) – Only return tfrecords from this dataset source. Defaults to None (return all tfrecords in dataset).
- Returns:
List of tfrecords paths.
- tfrecords_by_subfolder(self, subfolder: str) List[str] ¶
Return a list of all tfrecords in a specific subfolder.
Ignores any dataset filters.
- Parameters:
subfolder (str) – Path to subfolder to check for tfrecords.
- Returns:
List of tfrecords paths.
- tfrecords_from_tiles(self, delete_tiles: bool = False) None ¶
Create tfrecord files from a collection of raw images.
Images must be stored in the dataset source(s) tiles directory.
- Parameters:
delete_tiles (bool) – Remove tiles after storing in tfrecords.
- Returns:
None
- transform_tfrecords(self, dest: str, **kwargs) None ¶
Transform TFRecords, saving to a target path.
TFRecords will be saved in the output directory, nested by source name.
- Parameters:
dest (str) – Destination directory in which to save the transformed TFRecords.
- thumbnails(self, outdir: str, size: int = 512, roi: bool = False, enable_downsample: bool = True) None ¶
Generate square slide thumbnails with black borders of fixed size.
Saves thumbnails to the specified directory.
- Parameters:
size (int, optional) – Width/height of thumbnail in pixels. Defaults to 512.
dataset (slideflow.Dataset, optional) – Dataset from which to generate activations. If not supplied, will calculate activations for all tfrecords at the tile_px/tile_um matching the supplied model, optionally using provided filters and filter_blank.
filters (dict, optional) – Dataset filters to use for selecting slides. See slideflow.Dataset.filter() for more information. Defaults to None.
filter_blank (list(str) or str, optional) – Skip slides that have blank values in these patient annotation columns. Defaults to None.
roi (bool, optional) – Include ROI in the thumbnail images. Defaults to False.
enable_downsample (bool, optional) – If True and a thumbnail is not embedded in the slide file, downsampling is permitted to accelerate thumbnail calculation. Defaults to True.
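To make "square thumbnails with black borders of fixed size" concrete, the sketch below computes how a non-square slide could be scaled and centered inside a size x size canvas. It is an assumption about the geometry only; the function name `letterbox` is hypothetical and slideflow's actual thumbnail code may differ.

```python
# Illustrative geometry only (not slideflow's implementation).
def letterbox(width, height, size=512):
    """Scale (width, height) to fit inside a size x size square.

    Returns the scaled (width, height) and the (x, y) offset of the
    top-left corner that centers the image; the rest of the square
    would be filled with the border color.
    """
    scale = size / max(width, height)
    new_w = max(1, round(width * scale))
    new_h = max(1, round(height * scale))
    offset = ((size - new_w) // 2, (size - new_h) // 2)
    return (new_w, new_h), offset


# A 40000 x 20000 px slide fits as 512 x 256, with 128 px borders
# above and below.
print(letterbox(40000, 20000))
```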
- torch(self, labels: Dict[str, Any] | str | DataFrame | None = None, batch_size: int | None = None, rebuild_index: bool = False, from_wsi: bool = False, **kwargs: Any) DataLoader ¶
Return a PyTorch DataLoader object that interleaves tfrecords.
The returned dataloader yields a batch of (image, label) for each tile.
- Parameters:
labels (dict, str, or pd.DataFrame, optional) – If a dict is provided, expect a dict mapping slide names to outcome labels. If a str, will interpret as categorical annotation header. For regression tasks, or outcomes with manually assigned labels, pass the first result of dataset.labels(…). If None, returns slide instead of label.
batch_size (int) – Batch size.
rebuild_index (bool) – Re-build index files even if already present. Defaults to False.
- Keyword Arguments:
augment (str or bool) – Image augmentations to perform. Augmentations include:
'x': Random horizontal flip
'y': Random vertical flip
'r': Random 90-degree rotation
'j': Random JPEG compression (50% chance to compress with quality between 50-100)
'b': Random Gaussian blur (10% chance to blur with sigma between 0.5-2.0)
'n': Random Stain Augmentation (requires stain normalizer)
Combine letters to define augmentations, such as 'xyrjn'. A value of True will use 'xyrjb'.
chunk_size (int, optional) – Chunk size for image decoding. Defaults to 1.
drop_last (bool, optional) – Drop the last non-full batch. Defaults to False.
from_wsi (bool) – Generate predictions from tiles dynamically extracted from whole-slide images, rather than TFRecords. Defaults to False (use TFRecords).
incl_loc (bool, optional) – Include loc_x and loc_y (image tile center coordinates, in base / level=0 dimension) as additional returned variables. Defaults to False.
incl_slidenames (bool, optional) – Include slidenames as third returned variable. Defaults to False.
infinite (bool, optional) – Infinitely repeat data. Defaults to True.
max_size (bool, optional) – Unused argument present for legacy compatibility; will be removed.
model_type (str, optional) – Used to generate random labels (for StyleGAN2). Not required. Defaults to ‘classification’.
num_replicas (int, optional) – Number of GPUs or unique instances which will have their own DataLoader. Used to interleave results among workers without duplications. Defaults to 1.
num_workers (int, optional) – Number of DataLoader workers. Defaults to 2.
normalizer (slideflow.norm.StainNormalizer, optional) – Normalizer. Defaults to None.
onehot (bool, optional) – Onehot encode labels. Defaults to False.
persistent_workers (bool, optional) – Sets the DataLoader persistent_workers flag. Defaults to None (4 if not using a SPAMS normalizer, 1 if using SPAMS).
pin_memory (bool, optional) – Pin memory to GPU. Defaults to True.
pool (multiprocessing.Pool) – Shared multiprocessing pool. Useful if from_wsi=True, for sharing a unified processing pool between dataloaders. Defaults to None.
prefetch_factor (int, optional) – Number of batches to prefetch in each SlideflowIterator. Defaults to 1.
rank (int, optional) – Worker ID to identify this worker. Used to interleave results among workers without duplications. Defaults to 0 (first worker).
rois (list(str), optional) – List of ROI paths. Only used if from_wsi=True. Defaults to None.
roi_method (str, optional) – Method for extracting ROIs. Only used if from_wsi=True. Defaults to ‘auto’.
standardize (bool, optional) – Standardize images to mean 0 and variance of 1. Defaults to True.
tile_um (int, optional) – Size of tiles to extract from WSI, in microns. Only used if from_wsi=True. Defaults to None.
transform (Callable, optional) – Arbitrary torchvision transform function. Performs transformation after augmentations but before standardization. Defaults to None.
tfrecord_parser (Callable, optional) – Custom parser for TFRecords. Defaults to None.
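The rank and num_replicas arguments describe splitting one stream of tiles among several workers so that no tile is seen twice. A minimal sketch of one such scheme, round-robin assignment, is shown below. It is an assumption for illustration, not slideflow's interleaving code, and the helper name `shard_for_rank` is hypothetical.

```python
from itertools import islice


# Illustrative round-robin split (not slideflow's implementation).
def shard_for_rank(items, rank=0, num_replicas=1):
    """Return an iterator over every num_replicas-th item, starting at `rank`."""
    if not 0 <= rank < num_replicas:
        raise ValueError("rank must satisfy 0 <= rank < num_replicas")
    return islice(items, rank, None, num_replicas)


tiles = [f"tile_{i}" for i in range(7)]
shards = [
    list(shard_for_rank(tiles, rank=r, num_replicas=3))
    for r in range(3)
]
# Each tile is assigned to exactly one replica.
print(shards)
```

With the defaults (rank=0, num_replicas=1), the single worker receives every item.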
- unclip(self) Dataset ¶
Return a dataset object with all clips removed.
- Returns:
Dataset with clips removed.
- Return type:
slideflow.Dataset
- update_manifest(self, force_update: bool = False) None ¶
Update tfrecord manifests.
- Parameters:
force_update (bool, optional) – Force regeneration of the manifests from scratch. Defaults to False.
- update_annotations_with_slidenames(self, annotations_file: str) None ¶
Automatically associate slide names and paths in the annotations.
Attempts to automatically associate slide names from a directory with patients in a given annotations file, skipping any slide names that are already present in the annotations file.
- Parameters:
annotations_file (str) – Path to annotations file.
- verify_img_format(self, *, progress: bool = True) str | None ¶
Verify that all tfrecords have the same image format (PNG/JPG).
- Returns:
image format (png or jpeg)
- Return type:
str | None
- verify_slide_names(self, allow_errors: bool = False) bool ¶
Verify that slide names inside TFRecords match the file names.
- Parameters:
allow_errors (bool) – Do not raise an error if there is a mismatch. Defaults to False.
- Returns:
Whether all slide names inside TFRecords match the TFRecord file names.
- Return type:
bool
- Raises:
sf.errors.MismatchedSlideNamesError – If any slide names inside TFRecords do not match the TFRecord file names, and allow_errors=False.
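The check described above, including the effect of allow_errors, can be sketched in plain Python. This is a hedged illustration rather than slideflow's code: the input mapping `records`, the function `check_slide_names`, and the local exception class are hypothetical stand-ins, and comparing against the file name without its extension is an assumption.

```python
from pathlib import Path


class MismatchedSlideNamesError(Exception):
    """Local stand-in for sf.errors.MismatchedSlideNamesError."""


def check_slide_names(records, allow_errors=False):
    """Check that each TFRecord only contains its own slide name.

    `records` (hypothetical input) maps a TFRecord path to the set of
    slide names stored inside that file.
    """
    mismatched = [
        path for path, names in records.items()
        if set(names) != {Path(path).stem}
    ]
    if mismatched and not allow_errors:
        raise MismatchedSlideNamesError(
            f"Mismatched slide names in: {mismatched}"
        )
    return not mismatched


records = {
    "tfrecords/slide1.tfrecords": {"slide1"},
    "tfrecords/slide2.tfrecords": {"slide_two"},
}
# Returns False instead of raising, because allow_errors=True.
print(check_slide_names(records, allow_errors=True))
```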