Dataset Class
DigitalTyphoonDataset
- class pyphoon2.DigitalTyphoonDataset.DigitalTyphoonDataset(image_dir: str, metadata_dir: str, metadata_json: str, labels, split_dataset_by='image', spectrum='Infrared', get_images_by_sequence=False, load_data_into_memory=False, ignore_list=None, filter_func=None, transform_func=None, transform=None, verbose=False)
Bases: Dataset
- __init__(image_dir: str, metadata_dir: str, metadata_json: str, labels, split_dataset_by='image', spectrum='Infrared', get_images_by_sequence=False, load_data_into_memory=False, ignore_list=None, filter_func=None, transform_func=None, transform=None, verbose=False) None
Dataset class for loading the DigitalTyphoon dataset.
- Parameters
image_dir – Path to directory containing directories of typhoon sequences
metadata_dir – Path to directory containing track data for typhoon sequences
metadata_json – Path to the metadata JSON file
split_dataset_by – What unit to treat as an atomic unit when randomly splitting the dataset. Options are “sequence”, “season”, or “image” (individual image)
spectrum – Spectrum to access h5 image files with
get_images_by_sequence – Boolean indicating whether an index refers to an individual image (False) or to an entire sequence (True). If True, each returned item is a list of images.
load_data_into_memory – Whether the images and track data should be loaded entirely into memory. String options are “track” (only track data), “images” (only images), or “all_data” (both track and images); the default in the signature is False.
ignore_list – a list of filenames (not path) to ignore and NOT add to the dataset
filter_func – a function used to filter images out of the dataset. It should accept a DigitalTyphoonImage object and return a bool: True if the image should be included in the dataset, False otherwise
transform_func – this function will be called on the image array for each image when reading in the dataset. It should take and return a numpy image array
transform – Pytorch transform func. Will be called on the tuple of (image/sequence, label array). It should take in said tuple, and return a tuple of (transformed image/sequence, transformed label)
verbose – Print verbose program information
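The callback contracts described above can be illustrated without the library itself. The following is a minimal sketch, not pyphoon2 code: FakeImage is a hypothetical stand-in for DigitalTyphoonImage (its grade() accessor is an assumption made for illustration), keep_strong shows the shape of a filter_func, and pair_transform shows the shape of a transform that receives and returns an (image, label) tuple. The threshold and the scaling are arbitrary.

```python
from typing import List, Tuple


class FakeImage:
    """Hypothetical stand-in for a DigitalTyphoonImage (illustration only)."""

    def __init__(self, grade: int) -> None:
        self._grade = grade

    def grade(self) -> int:  # accessor name is an assumption
        return self._grade


def keep_strong(image: FakeImage) -> bool:
    """filter_func shape: return True to keep the image, False to drop it."""
    return image.grade() >= 4  # arbitrary threshold for the example


def pair_transform(sample: Tuple[List[float], float]) -> Tuple[List[float], float]:
    """transform shape: take (image, label) and return (image, label)."""
    image, label = sample
    scaled = [pixel / 255.0 for pixel in image]  # toy normalisation
    return scaled, label


images = [FakeImage(2), FakeImage(5), FakeImage(4)]
kept = [img.grade() for img in images if keep_strong(img)]
transformed = pair_transform(([0.0, 255.0], 5.0))
```

With the real class, callables of this shape would be passed as filter_func=... and transform=... when constructing the dataset.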
- set_label(label_strs) None
Sets which label(s) to retrieve when accessing the dataset via dataset[idx] or dataset.__getitem__(idx). Options are: season, month, day, hour, grade, lat, lng, pressure, wind, dir50, long50, short50, dir30, long30, short30, landfall, interpolated
- Parameters
label_strs – a single string (e.g. ‘grade’) or a list/tuple of strings (e.g. [‘lat’, ‘lng’]) of labels.
- Returns
None
- random_split(lengths: Sequence[Union[int, float]], split_by=None, generator: Optional[Generator] = <torch._C.Generator object>) List[Subset]
Randomly split a dataset into non-overlapping new datasets of given lengths.
Given a list of proportions or item counts, returns a random split of the dataset with proportions as close as possible to those requested, without leakage across the requested split unit. If the split is by image, the built-in PyTorch function is used. If the split is by season, all images from typhoons starting in the same season are placed in the same bucket. If the split is by sequence, all images from the same typhoon are kept together.
Returns a list of Subsets of indices according to the requested lengths. If the split is by anything other than image, indices within a split unit are not randomized (i.e. the indices of a sequence are kept contiguous rather than mixed with those of other sequences).
If “get_images_by_sequence” is set to True on initialization, split_by image and sequence are functionally identical, and will split the number of sequences into the requested bucket sizes. If split_by=’season’, then sequences with the same season will be placed in the same bucket.
Only non-empty sequences are returned in the split.
For Subset doc see https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset.
- Parameters
lengths – lengths or fractions of splits to be produced
generator – Generator used for the random permutation.
split_by – What to treat as an atomic unit when splitting. Options are “image”, “sequence”, or “season”
- Returns
List[Subset[idx]]
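The leakage-free behaviour described for random_split can be sketched in plain Python. This is an illustration of the general idea only, and not necessarily how pyphoon2 implements it: grouped_split is a hypothetical helper that shuffles whole groups (for example sequences or seasons) and assigns each group to a single bucket, so that no group is ever divided between buckets.

```python
import random
from typing import Dict, List, Sequence


def grouped_split(groups: Dict[str, List[int]],
                  fractions: Sequence[float],
                  seed: int = 0) -> List[List[int]]:
    """Split image indices into buckets without splitting any group.

    `groups` maps a group key (e.g. a sequence ID) to its image indices.
    """
    rng = random.Random(seed)
    keys = list(groups)
    rng.shuffle(keys)  # randomise group order, not the images inside a group

    total = sum(len(v) for v in groups.values())
    targets = [f * total for f in fractions]
    buckets: List[List[int]] = [[] for _ in fractions]

    bucket = 0
    for key in keys:
        # Advance once the current bucket has reached its target size,
        # but never past the last bucket.
        while bucket < len(buckets) - 1 and len(buckets[bucket]) >= targets[bucket]:
            bucket += 1
        buckets[bucket].extend(groups[key])  # the whole group stays together
    return buckets


groups = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7, 8], "D": [9]}
train, test = grouped_split(groups, [0.8, 0.2], seed=42)
```

Because groups are assigned whole, the resulting bucket sizes only approximate the requested fractions, which matches the "as close as possible" wording above.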
- images_from_season(season: int) Subset
Given a start season, return a Subset (Dataset) object containing all the images from that season, in order
- Parameters
season – the start season as an integer
- Returns
Subset
- image_objects_from_season(season: int) List
Given a start season, return a list of DigitalTyphoonImage objects for images from that season
- Parameters
season – the start season as an integer
- Returns
List[DigitalTyphoonImage]
- images_from_seasons(seasons: List[int])
Given a list of seasons, returns a dataset Subset containing all images from those seasons, in order
- Parameters
seasons – List of season integers
- Returns
Subset
- images_from_sequence(sequence_str: str) Subset
Given a sequence ID, returns a Subset of the dataset of the images in that sequence
- Parameters
sequence_str – str, the sequence ID
- Returns
Subset of the total dataset
- image_objects_from_sequence(sequence_str: str) List
Given a sequence ID, returns a list of the DigitalTyphoonImage objects in the sequence in chronological order.
- Parameters
sequence_str – str, the sequence ID
- Returns
List[DigitalTyphoonImage]
- images_from_sequences(sequence_strs: List[str]) Subset
Given a list of sequence IDs, returns a dataset Subset containing all the images within the sequences, in order
- Parameters
sequence_strs – List[str], the sequence IDs
- Returns
Subset of the total dataset
- images_as_tensor(indices: List[int]) Tensor
Given a list of dataset indices, returns the images as a Torch Tensor
- Parameters
indices – List[int]
- Returns
torch Tensor
- labels_as_tensor(indices: List[int], label: str) Tensor
Given a list of dataset indices, returns the specified labels as a Torch Tensor
- Parameters
indices – List[int]
label – str, denoting which label to retrieve
- Returns
torch Tensor
- get_number_of_sequences()
Gets number of sequences (typhoons) in the dataset
- Returns
integer number of sequences
- get_number_of_nonempty_sequences()
Gets number of sequences (typhoons) in the dataset that have at least 1 image
- Returns
integer number of sequences
- get_sequence_ids() List[str]
Returns a list of the sequence IDs in the dataset, as strings
- Returns
List[str]
- get_seasons() List[int]
Returns a list of the seasons that typhoons have started in, in chronological order
- Returns
List[int]
- get_nonempty_seasons() List[int]
Returns a list of the seasons that typhoons have started in, that have at least one image, in chronological order
- Returns
List[int]
- sequence_exists(seq_str: str) bool
Returns whether a sequence with the given sequence ID exists in the dataset
- Parameters
seq_str – string of the sequence ID
- Returns
Boolean True if present, False otherwise
- get_ith_sequence(idx: int) DigitalTyphoonSequence
Given an index idx, returns the idx’th sequence in the dataset
- Parameters
idx – int index
- Returns
DigitalTyphoonSequence
- process_metadata_file(filepath: str)
Reads and processes JSON metadata file’s information into dataset.
- Parameters
filepath – path to metadata file
- Returns
metadata JSON object
- get_seq_ids_from_season(season: int) List[str]
Given a start season, give the sequence ID strings of all sequences that start in that season.
- Parameters
season – the start season as an integer
- Returns
a list of the sequence IDs starting in that season
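The season lookups in this section amount to grouping sequences by their start season. The sketch below shows that grouping on made-up data; the record layout, sequence IDs and variable names are all hypothetical and are not taken from the library.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical (sequence ID, start season, image count) records.
records: List[Tuple[str, int, int]] = [
    ("seqA", 1990, 12),
    ("seqB", 1991, 0),
    ("seqC", 1990, 7),
]

seq_ids_by_season: Dict[int, List[str]] = defaultdict(list)
images_by_season: Dict[int, int] = defaultdict(int)
for seq_id, season, n_images in records:
    seq_ids_by_season[season].append(seq_id)
    images_by_season[season] += n_images

# All seasons in chronological order, then only those with at least one image.
seasons = sorted(seq_ids_by_season)
nonempty_seasons = [s for s in seasons if images_by_season[s] > 0]
```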
- total_image_idx_to_sequence_idx(total_idx: int) int
Given a total dataset image index, returns that image’s index in its respective sequence. For example, an image that is the 500th in the total dataset may be the 5th image in its sequence.
- Parameters
total_idx – the total dataset image index
- Returns
the inner-sequence image index.
- seq_idx_to_total_image_idx(seq_str: str, seq_idx: int) int
Given an image at position seq_idx within its sequence, returns its total index within the greater dataset. For example, an image that is the 5th image in its sequence may be the 500th in the total dataset.
- Parameters
seq_str – The sequence ID string to search within
seq_idx – int, the index within the given sequence
- Returns
int, the total index within the dataset
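The two index conversions above can be pictured with cumulative offsets. The following is a minimal sketch under the assumption that each sequence occupies a contiguous block of total indices; the sequence IDs, lengths and function names are hypothetical, and the library's internal bookkeeping may differ.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Dict, List, Tuple

# Hypothetical sequence IDs and image counts, in dataset order.
seq_lengths: List[Tuple[str, int]] = [("seqA", 3), ("seqB", 5), ("seqC", 2)]

# offsets[i] is the total index of the first image of the i-th sequence.
offsets: List[int] = [0] + list(accumulate(n for _, n in seq_lengths))[:-1]
seq_position: Dict[str, int] = {sid: i for i, (sid, _) in enumerate(seq_lengths)}


def total_to_seq_idx(total_idx: int) -> Tuple[str, int]:
    """Return (sequence ID, index inside that sequence) for a total index."""
    i = bisect_right(offsets, total_idx) - 1
    return seq_lengths[i][0], total_idx - offsets[i]


def seq_to_total_idx(seq_id: str, seq_idx: int) -> int:
    """Return the total dataset index of the seq_idx-th image of seq_id."""
    return offsets[seq_position[seq_id]] + seq_idx
```

The two functions are inverses of each other: converting a total index to a (sequence, inner index) pair and back yields the original index.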
- seq_indices_to_total_indices(seq_obj: DigitalTyphoonSequence) List[int]
Given a sequence, return a list of the total dataset indices of the sequence’s images.
- Parameters
seq_obj – the DigitalTyphoonSequence object to produce the list from
- Returns
the List of total dataset indices
- get_image_from_idx(idx) DigitalTyphoonImage
Given a dataset image idx, returns the image object from that index.
- Parameters
idx – int, the total dataset image idx
- Returns
DigitalTyphoonImage object for that image