Dataset Class

DigitalTyphoonDataset

class pyphoon2.DigitalTyphoonDataset.DigitalTyphoonDataset(image_dir: str, metadata_dir: str, metadata_json: str, labels, split_dataset_by='image', spectrum='Infrared', get_images_by_sequence=False, load_data_into_memory=False, ignore_list=None, filter_func=None, transform_func=None, transform=None, verbose=False)

Bases: Dataset

__init__(image_dir: str, metadata_dir: str, metadata_json: str, labels, split_dataset_by='image', spectrum='Infrared', get_images_by_sequence=False, load_data_into_memory=False, ignore_list=None, filter_func=None, transform_func=None, transform=None, verbose=False) None

Dataset class for the DigitalTyphoon dataset.

Parameters
  • image_dir – Path to directory containing directories of typhoon sequences

  • metadata_dir – Path to directory containing track data for typhoon sequences

  • metadata_json – Path to the metadata JSON file

  • split_dataset_by – What unit to treat as an atomic unit when randomly splitting the dataset. Options are “sequence”, “season”, or “image” (individual image)

  • spectrum – Spectrum to access h5 image files with

  • get_images_by_sequence – Boolean representing if an index should refer to an individual image or an entire sequence. If sequence, returned images are Lists of images.

  • load_data_into_memory – Which data, if any, should be loaded entirely into memory. Options are “track” (only track data), “images” (only images), or “all_data” (both track and images). Defaults to False.

  • ignore_list – a list of filenames (not path) to ignore and NOT add to the dataset

  • filter_func – a function used to filter out images from the dataset. It should accept a DigitalTyphoonImage object and return True if the image should be included in the dataset, False otherwise

  • transform_func – this function will be called on the image array for each image when reading in the dataset. It should take and return a numpy image array

  • transform – PyTorch transform function. Will be called on the tuple of (image/sequence, label array). It should take in said tuple and return a tuple of (transformed image/sequence, transformed label)

  • verbose – Print verbose program information
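As a rough illustration of how these parameters fit together, the sketch below defines a filter_func and passes it to the constructor. The `grade()` accessor on the image object, the `MIN_GRADE` threshold, and the helper names `keep_strong_typhoons` and `build_dataset` are assumptions made for this example and are not taken from the reference above; the three paths are left as arguments for the caller to supply.

```python
MIN_GRADE = 3  # arbitrary threshold, chosen only for illustration


def keep_strong_typhoons(image) -> bool:
    """filter_func sketch: return True to keep the image, False to drop it.

    Assumes the image object exposes a grade() accessor; that method name
    is a guess for illustration and is not documented in this section.
    """
    return image.grade() >= MIN_GRADE


def build_dataset(image_dir: str, metadata_dir: str, metadata_json: str):
    """Construct the dataset using only parameters listed above."""
    # Imported lazily so the filter above can be exercised without pyphoon2.
    from pyphoon2.DigitalTyphoonDataset import DigitalTyphoonDataset

    return DigitalTyphoonDataset(
        image_dir,
        metadata_dir,
        metadata_json,
        labels="grade",               # one of the label names listed under set_label
        split_dataset_by="sequence",  # keep each typhoon together when splitting
        filter_func=keep_strong_typhoons,
        verbose=False,
    )
```

Because filter_func only has to return a bool for one image object, it can be checked in isolation with a small stand-in object before being handed to the dataset.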

set_label(label_strs) None

Sets which label to retrieve when accessing the dataset via dataset[idx] or dataset.__getitem__(idx). Options are: season, month, day, hour, grade, lat, lng, pressure, wind, dir50, long50, short50, dir30, long30, short30, landfall, interpolated

Parameters

label_strs – a single string (e.g. ‘grade’) or a list/tuple of strings (e.g. [‘lat’, ‘lng’]) of labels.

Returns

None

random_split(lengths: Sequence[Union[int, float]], split_by=None, generator: Optional[Generator] = <torch._C.Generator object>) List[Subset]

Randomly split a dataset into non-overlapping new datasets of given lengths.

Given a list of proportions or item counts, returns a random split of the dataset with proportions as close as possible to those requested, without leaking any split unit across buckets. If the split is by image, the built-in PyTorch function is used. If the split is by season, all images from typhoons starting in the same season will be placed in the same bucket. If the split is by sequence, all images from the same typhoon will be kept together.

Returns a list of Subsets of indices according to the requested lengths. If the split is by anything other than image, indices within a split unit are not randomized (i.e. the indices of a sequence are kept contiguous and are not interleaved with those of other sequences).

If “get_images_by_sequence” is set to True on initialization, splitting by “image” and by “sequence” are functionally identical, and both split the number of sequences into the requested bucket sizes. If split_by is “season”, sequences with the same season will be placed in the same bucket.

Only non-empty sequences are returned in the split.

For Subset doc see https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset.

Parameters
  • lengths – lengths or fractions of splits to be produced

  • generator – Generator used for the random permutation.

  • split_by – What to treat as an atomic unit when splitting. Options are “image”, “sequence”, or “season”

Returns

List[Subset[idx]]
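The leakage-free behaviour described above can be pictured with a small standalone sketch. This is not the library's implementation; it is one simple way to split by group under the stated constraints, and the function name `grouped_random_split` and its arguments are hypothetical. Whole groups (sequences or seasons) are shuffled, then assigned to buckets until each bucket reaches its target size, so bucket sizes are only approximately the requested fractions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def grouped_random_split(
    group_of_index: Sequence[Hashable],
    fractions: Sequence[float],
    seed: int = 0,
) -> List[List[int]]:
    """Split indices 0..n-1 into buckets so that no group spans two buckets."""
    # Collect the indices belonging to each group, preserving their order.
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, group in enumerate(group_of_index):
        members[group].append(idx)

    # Shuffle the groups themselves, not the individual indices.
    groups = list(members)
    random.Random(seed).shuffle(groups)

    total = len(group_of_index)
    targets = [fraction * total for fraction in fractions]
    buckets: List[List[int]] = [[] for _ in fractions]

    bucket = 0
    for group in groups:
        # Advance once the current bucket has reached its target size,
        # unless it is already the last bucket.
        while bucket < len(buckets) - 1 and len(buckets[bucket]) >= targets[bucket]:
            bucket += 1
        # A group's indices are appended together, so they stay contiguous.
        buckets[bucket].extend(members[group])
    return buckets
```

With `group_of_index` set to the sequence ID (or season) of every image, each returned bucket contains complete groups only, which is the property that prevents images of one typhoon from appearing in both a training and a test split.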

images_from_season(season: int) Subset

Given a start season, return a Subset (Dataset) object containing all the images from that season, in order

Parameters

season – the start season, as an int

Returns

Subset

image_objects_from_season(season: int) List

Given a start season, return a list of DigitalTyphoonImage objects for images from that season

Parameters

season – the start season, as an int

Returns

List[DigitalTyphoonImage]

images_from_seasons(seasons: List[int])

Given a list of seasons, returns a dataset Subset containing all images from those seasons, in order

Parameters

seasons – List of season integers

Returns

Subset
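One possible use of the season-based accessors is a temporal holdout, where the most recent seasons are reserved for evaluation. The sketch below relies only on get_nonempty_seasons() and images_from_seasons() as documented in this reference; the helper names `holdout_latest_seasons` and `season_holdout` are hypothetical, and the dataset is passed in as an argument rather than constructed here.

```python
from typing import List, Tuple


def holdout_latest_seasons(seasons: List[int], n_test: int) -> Tuple[List[int], List[int]]:
    """Split a chronological list of seasons into (earlier, latest n_test)."""
    if not 0 < n_test < len(seasons):
        raise ValueError("n_test must be between 1 and len(seasons) - 1")
    return seasons[:-n_test], seasons[-n_test:]


def season_holdout(dataset, n_test: int):
    """Return (train_subset, test_subset), each built from whole seasons."""
    # get_nonempty_seasons() is documented as returning seasons in
    # chronological order, so the last n_test entries are the latest ones.
    train_seasons, test_seasons = holdout_latest_seasons(
        dataset.get_nonempty_seasons(), n_test
    )
    return (
        dataset.images_from_seasons(train_seasons),
        dataset.images_from_seasons(test_seasons),
    )
```

Splitting on whole seasons this way keeps every typhoon that started in a held-out season out of the training subset.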

images_from_sequence(sequence_str: str) Subset

Given a sequence ID, returns a Subset of the dataset of the images in that sequence

Parameters

sequence_str – str, the sequence ID

Returns

Subset of the total dataset

image_objects_from_sequence(sequence_str: str) List

Given a sequence ID, returns a list of the DigitalTyphoonImage objects in the sequence in chronological order.

Parameters

sequence_str – str, the sequence ID

Returns

List[DigitalTyphoonImage]

images_from_sequences(sequence_strs: List[str]) Subset

Given a list of sequence IDs, returns a dataset Subset containing all the images within the sequences, in order

Parameters

sequence_strs – List[str], the sequence IDs

Returns

Subset of the total dataset

images_as_tensor(indices: List[int]) Tensor

Given a list of dataset indices, returns the images as a Torch Tensor

Parameters

indices – List[int]

Returns

torch Tensor

labels_as_tensor(indices: List[int], label: str) Tensor

Given a list of dataset indices, returns the specified labels as a Torch Tensor

Parameters
  • indices – List[int]

  • label – str, denoting which label to retrieve

Returns

torch Tensor

get_number_of_sequences()

Gets number of sequences (typhoons) in the dataset

Returns

integer number of sequences

get_number_of_nonempty_sequences()

Gets number of sequences (typhoons) in the dataset that have at least 1 image

Returns

integer number of sequences

get_sequence_ids() List[str]

Returns a list of the sequence IDs in the dataset, as strings

Returns

List[str]

get_seasons() List[int]

Returns a list of the seasons that typhoons have started in, in chronological order

Returns

List[int]

get_nonempty_seasons() List[int]

Returns a list of the seasons that typhoons have started in, that have at least one image, in chronological order

Returns

List[int]

sequence_exists(seq_str: str) bool

Returns whether a sequence with the given sequence ID exists in the dataset

Parameters

seq_str – the sequence ID, as a string

Returns

Boolean True if present, False otherwise

get_ith_sequence(idx: int) DigitalTyphoonSequence

Given an index idx, returns the idx’th sequence in the dataset

Parameters

idx – int index

Returns

DigitalTyphoonSequence

process_metadata_file(filepath: str)

Reads and processes JSON metadata file’s information into dataset.

Parameters

filepath – path to metadata file

Returns

metadata JSON object

get_seq_ids_from_season(season: int) List[str]

Given a start season, give the sequence ID strings of all sequences that start in that season.

Parameters

season – the start season, as an int

Returns

a list of the sequence IDs starting in that season

total_image_idx_to_sequence_idx(total_idx: int) int

Given a total dataset image index, returns that image’s index in its respective sequence. e.g. an image that is the 500th in the total dataset may be the 5th image in its sequence.

Parameters

total_idx – the total dataset image index

Returns

the inner-sequence image index.

seq_idx_to_total_image_idx(seq_str: str, seq_idx: int) int

Given an image with seq_idx position within its sequence, return its total idx within the greater dataset. e.g. an image that is the 5th image in the sequence may be the 500th in the total dataset.

Parameters
  • seq_str – The sequence ID string to search within

  • seq_idx – int, the index within the given sequence

Returns

int, the total index within the dataset

seq_indices_to_total_indices(seq_obj: DigitalTyphoonSequence) List[int]

Given a sequence, return a list of the total dataset indices of the sequence’s images.

Parameters

seq_obj – the DigitalTyphoonSequence object to produce the list from

Returns

the List of total dataset indices
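The relationship between dataset-wide indices and per-sequence indices can be sketched with cumulative offsets. This is an illustration of the idea, not the library's code: it assumes the images of each sequence occupy one contiguous run of dataset indices, and the class `SequenceIndex`, its method names, and the placeholder IDs "A", "B", "C" are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


class SequenceIndex:
    """Maps between dataset-wide image indices and per-sequence indices."""

    def __init__(self, seq_ids: List[str], seq_lengths: List[int]) -> None:
        self.seq_ids = seq_ids
        self.lengths = seq_lengths
        # starts[i] is the dataset-wide index of the first image of sequence i.
        self.starts: List[int] = []
        running = 0
        for length in seq_lengths:
            self.starts.append(running)
            running += length
        self.total = running

    def total_to_seq(self, total_idx: int) -> Tuple[str, int]:
        """Return (sequence ID, index within that sequence) for a total index."""
        if not 0 <= total_idx < self.total:
            raise IndexError(total_idx)
        # Last sequence whose start is <= total_idx; empty sequences are skipped
        # because a later sequence shares the same start offset.
        pos = bisect_right(self.starts, total_idx) - 1
        return self.seq_ids[pos], total_idx - self.starts[pos]

    def seq_to_total(self, seq_id: str, seq_idx: int) -> int:
        """Return the dataset-wide index of image seq_idx of sequence seq_id."""
        pos = self.seq_ids.index(seq_id)
        if not 0 <= seq_idx < self.lengths[pos]:
            raise IndexError(seq_idx)
        return self.starts[pos] + seq_idx


index = SequenceIndex(["A", "B", "C"], [3, 0, 2])  # "B" is an empty sequence
seq_id, inner = index.total_to_seq(4)              # 5th image overall
round_trip = index.seq_to_total(seq_id, inner)     # maps back to 4
```

The two conversions are inverses of each other, which mirrors the pairing of total_image_idx_to_sequence_idx and seq_idx_to_total_image_idx described above.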

get_image_from_idx(idx) DigitalTyphoonImage

Given a dataset image idx, returns the image object from that index.

Parameters

idx – int, the total dataset image idx

Returns

DigitalTyphoonImage object for that image