pyphoon2 package

Submodules

pyphoon2.DigitalTyphoonDataset module

class pyphoon2.DigitalTyphoonDataset.DigitalTyphoonDataset(image_dir: str, metadata_dir: str, metadata_json: str, labels, split_dataset_by='image', spectrum='Infrared', get_images_by_sequence=False, load_data_into_memory=False, ignore_list=None, filter_func=None, transform_func=None, transform=None, verbose=False)

Bases: Dataset

__init__(image_dir: str, metadata_dir: str, metadata_json: str, labels, split_dataset_by='image', spectrum='Infrared', get_images_by_sequence=False, load_data_into_memory=False, ignore_list=None, filter_func=None, transform_func=None, transform=None, verbose=False) None

Dataset class for the DigitalTyphoon dataset.

Parameters
  • image_dir – Path to directory containing directories of typhoon sequences

  • metadata_dir – Path to directory containing track data for typhoon sequences

  • metadata_json – Path to the metadata JSON file

  • labels – label or labels to return with each item; a single string (e.g. ‘grade’) or a list/tuple of strings (see set_label)

  • split_dataset_by – What unit to treat as an atomic unit when randomly splitting the dataset. Options are “sequence”, “season”, or “image” (individual image)

  • spectrum – Spectrum to access h5 image files with

  • get_images_by_sequence – Boolean indicating whether an index refers to an individual image (False) or an entire sequence (True). If True, each returned item is a list of images.

  • load_data_into_memory – What data should be loaded entirely into memory. Options are False (load nothing, the default), “track” (only track data), “images” (only images), or “all_data” (both track and images).

  • ignore_list – a list of filenames (not paths) to ignore and NOT add to the dataset

  • filter_func – a function used to filter images out of the dataset. It should accept a DigitalTyphoonImage object and return True if the image should be included in the dataset, False otherwise

  • transform_func – this function will be called on the image array for each image when reading in the dataset. It should take and return a numpy image array

  • transform – PyTorch transform function. It will be called on the tuple of (image/sequence, label array), and should return a tuple of (transformed image/sequence, transformed label)

  • verbose – Print verbose program information
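
A minimal construction sketch based on the parameters above. Only the keyword names come from the signature; the directory and file names under the root folder, the keep_image helper, and its thresholds are illustrative assumptions and not part of the package.

```python
from pathlib import Path


def keep_image(image) -> bool:
    """Candidate filter_func: keep images from 2000 onward with grade below 6.

    `image` is expected to expose year() and grade(), as DigitalTyphoonImage
    does. The thresholds are arbitrary illustration values.
    """
    return image.year() >= 2000 and image.grade() < 6


def build_dataset(root: str):
    """Construct the dataset from a root folder (the layout is an assumption)."""
    # Imported lazily so the helper above can be used without pyphoon2 installed.
    from pyphoon2.DigitalTyphoonDataset import DigitalTyphoonDataset

    base = Path(root)
    return DigitalTyphoonDataset(
        str(base / "image"),          # hypothetical directory name
        str(base / "metadata"),       # hypothetical directory name
        str(base / "metadata.json"),  # hypothetical file name
        "grade",                      # labels: a single label string
        split_dataset_by="sequence",
        load_data_into_memory="track",
        filter_func=keep_image,
        verbose=False,
    )
```

Calling build_dataset requires pyphoon2 and a local copy of the data; keep_image can be checked on its own with any object that provides year() and grade().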

set_label(label_strs) None

Sets which label to retrieve when accessing the dataset via dataset[idx] or dataset.__getitem__(idx). Options are: season, month, day, hour, grade, lat, lng, pressure, wind, dir50, long50, short50, dir30, long30, short30, landfall, interpolated

Parameters

label_strs – a single string (e.g. ‘grade’) or a list/tuple of strings (e.g. [‘lat’, ‘lng’]) of labels.

Returns

None

random_split(lengths: Sequence[Union[int, float]], split_by=None, generator: Optional[Generator] = <torch._C.Generator object>) List[Subset]

Randomly split a dataset into non-overlapping new datasets of given lengths.

Given a list of proportions or item counts, returns a random split of the dataset with proportions as close as possible to those requested, without leakage across the requested split unit. If the split is by image, the built-in PyTorch function is used. If the split is by season, all images from typhoons starting in the same season are placed in the same bucket. If the split is by sequence, all images from the same typhoon are kept together.

Returns a list of Subsets of indices according to the requested lengths. If the split is by anything other than image, indices within a split unit are not randomized (i.e. the indices of a sequence are kept contiguous rather than interleaved with those of other sequences).

If “get_images_by_sequence” is set to True on initialization, splitting by image and by sequence are functionally identical: both split the sequences into the requested bucket sizes. If split_by=’season’, sequences with the same season are placed in the same bucket.

Only non-empty sequences are returned in the split.

For Subset doc see https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset.

Parameters
  • lengths – lengths or fractions of splits to be produced

  • generator – Generator used for the random permutation.

  • split_by – What to treat as an atomic unit. Options are “image”, “sequence”, or “season”

Returns

List[Subset[idx]]
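
The leakage-free behaviour described above can be illustrated with a small standard-library sketch. This is not the package's implementation: grouped_split, its seed parameter, and the greedy fill strategy are assumptions chosen only to show how whole groups (sequences or seasons) can be assigned to buckets while indices inside a group stay contiguous.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def grouped_split(groups: Sequence[Hashable], fractions: Sequence[float],
                  seed: int = 0) -> List[List[int]]:
    """Split indices 0..len(groups)-1 into buckets without splitting any group."""
    # Collect the item indices belonging to each group, preserving item order.
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    # Shuffle whole groups, never individual items.
    keys = list(members)
    random.Random(seed).shuffle(keys)

    total = len(groups)
    targets = [fraction * total for fraction in fractions]
    buckets: List[List[int]] = [[] for _ in fractions]
    current = 0
    for key in keys:
        # Advance once the current bucket has reached its target size;
        # the last bucket absorbs whatever remains.
        while current < len(buckets) - 1 and len(buckets[current]) >= targets[current]:
            current += 1
        buckets[current].extend(members[key])
    return buckets
```

Because groups are assigned whole, bucket sizes only approximate the requested fractions, which matches the "as close as possible" wording above.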

images_from_season(season: int) Subset

Given a start season, return a Subset (Dataset) object containing all the images from that season, in order

Parameters

season – the start season, as an int

Returns

Subset

image_objects_from_season(season: int) List

Given a start season, return a list of DigitalTyphoonImage objects for images from that season

Parameters

season – the start season, as an int

Returns

List[DigitalTyphoonImage]

images_from_seasons(seasons: List[int])

Given a list of seasons, returns a dataset Subset containing all images from those seasons, in order

Parameters

seasons – List of season integers

Returns

Subset

images_from_sequence(sequence_str: str) Subset

Given a sequence ID, returns a Subset of the dataset of the images in that sequence

Parameters

sequence_str – str, the sequence ID

Returns

Subset of the total dataset

image_objects_from_sequence(sequence_str: str) List

Given a sequence ID, returns a list of the DigitalTyphoonImage objects in the sequence in chronological order.

Parameters

sequence_str – str, the sequence ID

Returns

List[DigitalTyphoonImage]

images_from_sequences(sequence_strs: List[str]) Subset

Given a list of sequence IDs, returns a dataset Subset containing all the images within the sequences, in order

Parameters

sequence_strs – List[str], the sequence IDs

Returns

Subset of the total dataset

images_as_tensor(indices: List[int]) Tensor

Given a list of dataset indices, returns the images as a Torch Tensor

Parameters

indices – List[int]

Returns

torch Tensor

labels_as_tensor(indices: List[int], label: str) Tensor

Given a list of dataset indices, returns the specified labels as a Torch Tensor

Parameters
  • indices – List[int]

  • label – str, denoting which label to retrieve

Returns

torch Tensor

get_number_of_sequences()

Gets number of sequences (typhoons) in the dataset

Returns

integer number of sequences

get_number_of_nonempty_sequences()

Gets number of sequences (typhoons) in the dataset that have at least 1 image

Returns

integer number of sequences

get_sequence_ids() List[str]

Returns a list of the sequence IDs in the dataset, as strings

Returns

List[str]

get_seasons() List[int]

Returns a list of the seasons that typhoons have started in, in chronological order

Returns

List[int]

get_nonempty_seasons() List[int]

Returns a list of the seasons that typhoons have started in, that have at least one image, in chronological order

Returns

List[int]

sequence_exists(seq_str: str) bool

Returns whether a sequence with the given sequence ID exists in the dataset

Parameters

seq_str – str, the sequence ID

Returns

Boolean True if present, False otherwise

get_ith_sequence(idx: int) DigitalTyphoonSequence

Given an index idx, returns the idx’th sequence in the dataset

Parameters

idx – int index

Returns

DigitalTyphoonSequence

process_metadata_file(filepath: str)

Reads and processes JSON metadata file’s information into dataset.

Parameters

filepath – path to metadata file

Returns

metadata JSON object

get_seq_ids_from_season(season: int) List[str]

Given a start season, give the sequence ID strings of all sequences that start in that season.

Parameters

season – the start season, as an int

Returns

a list of the sequence IDs starting in that season
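
As a rough illustration of the lookup described above, the sketch below groups sequence IDs by start season with a dictionary. The function name, the input format (pairs of sequence ID and start season), and the sample IDs are hypothetical; the package's internal bookkeeping may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_season(sequences: Iterable[Tuple[str, int]]) -> Dict[int, List[str]]:
    """Map each start season to the sequence IDs that begin in it."""
    by_season: Dict[int, List[str]] = defaultdict(list)
    for seq_id, start_season in sequences:
        by_season[start_season].append(seq_id)
    return dict(by_season)


# Hypothetical sequence IDs paired with their start seasons.
index = group_by_season([("200801", 2008), ("200802", 2008), ("197830", 1978)])
```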

total_image_idx_to_sequence_idx(total_idx: int) int

Given a total dataset image index, returns that image’s index in its respective sequence. For example, an image that is the 500th in the total dataset may be the 5th image in its sequence.

Parameters

total_idx – the total dataset image index

Returns

the inner-sequence image index.

seq_idx_to_total_image_idx(seq_str: str, seq_idx: int) int

Given an image at position seq_idx within its sequence, returns its total index within the whole dataset. For example, an image that is the 5th image in its sequence may be the 500th in the total dataset.

Parameters
  • seq_str – The sequence ID string to search within

  • seq_idx – int, the index within the given sequence

Returns

int, the total index within the dataset

seq_indices_to_total_indices(seq_obj: DigitalTyphoonSequence) List[int]

Given a sequence, return a list of the total dataset indices of the sequence’s images.

Parameters

seq_obj – the DigitalTyphoonSequence object to produce the list from

Returns

the List of total dataset indices
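
The three index conversions above can be pictured with a cumulative-offset table. The sketch below is an assumption about one way to implement the mapping, not the package's code; IndexMap and its method names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


class IndexMap:
    """Converts between total dataset indices and per-sequence indices."""

    def __init__(self, seq_ids: List[str], seq_lengths: List[int]):
        self.seq_ids = seq_ids
        self.lengths = seq_lengths
        # starts[i] is the total index of the first image of sequence i.
        self.starts = [0] + list(accumulate(seq_lengths))[:-1]
        self.total = sum(seq_lengths)

    def total_to_seq(self, total_idx: int) -> Tuple[str, int]:
        if not 0 <= total_idx < self.total:
            raise IndexError(total_idx)
        # Rightmost sequence whose start is <= total_idx; this skips empty sequences.
        pos = bisect_right(self.starts, total_idx) - 1
        return self.seq_ids[pos], total_idx - self.starts[pos]

    def seq_to_total(self, seq_id: str, seq_idx: int) -> int:
        pos = self.seq_ids.index(seq_id)
        if not 0 <= seq_idx < self.lengths[pos]:
            raise IndexError(seq_idx)
        return self.starts[pos] + seq_idx

    def seq_to_totals(self, seq_id: str) -> List[int]:
        pos = self.seq_ids.index(seq_id)
        return list(range(self.starts[pos], self.starts[pos] + self.lengths[pos]))
```

Indices here are zero-based, so the "5th image" in the prose corresponds to seq_idx 4.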

get_image_from_idx(idx) DigitalTyphoonImage

Given a dataset image idx, returns the image object from that index.

Parameters

idx – int, the total dataset image idx

Returns

DigitalTyphoonImage object for that image

pyphoon2.DigitalTyphoonImage module

class pyphoon2.DigitalTyphoonImage.DigitalTyphoonImage(image_filepath: str, track_entry: ndarray, sequence_id=None, load_imgs_into_mem=False, transform_func=None, spectrum='Infrared')

Bases: object

__init__(image_filepath: str, track_entry: ndarray, sequence_id=None, load_imgs_into_mem=False, transform_func=None, spectrum='Infrared')

Class for one image with metadata for the DigitalTyphoonDataset

Does NOT check for file existence until accessing the image.

Parameters
  • image_filepath – str, path to image file

  • track_entry – np.ndarray, 1d numpy array for the track csv entry corresponding to the image

  • load_imgs_into_mem – bool, flag indicating whether images should be loaded into memory

  • spectrum – str, default spectrum to read the image in

  • transform_func – this function will be called on the image array when the array is accessed (or read into memory). It should take and return a numpy image array

image(spectrum=None) ndarray

Returns the image as a numpy array. If load_imgs_into_mem was set to True, the image is cached in memory.

Parameters

spectrum – spectrum (channel) the image was taken in

Returns

np.ndarray, the image
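
The deferred file access and optional caching described for this class can be sketched as follows. LazyImage and its loader argument are hypothetical stand-ins used to show the pattern; the real class reads h5 files and applies transform_func, which this sketch omits.

```python
from typing import Callable, Optional


class LazyImage:
    """Holds a file path and reads the image only when it is requested."""

    def __init__(self, filepath: str, loader: Callable[[str], object],
                 cache: bool = False):
        self.filepath = filepath  # not checked for existence at this point
        self._loader = loader
        self._cache = cache
        self._data: Optional[object] = None

    def image(self):
        # Serve the cached copy if one was stored by an earlier call.
        if self._data is not None:
            return self._data
        data = self._loader(self.filepath)
        if self._cache:
            self._data = data
        return data
```

With cache=False every call goes back to the loader, which trades speed for a smaller memory footprint.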

sequence_id() str

Returns the sequence ID this image belongs to

Returns

str, the sequence ID

track_array() ndarray

Returns the csv row for this image

Returns

np.ndarray containing the track data

value_from_string(label)

Returns the image’s value given the label as a string. e.g. value_from_string(‘month’) -> the month

Returns

the element

year() int

Returns the year the image was taken

Returns

int, the year

month() int

Returns the month the image was taken

Returns

int, the month (1-12)

day() int

Returns the day the image was taken (number 1-31)

Returns

int, the day

hour() int

Returns the hour the image was taken

Returns

int, the hour

datetime() datetime

Returns a datetime object of when the image was taken

Returns

datetime

grade() int

Returns the grade of the typhoon in the image

Returns

int, the grade

lat() float

Returns the latitude of the image

Returns

float

long() float

Returns the longitude of the image

Returns

float

pressure() float

Returns the pressure recorded in the track entry (units are not documented)

Returns

float

wind() float

Returns the wind speed recorded in the track entry (units are not documented)

Returns

float

dir50() float

Returns the dir50 value recorded in the track entry (meaning not documented)

Returns

float

long50() float

Returns the long50 value recorded in the track entry (meaning not documented)

Returns

float

short50() float

Returns the short50 value recorded in the track entry (meaning not documented)

Returns

float

dir30() float

Returns the dir30 value recorded in the track entry (meaning not documented)

Returns

float

long30() float

Returns the long30 value recorded in the track entry (meaning not documented)

Returns

float

short30() float

Returns the short30 value recorded in the track entry (meaning not documented)

Returns

float

landfall() float

Returns the landfall value recorded in the track entry (meaning not documented)

Returns

float

interpolated() bool

Returns whether this entry is interpolated or not

Returns

bool

filepath() str

Returns the filepath to the image

Returns

str

mask_1() float

Returns number of pixels in the image that are corrupted

Returns

float, the number of pixels

mask_1_percent() float

Returns percentage of pixels in the image that are corrupted

Returns

float, the percentage of pixels

set_track_data(track_entry: ndarray) None

Sets the track entry

Parameters

track_entry – numpy array representing one entry of the track csv

Returns

None

set_image_data(image_filepath: str, load_imgs_into_mem=False, spectrum=None) None

Sets the image file data

Parameters
  • load_imgs_into_mem – bool, whether to load images into memory

  • spectrum – str, spectrum to open h5 images with

  • image_filepath – string to image

Returns

None

pyphoon2.DigitalTyphoonSequence module

class pyphoon2.DigitalTyphoonSequence.DigitalTyphoonSequence(seq_str: str, start_season: int, num_images: int, transform_func=None, spectrum='Infrared', verbose=False)

Bases: object

__init__(seq_str: str, start_season: int, num_images: int, transform_func=None, spectrum='Infrared', verbose=False)

Class representing one typhoon sequence from the DigitalTyphoon dataset

Parameters
  • seq_str – str, sequence ID as a string

  • start_season – int, the season in which the typhoon starts

  • num_images – int, number of images in the sequence

  • transform_func – this function will be called on each image before saving it/returning it. It should take and return a np array

get_sequence_str() str

Returns the sequence ID as a string

Returns

string sequence ID

process_seq_img_dir_into_sequence(directory_path: str, load_imgs_into_mem=False, ignore_list=None, spectrum=None, filter_func=<function DigitalTyphoonSequence.<lambda>>) None

Given a path to a directory containing images of a typhoon sequence, process the images into the current sequence object. If ‘load_imgs_into_mem’ is set to True, the images will be read as numpy arrays and stored in memory. Spectrum refers to what light spectrum the image lies in.

Parameters
  • directory_path – Path to the typhoon sequence directory

  • load_imgs_into_mem – Bool representing if images should be loaded into memory

  • ignore_list – list of image filenames to ignore

  • spectrum – string representing what spectrum the image lies in

  • filter_func – function that accepts an image and returns True if it should be included in the sequence, False otherwise

Returns

None
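
A simplified sketch of the directory scan described above, using only the standard library. It is an assumption about the general shape of the step: list_sequence_images and its keep callback are hypothetical, and the callback here receives a filename, whereas the documented filter_func receives an image object.

```python
import os
import tempfile
from typing import Callable, Iterable, List, Optional


def list_sequence_images(directory_path: str,
                         ignore_list: Optional[Iterable[str]] = None,
                         keep: Callable[[str], bool] = lambda name: True) -> List[str]:
    """Return paths of the h5 files in a sequence directory, in sorted order."""
    ignored = set(ignore_list or [])
    selected = []
    for name in sorted(os.listdir(directory_path)):
        if not name.endswith(".h5"):  # only h5 image files are considered
            continue
        if name in ignored:           # ignore_list holds filenames, not paths
            continue
        if keep(name):
            selected.append(os.path.join(directory_path, name))
    return selected


# Small demonstration on a temporary directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.h5", "b.h5", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p)
             for p in list_sequence_images(tmp, ignore_list=["b.h5"])]
```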

get_start_season() int

Get the start season of the sequence

Returns

int, the start season

get_num_images() int

Gets the number of images in the sequence

Returns

int

get_num_original_images() int

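Gets the original number of images in the sequence (presumably the num_images value given at initialization, which may differ from the number of images currently held)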

Returns

int, the number of images

has_images() bool

Returns true if the sequence currently holds images (or image filepaths). False otherwise.

Returns

bool

process_track_data(track_filepath: str, csv_delimiter=',') None

Takes in the track data for the sequence and processes it into the images for the sequence.

Parameters
  • track_filepath – str, path to track csv

  • csv_delimiter – delimiter for the csv file

Returns

None

add_track_data(filename: str, csv_delimiter=',') None

Reads and adds the track data to the sequence.

Parameters
  • filename – str, path to the track data

  • csv_delimiter – char, delimiter to use to read the csv

Returns

None

set_track_path(track_path: str) None

Sets the path to the track data file

Parameters

track_path – str, filepath to the track data

Returns

None

get_track_path() str

Gets the path to the track data file

Returns

str, the path to the track data file

get_track_data() ndarray

Returns the track csv data as a numpy array, with each row corresponding to a row in the CSV.

Returns

np.ndarray

get_image_at_idx(idx: int, spectrum='Infrared') DigitalTyphoonImage

Returns the idx’th DigitalTyphoonImage in the sequence. Raises an exception if idx is outside the sequence’s range.

Parameters
  • idx – int, idx to access

  • spectrum – str, spectrum of the image

Returns

DigitalTyphoonImage, the image object

get_image_at_idx_as_numpy(idx: int, spectrum=None) ndarray

Gets the idx’th image in the sequence as a numpy array. Raises an exception if the idx is outside of the sequence’s range.

Parameters
  • idx – int, idx to access

  • spectrum – str, spectrum of the image

Returns

np.ndarray, image as a numpy array with shape of the image dimensions

get_all_images_in_sequence() List[DigitalTyphoonImage]

Returns all of the image objects (DigitalTyphoonImage) in the sequence in order.

Returns

List[DigitalTyphoonImage]

return_all_images_in_sequence_as_np(spectrum=None) ndarray

Returns all the images in a sequence as a numpy array of shape (num_images, image_shape[0], image_shape[1])

Parameters

spectrum – str, spectrum of the image

Returns

np.ndarray of shape (num_image, image_shape[0], image_shape[1])

num_images_match_num_expected() bool

Returns True if the number of image filepaths stored matches the number of images stated when initializing the sequence object. False otherwise.

Returns

bool

get_image_filepaths() List[str]

Returns a list of the filenames of the images (without the root path)

Returns

List[str], list of the filenames

set_images_root_path(images_root_path: str) None

Sets the root path of the images.

Parameters

images_root_path – str, the root path

Returns

None

get_images_root_path() str

Gets the root path to the image directory

Returns

str, the root path

pyphoon2.DigitalTyphoonUtils module

class pyphoon2.DigitalTyphoonUtils.SPLIT_UNIT(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum denoting which unit to treat as atomic when splitting the dataset

SEQUENCE = 'sequence'
SEASON = 'season'
IMAGE = 'image'

classmethod has_value(value)

Returns true if value is present in the enum

Parameters

value – str, the value to check for

Returns

bool
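
A stand-in enum showing how a has_value check of this kind can work. SplitUnit mirrors the values listed above, but the method body is an assumption; the package's own implementation is not shown in this reference.

```python
from enum import Enum


class SplitUnit(Enum):
    """Stand-in for SPLIT_UNIT with the same member values."""
    SEQUENCE = "sequence"
    SEASON = "season"
    IMAGE = "image"

    @classmethod
    def has_value(cls, value) -> bool:
        # True if any member carries exactly this value.
        return any(value == member.value for member in cls)
```

A check like this lets a constructor reject an unsupported split_dataset_by string early instead of failing later during a split.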

class pyphoon2.DigitalTyphoonUtils.LOAD_DATA(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum denoting what level of data should be stored in memory

NO_DATA = False
ONLY_TRACK = 'track'
ONLY_IMG = 'images'
ALL_DATA = 'all_data'

classmethod has_value(value)

class pyphoon2.DigitalTyphoonUtils.TRACK_COLS(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

Bases: Enum

Enum containing indices in a track csv col to find the respective data

YEAR = 0
MONTH = 1
DAY = 2
HOUR = 3
GRADE = 4
LAT = 5
LNG = 6
PRESSURE = 7
WIND = 8
DIR50 = 9
LONG50 = 10
SHORT50 = 11
DIR30 = 12
LONG30 = 13
SHORT30 = 14
LANDFALL = 15
INTERPOLATED = 16
FILENAME = 17
MASK_1 = 18
MASK_1_PERCENT = 19

classmethod str_to_value(name)

classmethod has_value(value)

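
The column indices above can be used to read fields out of a single track row. The sketch below mirrors a few of the listed members in a stand-in enum; TrackCols, row_datetime, row_label, and the case-insensitive behaviour of str_to_value are assumptions, and the sample row is made up.

```python
from datetime import datetime
from enum import Enum


class TrackCols(Enum):
    """Stand-in for TRACK_COLS, restricted to the columns used below."""
    YEAR = 0
    MONTH = 1
    DAY = 2
    HOUR = 3
    GRADE = 4
    LAT = 5
    LNG = 6

    @classmethod
    def str_to_value(cls, name: str) -> int:
        # Assumed behaviour: look a column up by its name, ignoring case.
        return cls[name.upper()].value


def row_datetime(row) -> datetime:
    """Build a datetime from the year, month, day and hour columns of a row."""
    year, month, day, hour = (
        int(row[TrackCols[col].value]) for col in ("YEAR", "MONTH", "DAY", "HOUR")
    )
    return datetime(year, month, day, hour)


def row_label(row, label: str):
    """Return the value stored under a label name such as 'lat'."""
    return row[TrackCols.str_to_value(label)]


# Made-up row: 2008-05-21 06:00, grade 3, latitude 15.5, longitude 130.2.
sample_row = [2008.0, 5.0, 21.0, 6.0, 3.0, 15.5, 130.2]
```
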
pyphoon2.DigitalTyphoonUtils.parse_image_filename(filename: str, separator='-') (str, datetime, str)

Takes the filename of a Digital Typhoon image and parses it to return the sequence ID it belongs to, the date it was taken, and the satellite that took the image

Parameters
  • filename – str, filename of the image

  • separator – char, separator used in the filename

Returns

(str, datetime, str), Tuple containing the sequence ID, the datetime, and satellite string
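
A hedged sketch of how such a filename might be parsed. The field layout (a YYYYMMDDHH date, then the sequence ID, then the satellite, then a trailing suffix) and the sample filename are assumptions for illustration; parse_name is a hypothetical helper, not the package function.

```python
from datetime import datetime
from typing import Tuple


def parse_name(filename: str, separator: str = "-") -> Tuple[str, datetime, str]:
    """Split a filename into (sequence ID, datetime, satellite)."""
    # Assumed layout: <YYYYMMDDHH><sep><sequence ID><sep><satellite><sep><suffix>
    date_part, seq_id, satellite = filename.split(separator)[:3]
    taken_at = datetime.strptime(date_part, "%Y%m%d%H")
    return seq_id, taken_at, satellite


# Made-up filename following the assumed layout.
parsed = parse_name("2008052106-200801-MTS1-1.h5")
```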

pyphoon2.DigitalTyphoonUtils.get_seq_str_from_track_filename(filename: str) str

Given a track filename, returns the sequence ID it belongs to

Parameters

filename – str, the filename

Returns

str, the sequence ID string

pyphoon2.DigitalTyphoonUtils.is_image_file(filename: str) bool

Given a DigitalTyphoon file, returns if it is an h5 image.

Parameters

filename – str, the filename

Returns

bool, True if it is an h5 image, False otherwise
