pyphoon2 package
Submodules
pyphoon2.DigitalTyphoonDataset module
- class pyphoon2.DigitalTyphoonDataset.DigitalTyphoonDataset(image_dir: str, metadata_dir: str, metadata_json: str, labels, split_dataset_by='image', spectrum='Infrared', get_images_by_sequence=False, load_data_into_memory=False, ignore_list=None, filter_func=None, transform_func=None, transform=None, verbose=False)
Bases: Dataset
- __init__(image_dir: str, metadata_dir: str, metadata_json: str, labels, split_dataset_by='image', spectrum='Infrared', get_images_by_sequence=False, load_data_into_memory=False, ignore_list=None, filter_func=None, transform_func=None, transform=None, verbose=False) None
PyTorch Dataset for the DigitalTyphoon dataset.
- Parameters
image_dir – Path to directory containing directories of typhoon sequences
metadata_dir – Path to directory containing track data for typhoon sequences
metadata_json – Path to the metadata JSON file
split_dataset_by – What unit to treat as an atomic unit when randomly splitting the dataset. Options are “sequence”, “season”, or “image” (individual image)
spectrum – Spectrum to access h5 image files with
get_images_by_sequence – Boolean indicating whether an index refers to an individual image (False) or to an entire sequence (True). If True, each returned item is a list of images.
load_data_into_memory – Which data should be loaded entirely into memory. Options are False (keep nothing in memory; the default), “track” (only track data), “images” (only images), or “all_data” (both track and images).
ignore_list – a list of filenames (not paths) to ignore and NOT add to the dataset
filter_func – a function used to filter images out of the dataset. It should accept a DigitalTyphoonImage object and return True if the image should be included in the dataset, False otherwise
transform_func – this function will be called on the image array for each image when reading in the dataset. It should take and return a numpy image array
transform – Pytorch transform func. Will be called on the tuple of (image/sequence, label array). It should take in said tuple, and return a tuple of (transformed image/sequence, transformed label)
verbose – Print verbose program information
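The filter_func contract described above can be sketched without loading the dataset. The example below is only an illustration: the thresholds are arbitrary, and the stand-in objects are hypothetical substitutes that expose just the year() and grade() accessors documented for DigitalTyphoonImage.

```python
from types import SimpleNamespace


def keep_strong_recent(image) -> bool:
    """filter_func sketch: keep images from 2000 onward with grade >= 3.

    Both thresholds are arbitrary choices for illustration.
    """
    return image.year() >= 2000 and image.grade() >= 3


def fake_image(year: int, grade: int):
    # Hypothetical stand-in for DigitalTyphoonImage, exposing only the
    # year() and grade() accessors used by the filter.
    return SimpleNamespace(year=lambda: year, grade=lambda: grade)


candidates = [fake_image(1995, 5), fake_image(2010, 2), fake_image(2015, 4)]
kept_years = [img.year() for img in candidates if keep_strong_recent(img)]
print(kept_years)
```

A function of this shape would be passed as filter_func when constructing the dataset; the dataset then calls it once per image and keeps only those for which it returns True.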
- set_label(label_strs) None
Sets which label(s) to retrieve when accessing the dataset via dataset[idx] or dataset.__getitem__(idx). Options are: season, month, day, hour, grade, lat, lng, pressure, wind, dir50, long50, short50, dir30, long30, short30, landfall, interpolated
- Parameters
label_strs – a single string (e.g. ‘grade’) or a list/tuple of strings (e.g. [‘lat’, ‘lng’]) of labels.
- Returns
None
- random_split(lengths: Sequence[Union[int, float]], split_by=None, generator: Optional[Generator] = <torch._C.Generator object>) List[Subset]
Randomly split a dataset into non-overlapping new datasets of given lengths.
Given a list of proportions or item counts, returns a random split of the dataset with proportions as close as possible to those requested without causing leakage across the requested split unit. If the split is by image, the built-in PyTorch function is used. If the split is by season, all images from typhoons starting in the same season are placed in the same bucket. If the split is by sequence, all images from the same typhoon are kept together.
Returns a list of Subsets of indices according to the requested lengths. If the split is by anything other than image, indices within a split unit are not randomized (i.e. the indices of a sequence are kept contiguous, not shuffled together with those of other sequences).
If “get_images_by_sequence” is set to True on initialization, split_by image and sequence are functionally identical, and will split the number of sequences into the requested bucket sizes. If split_by=’season’, then sequences with the same season will be placed in the same bucket.
Only non-empty sequences are returned in the split.
For Subset doc see https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset.
- Parameters
lengths – lengths or fractions of splits to be produced
generator – Generator used for the random permutation.
split_by – What to treat as an atomic unit (image, seq_str, season). Options are “image”, “sequence” or “season” respectively
- Returns
List[Subset[idx]]
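The leakage-free behaviour described for sequence and season splits can be illustrated with a small standalone sketch. This is not the library's implementation: grouped_split, its greedy bucket-filling rule, and the example groups are all assumptions made for illustration. It shuffles whole groups, then fills buckets in order, so a group never straddles two buckets and its indices stay contiguous.

```python
import random
from typing import Dict, List, Sequence


def grouped_split(groups: Dict[str, List[int]],
                  fractions: Sequence[float],
                  seed: int = 0) -> List[List[int]]:
    """Split indices into buckets without dividing any group (hypothetical helper)."""
    rng = random.Random(seed)
    keys = list(groups)
    rng.shuffle(keys)  # randomize the order of groups, not of images
    total = sum(len(members) for members in groups.values())
    buckets: List[List[int]] = [[] for _ in fractions]
    current = 0
    for key in keys:
        # Move on once the current bucket has reached its target size;
        # the last bucket absorbs whatever remains.
        while (current < len(fractions) - 1
               and len(buckets[current]) >= fractions[current] * total):
            current += 1
        buckets[current].extend(groups[key])  # group stays contiguous
    return buckets


# Hypothetical sequence IDs mapped to their dataset indices
example_groups = {"A": [0, 1, 2], "B": [3, 4], "C": [5, 6, 7, 8], "D": [9]}
train, test = grouped_split(example_groups, [0.8, 0.2], seed=1)
print(len(train), len(test))
```

Because groups are indivisible, the resulting bucket sizes only approximate the requested fractions, which matches the "as close as possible" wording above.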
- images_from_season(season: int) Subset
Given a start season, return a Subset (Dataset) object containing all the images from that season, in order
- Parameters
season – the start season, as an int
- Returns
Subset
- image_objects_from_season(season: int) List
Given a start season, return a list of DigitalTyphoonImage objects for images from that season
- Parameters
season – the start season, as an int
- Returns
List[DigitalTyphoonImage]
- images_from_seasons(seasons: List[int])
Given a list of seasons, returns a dataset Subset containing all images from those seasons, in order
- Parameters
seasons – List of season integers
- Returns
Subset
- images_from_sequence(sequence_str: str) Subset
Given a sequence ID, returns a Subset of the dataset of the images in that sequence
- Parameters
sequence_str – str, the sequence ID
- Returns
Subset of the total dataset
- image_objects_from_sequence(sequence_str: str) List
Given a sequence ID, returns a list of the DigitalTyphoonImage objects in the sequence in chronological order.
- Parameters
sequence_str – str, the sequence ID
- Returns
List[DigitalTyphoonImage]
- images_from_sequences(sequence_strs: List[str]) Subset
Given a list of sequence IDs, returns a dataset Subset containing all the images within the sequences, in order
- Parameters
sequence_strs – List[str], the sequence IDs
- Returns
Subset of the total dataset
- images_as_tensor(indices: List[int]) Tensor
Given a list of dataset indices, returns the images as a Torch Tensor
- Parameters
indices – List[int]
- Returns
torch Tensor
- labels_as_tensor(indices: List[int], label: str) Tensor
Given a list of dataset indices, returns the specified labels as a Torch Tensor
- Parameters
indices – List[int]
label – str, denoting which label to retrieve
- Returns
torch Tensor
- get_number_of_sequences()
Gets number of sequences (typhoons) in the dataset
- Returns
integer number of sequences
- get_number_of_nonempty_sequences()
Gets number of sequences (typhoons) in the dataset that have at least 1 image
- Returns
integer number of sequences
- get_sequence_ids() List[str]
Returns a list of the sequence IDs in the dataset, as strings
- Returns
List[str]
- get_seasons() List[int]
Returns a list of the seasons that typhoons have started in, in chronological order
- Returns
List[int]
- get_nonempty_seasons() List[int]
Returns a list of the seasons that typhoons have started in, that have at least one image, in chronological order
- Returns
List[int]
- sequence_exists(seq_str: str) bool
Returns whether a sequence with the given sequence ID exists in the dataset
- Parameters
seq_str – string of the sequence ID
- Returns
Boolean True if present, False otherwise
- get_ith_sequence(idx: int) DigitalTyphoonSequence
Given an index idx, returns the idx’th sequence in the dataset
- Parameters
idx – int index
- Returns
DigitalTyphoonSequence
- process_metadata_file(filepath: str)
Reads the JSON metadata file and processes its information into the dataset.
- Parameters
filepath – path to metadata file
- Returns
metadata JSON object
- get_seq_ids_from_season(season: int) List[str]
Given a start season, give the sequence ID strings of all sequences that start in that season.
- Parameters
season – the start season, as an int
- Returns
a list of the sequence IDs starting in that season
- total_image_idx_to_sequence_idx(total_idx: int) int
Given a total dataset image index, returns that image’s index in its respective sequence. e.g. an image that is the 500th in the total dataset may be the 5th image in its sequence.
- Parameters
total_idx – the total dataset image index
- Returns
the inner-sequence image index.
- seq_idx_to_total_image_idx(seq_str: str, seq_idx: int) int
Given an image with seq_idx position within its sequence, return its total idx within the greater dataset. e.g. an image that is the 5th image in the sequence may be the 500th in the total dataset.
- Parameters
seq_str – The sequence ID string to search within
seq_idx – int, the index within the given sequence
- Returns
int, the total index within the dataset
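The two index conversions above are inverses of each other. A minimal sketch of the idea follows, assuming (this is not confirmed by the documentation) that total indices are assigned by laying sequences end to end in dataset order. The sequence IDs, lengths, and function names are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Tuple

seq_ids = ["197830", "197901", "197902"]  # hypothetical sequence IDs
seq_lengths = [3, 2, 4]                   # images per sequence
# First total index of each sequence: [0, 3, 5]
starts = [0] + list(accumulate(seq_lengths))[:-1]


def total_to_seq(total_idx: int) -> Tuple[str, int]:
    """Map a total dataset index to (sequence ID, index within the sequence)."""
    pos = bisect_right(starts, total_idx) - 1
    return seq_ids[pos], total_idx - starts[pos]


def seq_to_total(seq_id: str, seq_idx: int) -> int:
    """Map (sequence ID, index within the sequence) to a total dataset index."""
    return starts[seq_ids.index(seq_id)] + seq_idx


print(total_to_seq(4))            # image 4 overall falls in the second sequence
print(seq_to_total("197902", 0))  # first image of the third sequence
```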
- seq_indices_to_total_indices(seq_obj: DigitalTyphoonSequence) List[int]
Given a sequence, return a list of the total dataset indices of the sequence’s images.
- Parameters
seq_obj – the DigitalTyphoonSequence object to produce the list from
- Returns
the List of total dataset indices
- get_image_from_idx(idx) DigitalTyphoonImage
Given a dataset image idx, returns the image object from that index.
- Parameters
idx – int, the total dataset image idx
- Returns
DigitalTyphoonImage object for that image
pyphoon2.DigitalTyphoonImage module
- class pyphoon2.DigitalTyphoonImage.DigitalTyphoonImage(image_filepath: str, track_entry: ndarray, sequence_id=None, load_imgs_into_mem=False, transform_func=None, spectrum='Infrared')
Bases: object
- __init__(image_filepath: str, track_entry: ndarray, sequence_id=None, load_imgs_into_mem=False, transform_func=None, spectrum='Infrared')
Class for one image with metadata for the DigitalTyphoonDataset
Does NOT check for file existence until accessing the image.
- Parameters
image_filepath – str, path to image file
track_entry – np.ndarray, 1d numpy array for the track csv entry corresponding to the image
load_imgs_into_mem – bool, flag indicating whether images should be loaded into memory
spectrum – str, default spectrum to read the image in
transform_func – this function will be called on the image array when the array is accessed (or read into memory). It should take and return a numpy image array
- image(spectrum=None) ndarray
Returns the image as a numpy array. If load_imgs_into_mem was set to true, it will cache the image
- Parameters
spectrum – spectrum (channel) the image was taken in
- Returns
np.ndarray, the image
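The lazy-read and caching behaviour described for image() can be sketched with a stand-in class. Everything here is hypothetical (LazyImage, fake_loader, the file names); it only illustrates the pattern of deferring the file read until first access and optionally keeping the result in memory.

```python
from typing import Callable, List, Optional


class LazyImage:
    """Hypothetical stand-in showing deferred loading with an optional cache."""

    def __init__(self, filepath: str, loader: Callable[[str], list],
                 cache: bool = False):
        self.filepath = filepath          # not opened or checked here
        self._loader = loader
        self._cache_enabled = cache
        self._cached: Optional[list] = None

    def image(self) -> list:
        if self._cached is not None:
            return self._cached           # served from memory
        data = self._loader(self.filepath)  # file is touched only now
        if self._cache_enabled:
            self._cached = data
        return data


calls: List[str] = []


def fake_loader(path: str) -> list:
    # Records each read so the number of loads can be inspected
    calls.append(path)
    return [[0, 1], [2, 3]]


cached = LazyImage("a.h5", fake_loader, cache=True)
cached.image()
cached.image()      # second access hits the cache
uncached = LazyImage("b.h5", fake_loader, cache=False)
uncached.image()
uncached.image()    # second access reads again
print(calls)
```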
- sequence_id() str
Returns the sequence ID this image belongs to
- Returns
str, the sequence ID
- track_array() ndarray
Returns the csv row for this image
- Returns
np.ndarray containing the track data
- value_from_string(label)
Returns the image’s value given the label as a string. e.g. value_from_string(‘month’) -> the month
- Returns
the element
- year() int
Returns the year the image was taken
- Returns
int, the year
- month() int
Returns the month the image was taken
- Returns
int, the month (1-12)
- day() int
Returns the day the image was taken
- Returns
int, the day (1-31)
- hour() int
Returns the hour the image was taken
- Returns
int, the hour
- datetime() datetime
Returns a datetime object of when the image was taken
- Returns
datetime
- grade() int
Returns the grade of the typhoon in the image
- Returns
int, the grade
- lat() float
Returns the latitude of the image
- Returns
float
- long() float
Returns the longitude of the image
- Returns
float
- pressure() float
Returns the pressure. TODO: units are not confirmed (probably hg).
- Returns
float
- wind() float
Returns the wind speed. TODO: units are not specified.
- Returns
float
- dir50() float
Returns the dir50 value from the track entry. TODO: document what this field represents.
- Returns
float
- long50() float
Returns the long50 value from the track entry. TODO: document what this field represents.
- Returns
float
- short50() float
Returns the short50 value from the track entry. TODO: document what this field represents.
- Returns
float
- dir30() float
Returns the dir30 value from the track entry. TODO: document what this field represents.
- Returns
float
- long30() float
Returns the long30 value from the track entry. TODO: document what this field represents.
- Returns
float
- short30() float
Returns the short30 value from the track entry. TODO: document what this field represents.
- Returns
float
- landfall() float
Returns the landfall value from the track entry. TODO: document what this field represents.
- Returns
float
- interpolated() bool
Returns whether this entry is interpolated or not
- Returns
bool
- filepath() str
Returns the filepath to the image
- Returns
str
- mask_1() float
Returns number of pixels in the image that are corrupted
- Returns
float, the number of pixels
- mask_1_percent() float
Returns percentage of pixels in the image that are corrupted
- Returns
float, the percentage of pixels
- set_track_data(track_entry: ndarray) None
Sets the track entry
- Parameters
track_entry – numpy array representing one entry of the track csv
- Returns
None
- set_image_data(image_filepath: str, load_imgs_into_mem=False, spectrum=None) None
Sets the image file data
- Parameters
load_imgs_into_mem – bool, whether to load images into memory
spectrum – str, spectrum to open h5 images with
image_filepath – string to image
- Returns
None
pyphoon2.DigitalTyphoonSequence module
- class pyphoon2.DigitalTyphoonSequence.DigitalTyphoonSequence(seq_str: str, start_season: int, num_images: int, transform_func=None, spectrum='Infrared', verbose=False)
Bases: object
- __init__(seq_str: str, start_season: int, num_images: int, transform_func=None, spectrum='Infrared', verbose=False)
Class representing one typhoon sequence from the DigitalTyphoon dataset
- Parameters
seq_str – str, sequence ID as a string
start_season – int, the season in which the typhoon starts
num_images – int, number of images in the sequence
transform_func – this function will be called on each image before saving it/returning it. It should take and return a np array
- get_sequence_str() str
Returns the sequence ID as a string
- Returns
string sequence ID
- process_seq_img_dir_into_sequence(directory_path: str, load_imgs_into_mem=False, ignore_list=None, spectrum=None, filter_func=<function DigitalTyphoonSequence.<lambda>>) None
Given a path to a directory containing images of a typhoon sequence, process the images into the current sequence object. If ‘load_imgs_into_mem’ is set to True, the images will be read as numpy arrays and stored in memory. Spectrum refers to what light spectrum the image lies in.
- Parameters
directory_path – Path to the typhoon sequence directory
load_imgs_into_mem – Bool representing if images should be loaded into memory
ignore_list – list of image filenames to ignore
spectrum – string representing what spectrum the image lies in
filter_func – function that accepts an image and returns True if it should be included in the sequence, False otherwise
- Returns
None
- get_start_season() int
Get the start season of the sequence
- Returns
int, the start season
- get_num_images() int
Gets the number of images in the sequence
- Returns
int
- get_num_original_images() int
Get the number of images in the sequence
- Returns
int, the number of images
- has_images() bool
Returns true if the sequence currently holds images (or image filepaths). False otherwise.
- Returns
bool
- process_track_data(track_filepath: str, csv_delimiter=',') None
Takes in the track data for the sequence and processes it into the images for the sequence.
- Parameters
track_filepath – str, path to track csv
csv_delimiter – delimiter for the csv file
- Returns
None
- add_track_data(filename: str, csv_delimiter=',') None
Reads and adds the track data to the sequence.
- Parameters
filename – str, path to the track data
csv_delimiter – char, delimiter to use to read the csv
- Returns
None
- set_track_path(track_path: str) None
Sets the path to the track data file
- Parameters
track_path – str, filepath to the track data
- Returns
None
- get_track_path() str
Gets the path to the track data file
- Returns
str, the path to the track data file
- get_track_data() ndarray
Returns the track csv data as a numpy array, with each row corresponding to a row in the CSV.
- Returns
np.ndarray
- get_image_at_idx(idx: int, spectrum='Infrared') DigitalTyphoonImage
Returns the idx’th DigitalTyphoonImage in the sequence. Raises an exception if the idx is outside of the sequence’s range.
- Parameters
idx – int, idx to access
spectrum – str, spectrum of the image
- Returns
DigitalTyphoonImage, the image object
- get_image_at_idx_as_numpy(idx: int, spectrum=None) ndarray
Gets the idx’th image in the sequence as a numpy array. Raises an exception if the idx is outside of the sequence’s range.
- Parameters
idx – int, idx to access
spectrum – str, spectrum of the image
- Returns
np.ndarray, image as a numpy array with shape of the image dimensions
- get_all_images_in_sequence() List[DigitalTyphoonImage]
Returns all of the image objects (DigitalTyphoonImage) in the sequence in order.
- Returns
List[DigitalTyphoonImage]
- return_all_images_in_sequence_as_np(spectrum=None) ndarray
Returns all the images in a sequence as a numpy array of shape (num_images, image_shape[0], image_shape[1])
- Parameters
spectrum – str, spectrum of the image
- Returns
np.ndarray of shape (num_images, image_shape[0], image_shape[1])
- num_images_match_num_expected() bool
Returns True if the number of image filepaths stored matches the number of images stated when initializing the sequence object. False otherwise.
- Returns
bool
- get_image_filepaths() List[str]
Returns a list of the filenames of the images (without the root path)
- Returns
List[str], list of the filenames
- set_images_root_path(images_root_path: str) None
Sets the root path of the images.
- Parameters
images_root_path – str, the root path
- Returns
None
- get_images_root_path() str
Gets the root path to the image directory
- Returns
str, the root path
pyphoon2.DigitalTyphoonUtils module
- class pyphoon2.DigitalTyphoonUtils.SPLIT_UNIT(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: Enum
Enum denoting which unit to treat as atomic when splitting the dataset
- SEQUENCE = 'sequence'
- SEASON = 'season'
- IMAGE = 'image'
- classmethod has_value(value)
Returns true if value is present in the enum
- Parameters
value – str, the value to check for
- Returns
bool
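A has_value classmethod of this kind is a common way to validate a user-supplied string against an Enum's values. The sketch below mirrors the documented members in a local class named SplitUnit (a hypothetical name, not the library's class), and the method body is an assumption about how such a check could be written.

```python
from enum import Enum


class SplitUnit(Enum):
    """Local mirror of the documented SPLIT_UNIT members, for illustration."""
    SEQUENCE = 'sequence'
    SEASON = 'season'
    IMAGE = 'image'

    @classmethod
    def has_value(cls, value) -> bool:
        # True if any member carries exactly this value
        return any(value == member.value for member in cls)


print(SplitUnit.has_value('season'))
print(SplitUnit.has_value('year'))
```

Checking membership this way lets a constructor reject an unsupported option such as 'year' before any data is read.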
- class pyphoon2.DigitalTyphoonUtils.LOAD_DATA(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: Enum
Enum denoting what level of data should be stored in memory
- NO_DATA = False
- ONLY_TRACK = 'track'
- ONLY_IMG = 'images'
- ALL_DATA = 'all_data'
- classmethod has_value(value)
- class pyphoon2.DigitalTyphoonUtils.TRACK_COLS(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)
Bases: Enum
Enum containing the column index at which each field is found in a row of a track csv
- YEAR = 0
- MONTH = 1
- DAY = 2
- HOUR = 3
- GRADE = 4
- LAT = 5
- LNG = 6
- PRESSURE = 7
- WIND = 8
- DIR50 = 9
- LONG50 = 10
- SHORT50 = 11
- DIR30 = 12
- LONG30 = 13
- SHORT30 = 14
- LANDFALL = 15
- INTERPOLATED = 16
- FILENAME = 17
- MASK_1 = 18
- MASK_1_PERCENT = 19
- classmethod str_to_value(name)
- classmethod has_value(value)
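The column indices above make it possible to pull named labels out of a raw track row. The following sketch mirrors a subset of the documented members in a local enum named TrackCols (hypothetical); the row values are invented, and the name-based lookup is an assumption, since the behaviour of str_to_value is not documented here.

```python
from enum import Enum
from typing import List, Sequence


class TrackCols(Enum):
    """Local mirror of the first few documented TRACK_COLS members."""
    YEAR = 0
    MONTH = 1
    DAY = 2
    HOUR = 3
    GRADE = 4
    LAT = 5
    LNG = 6
    PRESSURE = 7
    WIND = 8


def labels_from_row(row: Sequence, labels: Sequence[str]) -> List:
    """Look up each label's column by member name and read it from the row."""
    return [row[TrackCols[label.upper()].value] for label in labels]


# Invented track row, ordered as in the enum above
row = [2019, 10, 12, 6, 5, 33.5, 138.2, 945.0, 85.0]
print(labels_from_row(row, ['lat', 'lng']))
print(labels_from_row(row, ['grade']))
```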
- pyphoon2.DigitalTyphoonUtils.parse_image_filename(filename: str, separator='-') -> (str, datetime, str)
Takes the filename of a Digital Typhoon image and parses it to return the sequence ID it belongs to, the date it was taken, and the satellite that took the image
- Parameters
filename – str, filename of the image
separator – char, separator used in the filename
- Returns
(str, datetime, str), Tuple containing the sequence ID, the datetime, and satellite string
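As a rough illustration of this kind of parsing, the sketch below splits a filename on the separator and builds the returned tuple. The filename layout is an assumption, not something stated in this reference: it supposes a date stamp of the form YYYYMMDDHH, then the sequence ID, then the satellite name. The function name parse_filename_sketch and the sample filename are hypothetical.

```python
from datetime import datetime
from typing import Tuple


def parse_filename_sketch(filename: str,
                          separator: str = "-") -> Tuple[str, datetime, str]:
    """Parse an image filename under an ASSUMED field layout."""
    # ASSUMED layout: <YYYYMMDDHH><sep><sequence ID><sep><satellite>[<sep>...].<ext>
    stem = filename.rsplit(".", 1)[0]          # drop the file extension
    date_part, sequence_id, satellite = stem.split(separator)[:3]
    taken_at = datetime.strptime(date_part, "%Y%m%d%H")
    return sequence_id, taken_at, satellite


parsed = parse_filename_sketch("2019101206-201919-HMW8-1.h5")
print(parsed)
```

If the real filenames order their fields differently, only the unpacking line would need to change; the returned tuple keeps the documented (sequence ID, datetime, satellite) order.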
- pyphoon2.DigitalTyphoonUtils.get_seq_str_from_track_filename(filename: str) str
Given a track filename, returns the sequence ID it belongs to
- Parameters
filename – str, the filename
- Returns
str, the sequence ID string
- pyphoon2.DigitalTyphoonUtils.is_image_file(filename: str) bool
Given a DigitalTyphoon file, returns if it is an h5 image.
- Parameters
filename – str, the filename
- Returns
bool, True if it is an h5 image, False otherwise