vulcanai.datasets package

vulcanai.datasets.fashion module

class vulcanai.datasets.fashion.FashionData(root, train=True, transform=None, target_transform=None, download=False)

Bases: torch.utils.data.dataset.Dataset

`Fashion-MNIST <https://github.com/zalandoresearch/fashion-mnist>`_ Dataset.

Parameters:
root (string): Root directory of dataset where processed/training.pt
and processed/test.pt exist.
train (bool, optional): If True, creates dataset from training.pt,
otherwise from test.pt.
download (bool, optional): If True, downloads the dataset from the internet and
puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
transform (callable, optional): A function/transform that takes in a PIL image
and returns a transformed version, e.g., transforms.RandomCrop.
target_transform (callable, optional): A function/transform that takes in the
target and transforms it.
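Example (a minimal usage sketch; torchvision's ToTensor transform and the './data' root directory are illustrative assumptions, not part of this API):
>>> from torch.utils.data import DataLoader
>>> from torchvision import transforms
>>> from vulcanai.datasets.fashion import FashionData
>>> train_data = FashionData(root='./data', train=True,
                             transform=transforms.ToTensor(),
                             download=True)
>>> loader = DataLoader(train_data, batch_size=64, shuffle=True)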
__init__(root, train=True, transform=None, target_transform=None, download=False)

Initialize self. See help(type(self)) for accurate signature.

download()

Download the MNIST data if it doesn’t exist in processed_folder already.

processed_folder = 'processed'
raw_folder = 'raw'
test_file = 'test.pt'
training_file = 'training.pt'
urls = [
    'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
    'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz',
    'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz',
    'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz',
]
vulcanai.datasets.fashion.get_int(b)
vulcanai.datasets.fashion.parse_byte(b)
vulcanai.datasets.fashion.read_image_file(path)
vulcanai.datasets.fashion.read_label_file(path)

vulcanai.datasets.multidataset module

Defines the MultiDataset class.

class vulcanai.datasets.multidataset.MultiDataset(dataset_tuples)

Bases: torch.utils.data.dataset.Dataset

Define a dataset for multi-input networks.

Takes in a list of dataset tuples specifying, for each dataset, whether its input data and target data should be included in the output.

Parameters:
dataset_tuples : list of tuples
Each tuple has the form (Dataset, use_data_boolean, use_target_boolean): the Dataset in the zero index, a boolean of whether to include its input data in the first index, and a boolean of whether to include its target data in the second index. You can only specify one target at a time across all incoming datasets.
Returns:
multi_dataset : torch.utils.data.Dataset
__init__(dataset_tuples)

Initialize a dataset for multi-input networks.
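Example (a minimal sketch using two toy TensorDatasets of equal length; the data is purely illustrative, and only one dataset may expose its target):
>>> import torch
>>> from torch.utils.data import TensorDataset
>>> from vulcanai.datasets.multidataset import MultiDataset
>>> ds_a = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
>>> ds_b = TensorDataset(torch.randn(100, 5), torch.randint(0, 2, (100,)))
>>> multi = MultiDataset([(ds_a, True, False),
                          (ds_b, True, True)])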

vulcanai.datasets.tabulardataset module

This file defines the TabularDataset class.

class vulcanai.datasets.tabulardataset.TabularDataset(label_column=None, merge_on_columns=None, index_list=None, na_values=None, **dataset_dict)

Bases: torch.utils.data.dataset.Dataset

This defines a dataset, subclassed from torch.utils.data.Dataset.

It uses a pd.DataFrame as the backend and provides utility functions.

Parameters:
label_column: String
The name of the label column. Provide None if you do not want a target.
merge_on_columns: list of strings
Key(s) that specify which columns to use to uniquely stitch the datasets (default None)
index_list: list of strings
List of columns to make the index of the dataframe

na_values:
The values to convert to NaN when reading from csv
dataset_dict: keyword parameter, value is dataframe or path string
pandas dataframe (or path string) assigned to a keyword argument; together these produce a dictionary variable.
Example:
>>> df_test_one = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                                'B': ['B0', 'B1', 'B2', 'B3'],
                                'C': ['C0', 'C1', 'C2', 'C3'],
                                'D': ['D0', 'D1', 'D2', 'D3']},
                               index=[0, 1, 2, 3])
>>> df_test_two = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                                'B': ['B4', 'B5', 'B6', 'B7'],
                                'C': ['C4', 'C5', 'C6', 'C7'],
                                'D': ['D4', 'D5', 'D6', 'D7']},
                               index=[4, 5, 6, 7])
>>> tab_dataset_var = TabularDataset(merge_on_columns=['A'],
                    index_list=['A'], df1=df_test_one, df2=df_test_two)
__init__(label_column=None, merge_on_columns=None, index_list=None, na_values=None, **dataset_dict)

Creates an instance of TabularDataset.

create_label_encoding(column_name, ordered_values)

Create label encoding for the given column

Parameters:
column_name: String
The name of the column you want encoded
ordered_values: List or Dict
Either an ordered list of possible column values, or a mapping of column value to label value. Must include all possible values.
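Example (a hedged sketch; 'size' is a hypothetical ordinal column on the tab_dataset_var instance from the class example above):
>>> # Ordered-list form: labels are assigned by position in the list
>>> tab_dataset_var.create_label_encoding('size', ['small', 'medium', 'large'])
>>> # Equivalent explicit-mapping form: column value -> label value
>>> tab_dataset_var.create_label_encoding('size', {'small': 0, 'medium': 1, 'large': 2})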
create_one_hot_encoding(column_name, prefix_sep='@')

Create one-hot encoding for the given column.

Parameters:
column_name: String
The name of the column you want to one-hot encode
prefix_sep: String, default “@”
The prefix separator used when creating a one-hot encoding
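Example (a hedged sketch; 'color' is a hypothetical categorical column; with the default prefix_sep the new columns take the form color@red, color@blue, and so on):
>>> tab_dataset_var.create_one_hot_encoding('color')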
delete_column(column_name)

Deletes the given column.

Parameters:
column_name: String
The name of the column you want deleted
identify_highly_correlated(threshold)

Identify columns that are highly correlated with one another.

Parameters:
threshold: Float
Between 0 (weakest correlation) and 1 (strongest correlation). The minimum correlation necessary for a pair of columns to be identified.
Returns:
column_list: List of tuples
The pairs of columns whose correlation values are above the threshold
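Example (a hedged sketch; the threshold value is illustrative, and the same calling pattern applies to the other identify_* methods below):
>>> correlated_pairs = tab_dataset_var.identify_highly_correlated(threshold=0.95)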
identify_low_variance(threshold)

Identify those columns that have low variance

Parameters:
threshold: Float
Between 0 and 1. The maximum amount of variance a column may have to be identified as low variance.
Returns:
variance_dict: Dict
A dictionary of column names, with the value being their variance.
identify_sufficient_non_null(threshold)

Return columns where at least the threshold proportion of values are non-null.

Parameters:
threshold: Float
A number between 0 and 1, representing the required proportion of non-null values.
Returns:
cols: List
A list of those columns with at least the threshold proportion of non-null values
identify_unbalanced_columns(threshold, non_numeric=True)

This returns columns that are highly unbalanced, that is, those that have a disproportionate amount of one value.

Parameters:
threshold: Float
The proportion needed to be considered unbalanced, between 0 and 1, where values closer to 0 correspond to a lesser proportion of the single dominant value (less imbalanced)
non_numeric: Boolean
Whether non-numeric columns are also considered.
Returns:
column_list: List
The list of column names
identify_unique(threshold)

Returns columns that have at least the threshold number of unique values.

Parameters:
threshold: Int
The minimum number of unique values needed. Must be greater than 1; this is a count of values, not a proportion between 0 and 1.
Returns:
column_list: list
The list of columns having at least the threshold number of unique values
list_all_column_values(column_name)

Return a list of all values in this column.

Parameters:
column_name: String
Name of the column
list_all_features()

Lists all features (columns)

merge_dataframe(merge_on_columns=None, index_list=None, na_values=None, **dataset_dict)

Merges additional data into a TabularDataset instance.

Parameters:
merge_on_columns: list of strings
Key(s) that specify which columns to use to uniquely stitch the datasets (default None)
index_list: list of strings
List of columns to make the index of the dataframe

na_values:
The values to convert to NaN when reading from csv
dataset_dict: keyword parameter, value is dataframe or path string
pandas dataframe (or path string) assigned to a keyword argument; together these produce a dictionary variable.
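Example (a hedged sketch; df_extra and the keyword name df3 are illustrative, extending the constructor example above):
>>> df_extra = pd.DataFrame({'A': ['A0', 'A1'], 'E': ['E0', 'E1']})
>>> tab_dataset_var.merge_dataframe(merge_on_columns=['A'], df3=df_extra)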
print_column_data_types()

Prints the data types of all columns.

replace_value_in_column(column_name, current_values, target_values)

Replaces one or more values in the given column.

Parameters:

column_name: String
The name of the column
current_values: List
Must be existing values
target_values: List
Must be valid for a pandas dataframe
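Example (a hedged sketch; 'color' and its values are hypothetical, and current and target values are assumed to pair up positionally):
>>> tab_dataset_var.replace_value_in_column('color', ['red', 'blue'],
                                            ['crimson', 'navy'])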
reverse_create_one_hot_encoding(column_name=None, prefix_sep='@')

Undo the creation of one-hot encodings, provided prefix_sep was used to create the one-hot encodings and appears nowhere else in the column names. If a column_name is provided, only that column will be reverse-encoded; otherwise, all will be reverse-encoded.

Parameters:
column_name: String
The name of the column to reverse the one-hot encoding for
prefix_sep: String, default “@”
The prefix separator used when creating the one-hot encoding
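Example (a hedged sketch, continuing the hypothetical 'color' column one-hot encoded above):
>>> tab_dataset_var.reverse_create_one_hot_encoding(column_name='color')
>>> tab_dataset_var.reverse_create_one_hot_encoding()  # reverse all one-hot encodings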
save_dataframe(file_path)

Save the dataframe to a file.

Parameters:
file_path: String
Path to the file where you want your dataframe to be saved
set_global_random_seed(seed_value)

Initializes the random state using the seed_value

Parameters:
seed_value: Int
The seed value
split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)

Create train-test(-validation) splits from the instance’s examples. The function signature is borrowed from torchtext in an effort to maintain consistency (https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py), and the implementation is partially adapted from https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test

Parameters:
split_ratio: Float or List of Floats
A number in [0, 1] denoting the amount of data to be used for the training split (the rest is used for validation), or a list of numbers denoting the relative sizes of the train, test, and validation splits respectively. If the relative size for validation is missing, only the train-test split is returned. Default is 0.7 (for the train set).
stratified: Boolean
whether the sampling should be stratified.
Default is False.
strata_field: String
name of the examples Field stratified over. Default is ‘label’ for the conventional label field.
random_state: int
The random seed used for shuffling.
Returns:
datasets: Tuple of TabularDatasets
Datasets for train, validation, and test splits in that order, if the splits are provided.
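Example (a hedged sketch; the seed and ratios are illustrative, and the ordering of the returned datasets follows the description above):
>>> train, test = tab_dataset_var.split(split_ratio=0.7, random_state=42)
>>> train, valid, test = tab_dataset_var.split(split_ratio=[0.7, 0.15, 0.15],
                                               random_state=42)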

vulcanai.datasets.utils module

This file contains utility methods that may be useful to several dataset classes. check_split_ratio, stratify, rationed_split, and RandomShuffler were all copied from torchtext because torchtext is not yet packaged for anaconda and is therefore not yet a reasonable dependency. See https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py

vulcanai.datasets.utils.check_split_ratio(split_ratio)

Check that the split ratio argument is not malformed

Parameters:

split_ratio: desired split ratio, either a list of length 2 or 3
depending on whether the validation set is desired.
Returns:
split ratio as tuple
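Example (a hedged sketch; the exact normalized form of the returned tuple is an assumption and is not shown):
>>> from vulcanai.datasets.utils import check_split_ratio
>>> ratios = check_split_ratio([0.7, 0.2, 0.1])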
vulcanai.datasets.utils.clean_dataframe(df)

Goes through the dataframe and ensures that all nonsensical values are encoded as NaN.

Parameters:
df: The dataframe to clean.

vulcanai.datasets.utils.stitch_datasets(df_main=None, merge_on_columns=None, index_list=None, **dataset_dict)

Function to produce a single dataset from multiple.

Parameters:
df_dict : dictionary of dataframes to concatenate
dictionary {key = df name: value = dataframe} of dataframes to stitch together.
merge_on_columns : list of strings
Key(s) that specify which columns to use to uniquely stitch the datasets (default None)
index_list: list of strings
columns to establish as index for final stitched dataset (default None)
dataset_dict : keyword parameter, value is dataframe
pandas dataframe assigned to keyword argument that produces a dictionary variable.
Returns:
merged_df : dataframe
concatenated dataframe
Example:
>>> df_test_one = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                                'B': ['B0', 'B1', 'B2', 'B3'],
                                'C': ['C0', 'C1', 'C2', 'C3'],
                                'D': ['D0', 'D1', 'D2', 'D3']},
                               index=[0, 1, 2, 3])
>>> df_test_two = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                                'B': ['B4', 'B5', 'B6', 'B7'],
                                'C': ['C4', 'C5', 'C6', 'C7'],
                                'D': ['D4', 'D5', 'D6', 'D7']},
                               index=[4, 5, 6, 7])
>>> df_stitched = stitch_datasets(merge_on_columns=['A'], index_list=['A'], df1=df_test_one, df2=df_test_two)