vulcanai.datasets package

vulcanai.datasets.fashion module

class vulcanai.datasets.fashion.FashionData(root, train=True, transform=None, target_transform=None, download=False)

Bases: torch.utils.data.dataset.Dataset

`Fashion-MNIST <https://github.com/zalandoresearch/fashion-mnist>`_ Dataset.

Parameters:
root (string): Root directory of dataset where processed/training.pt
and processed/test.pt exist.
train (bool, optional): If True, creates dataset from training.pt,
otherwise from test.pt.
download (bool, optional): If True, downloads the dataset from the internet and
puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
transform (callable, optional): A function/transform that takes in a PIL image
and returns a transformed version, e.g., transforms.RandomCrop.
target_transform (callable, optional): A function/transform that takes in the
target and transforms it.
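Example (a minimal usage sketch; torchvision's ToTensor transform and the './data' root directory are illustrative assumptions, not part of this API):
>>> from torch.utils.data import DataLoader
>>> from torchvision import transforms
>>> from vulcanai.datasets.fashion import FashionData
>>> train_data = FashionData(root='./data', train=True,
                             transform=transforms.ToTensor(),
                             download=True)
>>> loader = DataLoader(train_data, batch_size=64, shuffle=True)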
__init__(root, train=True, transform=None, target_transform=None, download=False)

Initialize self. See help(type(self)) for accurate signature.

download()

Download the MNIST data if it doesn’t exist in processed_folder already.

processed_folder = 'processed'
raw_folder = 'raw'
test_file = 'test.pt'
training_file = 'training.pt'
urls = [
    'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz',
    'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz',
    'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz',
    'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz',
]
vulcanai.datasets.fashion.get_int(b)
vulcanai.datasets.fashion.parse_byte(b)
vulcanai.datasets.fashion.read_image_file(path)
vulcanai.datasets.fashion.read_label_file(path)

vulcanai.datasets.multidataset module

Defines the MultiDataset class.

class vulcanai.datasets.multidataset.MultiDataset(dataset_tuples)

Bases: torch.utils.data.dataset.Dataset

Define a dataset for multi-input networks.

Takes in a list of dataset tuples specifying, for each dataset, whether its input data and target data should be included in the output.

Parameters:
dataset_tuples : list of tuples
Each tuple has the form (Dataset, use_data_boolean, use_target_boolean): the Dataset in the zero index, a boolean of whether to include its input data in the first index, and a boolean of whether to include its target data in the second index. You can only specify one target at a time across all incoming datasets.
Returns:
multi_dataset : torch.utils.data.Dataset
__init__(dataset_tuples)

Initialize a dataset for multi-input networks.
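Example (a minimal sketch using two toy TensorDatasets of equal length; the data is purely illustrative, and only one dataset may expose its target):
>>> import torch
>>> from torch.utils.data import TensorDataset
>>> from vulcanai.datasets.multidataset import MultiDataset
>>> ds_a = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
>>> ds_b = TensorDataset(torch.randn(100, 5), torch.randint(0, 2, (100,)))
>>> multi = MultiDataset([(ds_a, True, False),
                          (ds_b, True, True)])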

vulcanai.datasets.tabulardataset module

This file defines the TabularDataset class.

class vulcanai.datasets.tabulardataset.TabularDataset(label_column=None, merge_on_columns=None, index_list=None, na_values=None, **dataset_dict)

Bases: torch.utils.data.dataset.Dataset

This defines a dataset, subclassed from torch.utils.data.Dataset.

It uses a pd.DataFrame as the backend and provides utility functions.

Parameters:
label_column: String
The name of the label column. Provide None if you do not want a target.
merge_on_columns: list of strings
Key(s) that specify which columns to use to uniquely stitch the datasets (default None)
index_list: list of strings
List of columns to make the index of the dataframe

na_values:
The values to convert to NaN when reading from csv
dataset_dict: keyword parameter, value is dataframe or path string
pandas dataframe (or path string) assigned to a keyword argument; together these produce a dictionary variable.
Example:
>>> df_test_one = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                                'B': ['B0', 'B1', 'B2', 'B3'],
                                'C': ['C0', 'C1', 'C2', 'C3'],
                                'D': ['D0', 'D1', 'D2', 'D3']},
                               index=[0, 1, 2, 3])
>>> df_test_two = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                                'B': ['B4', 'B5', 'B6', 'B7'],
                                'C': ['C4', 'C5', 'C6', 'C7'],
                                'D': ['D4', 'D5', 'D6', 'D7']},
                               index=[4, 5, 6, 7])
>>> tab_dataset_var = TabularDataset(merge_on_columns=['A'],
                    index_list=['A'], df1=df_test_one, df2=df_test_two)
__init__(label_column=None, merge_on_columns=None, index_list=None, na_values=None, **dataset_dict)

Creates an instance of TabularDataset.

create_label_encoding(column_name, ordered_values)

Create label encoding for the given column

Parameters:
column_name: String
The name of the column you want encoded
ordered_values: List or Dict
Either an ordered list of possible column values, or a mapping of column value to label value. Must include all possible values.
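Example (a hedged sketch; 'size' is a hypothetical ordinal column on the tab_dataset_var instance from the class example above):
>>> # Ordered-list form: labels are assigned by position in the list
>>> tab_dataset_var.create_label_encoding('size', ['small', 'medium', 'large'])
>>> # Equivalent explicit-mapping form: column value -> label value
>>> tab_dataset_var.create_label_encoding('size', {'small': 0, 'medium': 1, 'large': 2})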
create_one_hot_encoding(column_name, prefix_sep='@')

Create one-hot encoding for the given column.

Parameters:
column_name: String
The name of the column you want to one-hot encode
prefix_sep: String, default “@”
The prefix separator used when creating a one-hot encoding
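Example (a hedged sketch; 'color' is a hypothetical categorical column; with the default prefix_sep the new columns take the form color@red, color@blue, and so on):
>>> tab_dataset_var.create_one_hot_encoding('color')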
delete_column(column_name)

Deletes the given column.

Parameters:
column_name: String
The name of the column you want deleted
identify_highly_correlated(threshold)

Identify columns that are highly correlated with one another.

Parameters:
threshold: Float
Between 0 (weakest correlation) and 1 (strongest correlation). The minimum correlation necessary for a pair of columns to be identified.
Returns:
column_list: List of tuples
The pairs of columns whose correlation values are above the threshold
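Example (a hedged sketch; the threshold value is illustrative, and the same calling pattern applies to the other identify_* methods below):
>>> correlated_pairs = tab_dataset_var.identify_highly_correlated(threshold=0.95)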
identify_low_variance(threshold)

Identify those columns that have low variance

Parameters:
threshold: Float
Between 0 and 1. The maximum amount of variance a column may have to be identified as low variance.
Returns:
variance_dict: Dict
A dictionary of column names, with the value being their variance.
identify_sufficient_non_null(threshold)

Return columns where at least the threshold proportion of values are non-null.

Parameters:
threshold: Float
A number between 0 and 1, representing the required proportion of non-null values.
Returns:
cols: List
A list of those columns with at least the threshold proportion of non-null values
identify_unbalanced_columns(threshold, non_numeric=True)

This returns columns that are highly unbalanced, that is, those that have a disproportionate amount of one value.

Parameters:
threshold: Float
The proportion needed to be considered unbalanced, between 0 and 1, where values closer to 0 correspond to a lesser proportion of the single dominant value (less imbalanced)
non_numeric: Boolean
Whether non-numeric columns are also considered.
Returns:
column_list: List
The list of column names
identify_unique(threshold)

Returns columns that have at least the threshold number of unique values.

Parameters:
threshold: Int
The minimum number of unique values needed. Must be greater than 1; this is a count of values, not a proportion between 0 and 1.
Returns:
column_list: list
The list of columns having at least the threshold number of unique values
list_all_column_values(column_name)

Return a list of all values in this column.

Parameters:
column_name: String
Name of the column
list_all_features()

Lists all features (columns)

merge_dataframe(merge_on_columns=None, index_list=None, na_values=None, **dataset_dict)

Merges additional data into a TabularDataset instance.

Parameters:
merge_on_columns: list of strings
Key(s) that specify which columns to use to uniquely stitch the datasets (default None)
index_list: list of strings
List of columns to make the index of the dataframe

na_values:
The values to convert to NaN when reading from csv
dataset_dict: keyword parameter, value is dataframe or path string
pandas dataframe (or path string) assigned to a keyword argument; together these produce a dictionary variable.
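Example (a hedged sketch; df_extra and the keyword name df3 are illustrative, extending the constructor example above):
>>> df_extra = pd.DataFrame({'A': ['A0', 'A1'], 'E': ['E0', 'E1']})
>>> tab_dataset_var.merge_dataframe(merge_on_columns=['A'], df3=df_extra)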
print_column_data_types()

Prints the data types of all columns.

replace_value_in_column(column_name, current_values, target_values)

Replaces one or more values in the given column.

Parameters:

column_name: String
The name of the column
current_values: List
Must be existing values
target_values: List
Must be valid for a pandas dataframe
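Example (a hedged sketch; 'color' and its values are hypothetical, and current and target values are assumed to pair up positionally):
>>> tab_dataset_var.replace_value_in_column('color', ['red', 'blue'],
                                            ['crimson', 'navy'])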
reverse_create_one_hot_encoding(column_name=None, prefix_sep='@')

Undo the creation of one-hot encodings, provided prefix_sep was used to create the one-hot encodings and appears nowhere else in the column names. If a column_name is provided, only that column will be reverse-encoded; otherwise, all will be reverse-encoded.

Parameters:
column_name: String
The name of the column to reverse the one-hot encoding for
prefix_sep: String, default “@”
The prefix separator used when creating the one-hot encoding
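Example (a hedged sketch, continuing the hypothetical 'color' column one-hot encoded above):
>>> tab_dataset_var.reverse_create_one_hot_encoding(column_name='color')
>>> tab_dataset_var.reverse_create_one_hot_encoding()  # reverse all one-hot encodings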
save_dataframe(file_path)

Save the dataframe to a file.

Parameters:
file_path: String
Path to the file where you want your dataframe to be saved
set_global_random_seed(seed_value)

Initializes the random state using the seed_value

Parameters:
seed_value: Int
The seed value
split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)

Create train-test(-validation) splits from the instance’s examples. The function signature is borrowed from torchtext in an effort to maintain consistency (https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py), and the implementation is partially adapted from https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test

Parameters:
split_ratio: Float or List of Floats
A number in [0, 1] denoting the amount of data to be used for the training split (the rest is used for validation), or a list of numbers denoting the relative sizes of the train, test, and validation splits respectively. If the relative size for validation is missing, only the train-test split is returned. Default is 0.7 (for the train set).
stratified: Boolean
whether the sampling should be stratified.
Default is False.
strata_field: String
name of the examples Field stratified over. Default is ‘label’ for the conventional label field.
random_state: int
The random seed used for shuffling.
Returns:
datasets: Tuple of TabularDatasets
Datasets for train, validation, and test splits in that order, if the splits are provided.
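Example (a hedged sketch; the seed and ratios are illustrative, and the ordering of the returned datasets follows the description above):
>>> train, test = tab_dataset_var.split(split_ratio=0.7, random_state=42)
>>> train, valid, test = tab_dataset_var.split(split_ratio=[0.7, 0.15, 0.15],
                                               random_state=42)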

vulcanai.datasets.utils module

This file contains utility methods that may be useful to several dataset classes. check_split_ratio, stratify, rationed_split, and RandomShuffler were all copied from torchtext because torchtext is not yet packaged for anaconda and is therefore not yet a reasonable dependency. See https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py

vulcanai.datasets.utils.check_split_ratio(split_ratio)

Check that the split ratio argument is not malformed

Parameters:

split_ratio: desired split ratio, either a list of length 2 or 3
depending on whether the validation set is desired.
Returns:
split ratio as tuple
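Example (a hedged sketch; the exact normalized form of the returned tuple is an assumption and is not shown):
>>> from vulcanai.datasets.utils import check_split_ratio
>>> ratios = check_split_ratio([0.7, 0.2, 0.1])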
vulcanai.datasets.utils.clean_dataframe(df)

Goes through the dataframe and ensures that all nonsensical values are encoded as NaN.

Parameters:
df: The dataframe to clean.

vulcanai.datasets.utils.stitch_datasets(df_main=None, merge_on_columns=None, index_list=None, **dataset_dict)

Function to produce a single dataset from multiple.

Parameters:
df_dict : dictionary of dataframes to concatenate
dictionary {key = df name: value = dataframe} of dataframes to stitch together.
merge_on_columns : list of strings
Key(s) that specify which columns to use to uniquely stitch the datasets (default None)
index_list: list of strings
columns to establish as index for final stitched dataset (default None)
dataset_dict : keyword parameter, value is dataframe
pandas dataframe assigned to keyword argument that produces a dictionary variable.
Returns:
merged_df : dataframe
concatenated dataframe
Example:
>>> df_test_one = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                                'B': ['B0', 'B1', 'B2', 'B3'],
                                'C': ['C0', 'C1', 'C2', 'C3'],
                                'D': ['D0', 'D1', 'D2', 'D3']},
                               index=[0, 1, 2, 3])
>>> df_test_two = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                                'B': ['B4', 'B5', 'B6', 'B7'],
                                'C': ['C4', 'C5', 'C6', 'C7'],
                                'D': ['D4', 'D5', 'D6', 'D7']},
                               index=[4, 5, 6, 7])
>>> df_stitched = stitch_datasets(merge_on_columns=['A'], index_list=['A'], df1=df_test_one, df2=df_test_two)