vulcanai.datasets package¶
vulcanai.datasets.fashion module¶
- class vulcanai.datasets.fashion.FashionData(root, train=True, transform=None, target_transform=None, download=False)¶
Bases: torch.utils.data.dataset.Dataset
`MNIST <http://yann.lecun.com/exdb/mnist/>`_ Dataset.
- Parameters:
- root (string): Root directory of dataset where processed/training.pt and processed/test.pt exist.
- train (bool, optional): If True, creates dataset from training.pt, otherwise from test.pt.
- download (bool, optional): If True, downloads the dataset from the internet and puts it in the root directory. If the dataset is already downloaded, it is not downloaded again.
- transform (callable, optional): A function/transform that takes in a PIL image and returns a transformed version. E.g., transforms.RandomCrop.
- target_transform (callable, optional): A function/transform that takes in the target and transforms it.
- __init__(root, train=True, transform=None, target_transform=None, download=False)¶ Initialize self. See help(type(self)) for accurate signature.
- download()¶ Download the MNIST data if it doesn’t exist in processed_folder already.
- processed_folder = 'processed'¶
- raw_folder = 'raw'¶
- test_file = 'test.pt'¶
- training_file = 'training.pt'¶
- urls = ['http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz', 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz', 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz', 'http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz']¶
- vulcanai.datasets.fashion.get_int(b)¶
- vulcanai.datasets.fashion.parse_byte(b)¶
- vulcanai.datasets.fashion.read_image_file(path)¶
- vulcanai.datasets.fashion.read_label_file(path)¶
vulcanai.datasets.multidataset module¶
Defines the MultiDataset Class
- class vulcanai.datasets.multidataset.MultiDataset(dataset_tuples)¶
Bases: torch.utils.data.dataset.Dataset
Define a dataset for multi-input networks.
Takes in a list of datasets, along with flags for whether each one's input data and target data should be output.
- Parameters:
- dataset_tuples : list of tuples
- Each tuple being (Dataset, use_data_boolean, use_target_boolean): the Dataset in the zero index, a boolean of whether to include the input data in the first index, and a boolean of whether to include the target data in the second index. You can only specify one target at a time across all incoming datasets.
- Returns:
- multi_dataset : torch.utils.data.Dataset
- __init__(dataset_tuples)¶ Initialize a dataset for multi-input networks.
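The tuple-selection behaviour described above can be sketched in plain Python. This is an illustrative mock, not the vulcanai implementation: `MultiDatasetSketch` is a hypothetical stand-in, and plain lists of (input, label) pairs stand in for torch Datasets.

```python
# Hypothetical sketch of the (Dataset, use_data, use_target) selection logic.
# Plain lists of (input, label) pairs stand in for torch Datasets here.

class MultiDatasetSketch:
    def __init__(self, dataset_tuples):
        self.dataset_tuples = dataset_tuples

    def __getitem__(self, idx):
        # Collect inputs from every dataset flagged use_data=True, and the
        # single target from the dataset flagged use_target=True.
        inputs, target = [], None
        for dataset, use_data, use_target in self.dataset_tuples:
            data, label = dataset[idx]
            if use_data:
                inputs.append(data)
            if use_target:
                target = label
        return inputs, target

    def __len__(self):
        return min(len(d) for d, _, _ in self.dataset_tuples)

# Two toy "datasets" of (input, label) pairs
ds_a = [([1.0, 2.0], 0), ([3.0, 4.0], 1)]
ds_b = [([5.0], 0), ([6.0], 1)]

# Inputs come from both datasets; the target comes only from ds_b.
multi = MultiDatasetSketch([(ds_a, True, False), (ds_b, True, True)])
inputs, target = multi[1]
print(inputs, target)  # [[3.0, 4.0], [6.0]] 1
```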
vulcanai.datasets.tabulardataset module¶
This file defines the TabularDataset Class
- class vulcanai.datasets.tabulardataset.TabularDataset(label_column=None, merge_on_columns=None, index_list=None, na_values=None, **dataset_dict)¶
Bases: torch.utils.data.dataset.Dataset
This defines a dataset, subclassed from torch.utils.data.Dataset.
It uses pd.DataFrame as the backend, with utility functions.
- Parameters:
- label_column: String
- The name of the label column. Provide None if you do not want a target.
- merge_on_columns: list of strings
- Key(s) that specify which columns to use to uniquely stitch datasets (default None)
- index_list: list of strings
- List of columns to make the index of the dataframe
- na_values: The values to convert to NaN when reading from csv
- dataset_dict: keyword parameter, value is dataframe or path string
- pandas dataframe assigned to keyword argument that produces a dictionary variable.
Example:
>>> df_test_one = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}, index=[0, 1, 2, 3])
>>> df_test_two = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 'B': ['B4', 'B5', 'B6', 'B7'], 'C': ['C4', 'C5', 'C6', 'C7'], 'D': ['D4', 'D5', 'D6', 'D7']}, index=[4, 5, 6, 7])
>>> tab_dataset_var = TabularDataset(merge_on_columns=['A'], index_list=['A'], df1=df_test_one, df2=df_test_two)
- __init__(label_column=None, merge_on_columns=None, index_list=None, na_values=None, **dataset_dict)¶ Creates an instance of TabularDataset.
- create_label_encoding(column_name, ordered_values)¶ Create a label encoding for the given column.
- Parameters:
- column_name: String
- The name of the column you want encoded
- ordered_values: List or Dict
- Either an ordered list of possible column values, or a mapping of column value to label value. Must include all possible values.
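A minimal sketch of the list-or-dict mapping described above. This is illustrative only; `build_label_encoding` is a hypothetical helper, not the vulcanai implementation, which operates on the dataframe in place.

```python
# Sketch: map a column's values to integer labels. A list defines labels
# by position; a dict is taken as an explicit value -> label mapping.

def build_label_encoding(ordered_values):
    if isinstance(ordered_values, dict):
        return dict(ordered_values)
    return {value: label for label, value in enumerate(ordered_values)}

column = ['low', 'high', 'medium', 'low']
mapping = build_label_encoding(['low', 'medium', 'high'])
encoded = [mapping[v] for v in column]
print(encoded)  # [0, 2, 1, 0]
```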
- create_one_hot_encoding(column_name, prefix_sep='@')¶ Create one-hot encoding for the given column.
- Parameters:
- column_name: String
- The name of the column you want to one-hot encode
- prefix_sep: String default(“@”)
- The prefix separator used when creating a one-hot encoding
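The prefix_sep naming convention can be illustrated with a small stdlib-only sketch (the real method works on the backing pd.DataFrame; `one_hot_encode` here is a hypothetical stand-in):

```python
# Sketch: expand one column into "<column><prefix_sep><value>" indicator
# columns, mirroring the prefix_sep naming convention described above.

def one_hot_encode(column_name, values, prefix_sep='@'):
    categories = sorted(set(values))
    columns = {}
    for cat in categories:
        key = f'{column_name}{prefix_sep}{cat}'
        columns[key] = [1 if v == cat else 0 for v in values]
    return columns

encoded = one_hot_encode('color', ['red', 'blue', 'red'])
print(encoded)  # {'color@blue': [0, 1, 0], 'color@red': [1, 0, 1]}
```

Keeping the original column name on the left of prefix_sep is what later allows reverse_create_one_hot_encoding to regroup the indicator columns.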
- delete_column(column_name)¶ Deletes the given column.
- Parameters:
- column_name: String
- The name of the column you want deleted
- identify_highly_correlated(threshold)¶ Identify columns that are highly correlated with one another.
- Parameters:
- threshold: Float
- Between 0 (weakest correlation) and 1 (strongest correlation). Minimum amount of correlation necessary to be identified.
- Returns:
- column list: List of tuples
- The column pairs with correlation values above the threshold
- identify_low_variance(threshold)¶ Identify columns that have low variance.
- Parameters:
- threshold: Float
- Between 0 and 1. Maximum amount of variance necessary to be identified as low variance.
- Returns:
- variance_dict: Dict
- A dictionary of column names, with the value being their variance.
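A sketch of the assumed behaviour (not the vulcanai implementation, which reads columns from the backing dataframe): compute each column's variance and keep those below the threshold, assuming the columns are scaled so their variance is comparable to a 0-to-1 threshold.

```python
import statistics

# Sketch: report each column whose population variance falls below threshold.
def identify_low_variance(columns, threshold):
    variance_dict = {}
    for name, values in columns.items():
        var = statistics.pvariance(values)
        if var < threshold:
            variance_dict[name] = var
    return variance_dict

cols = {
    'nearly_constant': [0.5, 0.5, 0.5, 0.51],  # variance close to zero
    'spread': [0.0, 1.0, 0.0, 1.0],            # variance 0.25
}
print(identify_low_variance(cols, threshold=0.1))
```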
- identify_sufficient_non_null(threshold)¶ Return columns where at least threshold percent of the values are non-null.
- Parameters:
- threshold: Float
- A number between 0 and 1, representing the proportion of non-null values.
- Returns:
- cols: List
- A list of those columns with at least threshold percentage of non-null values
- identify_unbalanced_columns(threshold, non_numeric=True)¶ Return columns that are highly unbalanced, i.e. those that have a disproportionate amount of one value.
- Parameters:
- threshold: Float
- Proportion needed to be considered unbalanced, between 0 and 1. A value closer to 0 identifies a lesser proportion of the one value (less imbalanced).
- non_numeric: Boolean
- Whether non-numeric columns are also considered.
- Returns:
- column_list: List
- The list of column names
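The "disproportionate amount of one value" check can be sketched with a Counter. This is an assumed interpretation of the semantics, not the actual implementation:

```python
from collections import Counter

# Sketch (assumed semantics): a column counts as unbalanced when its most
# common value accounts for at least `threshold` of the entries.
def identify_unbalanced_columns(columns, threshold):
    unbalanced = []
    for name, values in columns.items():
        (_, count), = Counter(values).most_common(1)
        if count / len(values) >= threshold:
            unbalanced.append(name)
    return unbalanced

cols = {
    'mostly_zero': [0, 0, 0, 0, 0, 0, 0, 1],  # 7/8 of entries are 0
    'balanced': [0, 1, 0, 1, 0, 1, 0, 1],     # 50/50 split
}
print(identify_unbalanced_columns(cols, threshold=0.8))  # ['mostly_zero']
```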
- identify_unique(threshold)¶ Return columns that have at least threshold unique values.
- Parameters:
- threshold: Int
- The minimum number of unique values needed. Must be greater than 1; this is a count of values, not a proportion between 0 and 1.
- Returns:
- column_list: List
- The list of columns having at least threshold unique values
- list_all_column_values(column_name)¶ Return a list of all values in this column.
- Parameters:
- column_name: String
- Name of the column
- list_all_features()¶ Lists all features (columns).
- merge_dataframe(merge_on_columns=None, index_list=None, na_values=None, **dataset_dict)¶ Merges additional data into a TabularDataset instance.
- Parameters:
- merge_on_columns: list of strings
- Key(s) that specify which columns to use to uniquely stitch datasets (default None)
- index_list: list of strings
- List of columns to make the index of the dataframe
- na_values: The values to convert to NaN when reading from csv
- dataset_dict: keyword parameter, value is dataframe or path string
- pandas dataframe assigned to keyword argument that produces a dictionary variable.
- print_column_data_types()¶ Prints the data types of all columns.
- replace_value_in_column(column_name, current_values, target_values)¶ Replaces one or more values in the given column.
- Parameters:
- column_name: String
- current_values: List
- Must be existing values
- target_values: List
- Must be valid for pandas dataframe
- reverse_create_one_hot_encoding(column_name=None, prefix_sep='@')¶ Undo the creation of one-hot encodings, provided prefix_sep was used to create the one-hot encodings and nowhere else. If a column_name is provided, only that column will be reverse-encoded; otherwise all will be.
- Parameters:
- column_name: String
- The name of the column to reverse the one-hot encoding for
- prefix_sep: String default(“@”)
- The prefix separator used when creating a one-hot encoding
- save_dataframe(file_path)¶ Save the dataframe to a file.
- Parameters:
- file_path: String
- Path to the file where you want your dataframe to be saved
- set_global_random_seed(seed_value)¶ Initializes the random state using the given seed_value.
- Parameters:
- seed_value: Int
- The seed value
- split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)¶ Create train-test(-validation) splits from the instance’s examples. The function signature is borrowed from torchtext in an effort to maintain consistency (https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py), and the implementation is partially modified from https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test
- Parameters:
- split_ratio: Float
- A number in [0, 1] denoting the amount of data to be used for the training split (the rest is used for validation), or a list of numbers denoting the relative sizes of the train, test and validation splits respectively. If the relative size for validation is missing, only the train-test split is returned. Default is 0.7 (for the train set).
- stratified: Boolean
- Whether the sampling should be stratified. Default is False.
- strata_field: String
- Name of the examples' Field stratified over. Default is ‘label’ for the conventional label field.
- random_state: Int
- The random seed used for shuffling.
- Returns:
- datasets: Tuple of TabularDatasets
- Datasets for train, validation, and test splits in that order, if the splits are provided.
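The shuffle-then-slice mechanics behind such a split can be sketched over row indices. This is an illustrative sketch in the spirit of the torchtext helper the method borrows from; `rationed_split` here is a hypothetical stand-in, not the TabularDataset method:

```python
import random

# Sketch: shuffle row indices with a seeded RNG, then slice them into
# train / test / (optional) validation partitions by ratio.
def rationed_split(n_examples, train_ratio, test_ratio, val_ratio=0.0,
                   random_state=None):
    indices = list(range(n_examples))
    random.Random(random_state).shuffle(indices)
    n_train = int(round(train_ratio * n_examples))
    n_test = int(round(test_ratio * n_examples))
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:] if val_ratio else []
    return train, test, val

train, test, val = rationed_split(10, 0.7, 0.3, random_state=42)
print(len(train), len(test), len(val))  # 7 3 0
```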
vulcanai.datasets.utils module¶
This file contains utility methods that may be useful to several dataset classes. check_split_ratio, stratify, rationed_split, and RandomShuffler were all copied from torchtext, because torchtext is not yet packaged for anaconda and is therefore not yet a reasonable dependency. See https://github.com/pytorch/text/blob/master/torchtext/data/dataset.py
- vulcanai.datasets.utils.check_split_ratio(split_ratio)¶ Check that the split ratio argument is not malformed.
- Parameters:
- split_ratio: desired split ratio, either a list of length 2 or 3 depending on whether a validation set is desired.
- Returns:
- split ratio as tuple
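A sketch of what such validation might look like (assumed behaviour, not the torchtext/vulcanai implementation): a single float is treated as a train fraction, and a 2- or 3-element list must sum to 1.

```python
# Sketch of split-ratio validation: normalize the argument to a tuple,
# rejecting malformed inputs.
def check_split_ratio(split_ratio):
    if isinstance(split_ratio, float):
        if not 0.0 < split_ratio < 1.0:
            raise ValueError('split_ratio must be in (0, 1)')
        return split_ratio, 1.0 - split_ratio
    if isinstance(split_ratio, list) and len(split_ratio) in (2, 3):
        if abs(sum(split_ratio) - 1.0) > 1e-9:
            raise ValueError('split ratios must sum to 1')
        return tuple(split_ratio)
    raise ValueError('split_ratio must be a float or a list of length 2 or 3')

print(check_split_ratio([0.6, 0.2, 0.2]))  # (0.6, 0.2, 0.2)
```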
- vulcanai.datasets.utils.clean_dataframe(df)¶ Goes through the dataframe and ensures that all nonsensical values are encoded as NaNs.
- vulcanai.datasets.utils.stitch_datasets(df_main=None, merge_on_columns=None, index_list=None, **dataset_dict)¶ Produce a single dataset from multiple datasets.
- Parameters:
- df_dict : dictionary of dataframes to concatenate
- dictionary {key = df name: value = dataframe} of dataframes to stitch together.
- merge_on_columns : list of strings
- key(s) that specify which columns to use to uniquely stitch datasets (default None)
- index_list: list of strings
- columns to establish as the index for the final stitched dataset (default None)
- dataset_dict : keyword parameter, value is dataframe
- pandas dataframe assigned to keyword argument that produces a dictionary variable.
- Returns:
- merged_df : dataframe
- concatenated dataframe
Example:
>>> df_test_one = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'], 'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']}, index=[0, 1, 2, 3])
>>> df_test_two = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 'B': ['B4', 'B5', 'B6', 'B7'], 'C': ['C4', 'C5', 'C6', 'C7'], 'D': ['D4', 'D5', 'D6', 'D7']}, index=[4, 5, 6, 7])
>>> df_stitched = stitch_datasets(merge_on_columns=['A'], index_list=['A'], df1=df_test_one, df2=df_test_two)