Classification

`classification.naive_bayes.gaussian`

`GaussianNaiveBayes`

Defines a Gaussian Naive Bayes classifier in Python using TensorFlow.

Gaussian Naive Bayes is a variant of the Naive Bayes classification algorithm that assumes each feature follows a normal (Gaussian) distribution within each class. It is called "naive" because it assumes that the features are conditionally independent of each other given the class, an assumption that rarely holds in real-world applications. Despite this, Gaussian Naive Bayes performs surprisingly well in many cases.
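To make the model concrete, here is a minimal NumPy sketch of Gaussian Naive Bayes using closed-form estimates of the priors, means, and variances. It is not this class's implementation (which, per the `fit` description, uses TensorFlow and gradient descent); the function names `fit_gaussian_nb` and `predict_gaussian_nb` and the `var_eps` parameter are assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_nb(features, labels, var_eps=1e-9):
    """Estimate class priors and per-class feature means/variances in closed form."""
    classes = np.unique(labels)
    priors = np.array([np.mean(labels == c) for c in classes])
    means = np.array([features[labels == c].mean(axis=0) for c in classes])
    # var_eps (hypothetical) keeps variances strictly positive
    variances = np.array([features[labels == c].var(axis=0) for c in classes]) + var_eps
    return classes, priors, means, variances

def predict_gaussian_nb(features, classes, priors, means, variances):
    """Pick the class with the highest log posterior for each sample."""
    diff = features[:, None, :] - means[None, :, :]            # (N, C, F)
    # Sum of per-feature Gaussian log densities: the "naive" independence assumption
    log_likelihood = -0.5 * np.sum(
        np.log(2.0 * np.pi * variances)[None, :, :] + diff ** 2 / variances[None, :, :],
        axis=2,
    )
    log_posterior = np.log(priors)[None, :] + log_likelihood   # (N, C)
    return classes[np.argmax(log_posterior, axis=1)]

# Toy data: two well-separated clusters
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1], [5.0, 6.0], [5.1, 5.8], [4.9, 6.2]])
y = np.array([0, 0, 0, 1, 1, 1])
params = fit_gaussian_nb(X, y)
predictions = predict_gaussian_nb(np.array([[1.1, 2.0], [5.0, 6.1]]), *params)
```

Working in log space avoids numerical underflow when many per-feature densities are multiplied together.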

`evaluate(features, labels)`

The evaluate function calculates the accuracy of the predictions made by the model.

Parameters

features : np.ndarray
    The `features` parameter is a np.ndarray that contains the input features to evaluate the
    model on. It has a shape of `(num_samples, num_features)`, where `num_samples` is the number of
    samples and `num_features` is the number of features for each sample.

labels : np.ndarray
    The `labels` parameter is a np.ndarray that contains the true class label for each data point
    in the `features` array. Element `i` of `labels` is the class label of row `i` of `features`.

Returns

accuracy : float
    The `accuracy` value is a float that represents the accuracy of the predictions made by the
    model on the given features and labels.
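As a rough illustration of what such an accuracy value means, the sketch below computes the fraction of predictions that match the true labels. The helper name `accuracy_score` is an assumption; the class may compute accuracy differently internally.

```python
import numpy as np

def accuracy_score(labels, predictions):
    """Fraction of predictions that equal the true labels (hypothetical helper)."""
    labels = np.asarray(labels)
    predictions = np.asarray(predictions)
    return float(np.mean(labels == predictions))

# Three of the four predictions match the labels
accuracy = accuracy_score([0, 1, 1, 0], [0, 1, 0, 0])
```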

`fit(features, labels, epochs=10)`

The fit function trains a Gaussian Naive Bayes classifier using TensorFlow: it calculates the class priors and the per-feature parameters, then optimizes them using gradient descent.

Parameters

features : np.ndarray
    The `features` parameter is a np.ndarray that contains the input features for training the
    model. It has a shape of `(num_samples, num_features)`, where `num_samples` is the number of
    training samples and `num_features` is the number of features for each sample.

labels : np.ndarray
    The `labels` parameter is a np.ndarray that contains the class label for each data point in the
    `features` array. Element `i` of `labels` is the class label of row `i` of `features`.

epochs : int
    The `epochs` parameter is an integer that sets the number of epochs to train the model for.
    An epoch is one full pass over the entire training dataset. For example, if the training dataset
    has 1000 samples and is processed in batches of 100, one epoch takes 10 iterations.
    The default value is 10.
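The arithmetic in the epoch example can be sketched as follows. Note that `batch_size` is a hypothetical quantity here, since it is not a parameter of this `fit` signature, and `iterations_per_epoch` is an illustrative helper.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample once (last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

steps = iterations_per_epoch(1000, 100)   # iterations in one epoch
total_steps = steps * 10                  # iterations over the default 10 epochs
```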

`predict(features)`

The predict function takes in a set of features and returns the predicted class labels using a Gaussian Naive Bayes classifier.

Parameters

features : np.ndarray
    The `features` parameter is a np.ndarray that contains the input features to predict class
    labels for. It has a shape of `(num_samples, num_features)`, where `num_samples` is the number
    of samples and `num_features` is the number of features for each sample.

Returns

predictions : np.ndarray
    The `predictions` value is a np.ndarray that contains the predicted class label for each
    data point in the `features` array. Element `i` of `predictions` is the predicted class label
    of row `i` of `features`.

`classification.naive_bayes.bernoulli`

`BernoulliNaiveBayes`

Defines a class for a probabilistic classifier that can be trained on features and labels, and used to make predictions and evaluate the accuracy of the predictions.

Bernoulli Naive Bayes is a probabilistic classifier that assumes that the features are binary (0 or 1). It is based on Bayes' theorem, which states that the probability of a hypothesis (class) given the data (features) equals the probability of the data given the hypothesis, multiplied by the probability of the hypothesis, divided by the probability of the data. In other words, the posterior probability of a hypothesis given the data equals the likelihood of the data given the hypothesis, multiplied by the prior probability of the hypothesis, divided by the marginal likelihood of the data.
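A small worked example may help make Bayes' theorem concrete. All numbers below are invented for illustration (a hypothetical spam filter with a single binary feature), and `posterior` is an illustrative helper, not part of this module.

```python
def posterior(prior, likelihood, marginal):
    """Bayes' theorem: P(class | data) = P(data | class) * P(class) / P(data)."""
    return likelihood * prior / marginal

# Hypothetical numbers: 20% of messages are spam; a given word appears in
# 60% of spam messages and in 5% of non-spam messages.
prior_spam = 0.2
p_word_given_spam = 0.6
p_word_given_ham = 0.05

# Marginal likelihood of the data, summed over both classes
marginal = p_word_given_spam * prior_spam + p_word_given_ham * (1 - prior_spam)

p_spam_given_word = posterior(prior_spam, p_word_given_spam, marginal)
```

Here the marginal is 0.16, so the posterior probability of spam given the word is 0.12 / 0.16 = 0.75.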

`evaluate(features, labels)`

The evaluate function calculates the accuracy of the predictions made by the model.

Parameters

features : np.ndarray
    The `features` parameter is a np.ndarray that contains the input features to evaluate the
    model on. It has a shape of `(num_samples, num_features)`, where `num_samples` is the number of
    samples and `num_features` is the number of features for each sample.

labels : np.ndarray
    The `labels` parameter is a np.ndarray that contains the true class label for each data point
    in the `features` array. Element `i` of `labels` is the class label of row `i` of `features`.

Returns

accuracy : float
    The `accuracy` value is a float that represents the accuracy of the model on the given
    features and labels.

`fit(features, labels, epochs=100, learning_rate=0.01, verbose=True, smoothing_factor=0.9)`

The fit function trains a probabilistic classifier on the given features and labels, optimizing the feature probabilities using gradient descent with a momentum-like update.

Parameters

features : np.ndarray
    The `features` parameter is a np.ndarray that contains the input features for training the
    model. It has a shape of `(num_samples, num_features)`, where `num_samples` is the number of
    training samples and `num_features` is the number of features for each sample.

labels : np.ndarray
    The `labels` parameter is a np.ndarray that contains the class label for each data point in the
    `features` array. Element `i` of `labels` is the class label of row `i` of `features`.

epochs : int
    The `epochs` parameter is an integer that represents the number of epochs to train the model for.
    An epoch is one iteration over the entire training dataset.

learning_rate : float
    The `learning_rate` parameter is a float that controls the size of the gradient descent step.

verbose : bool
    The `verbose` parameter is a boolean that controls whether or not to print the training accuracy
    for each epoch.

smoothing_factor : float
    The `smoothing_factor` parameter is a float that controls the amount of smoothing to apply to the
    feature probabilities. It is used to prevent the probabilities from becoming too extreme.

`predict(features)`

The predict function takes in a set of features, converts them to binary values, calculates the log probabilities for each class, and returns the predicted class for each sample.

Parameters

features : np.ndarray
    The `features` parameter is a np.ndarray that contains the input features to predict class
    labels for. It has a shape of `(num_samples, num_features)`, where `num_samples` is the number
    of samples and `num_features` is the number of features for each sample.
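The prediction steps described for this method (binarize, compute per-class log probabilities, take the best class) can be sketched in NumPy as below. This is only an illustration under assumptions: the function name, the `threshold` used for binarization, and the layout of `feature_probs` as `(num_classes, num_features)` are all hypothetical.

```python
import numpy as np

def bernoulli_nb_predict(features, class_log_priors, feature_probs, threshold=0.5):
    """Return the index of the most probable class for each sample."""
    # Step 1: convert features to binary values (threshold is an assumption)
    binary = (np.asarray(features) > threshold).astype(float)
    # Clip so that log(0) never occurs
    probs = np.clip(feature_probs, 1e-9, 1 - 1e-9)
    # Step 2: log P(class) + sum over features of log P(feature value | class)
    log_probs = (
        class_log_priors
        + binary @ np.log(probs).T
        + (1 - binary) @ np.log(1 - probs).T
    )
    # Step 3: pick the class with the highest log probability
    return np.argmax(log_probs, axis=1)

# P(feature = 1 | class), one row per class
feature_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
class_log_priors = np.log(np.array([0.5, 0.5]))
predictions = bernoulli_nb_predict(
    np.array([[1.0, 0.0], [0.0, 1.0]]), class_log_priors, feature_probs
)
```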

`classification.logistic_regression`

`LogisticRegression`

Defines a class that implements a logistic regression model with various parameters and methods for training, testing, and evaluating the model.

Logistic regression is a classification algorithm used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.

`evaluate(X, y)`

The evaluate function returns the score of a model on a given dataset.

Parameters

X : np.ndarray
    The parameter X represents the input data or features that will be used to make predictions or
    classifications. It could be a matrix or an array-like object

y : np.ndarray
    The parameter `y` represents the target variable or the dependent variable. It is the variable
    that we are trying to predict or model. In the context of machine learning, `y` typically
    represents the labels or classes of the data

Returns

tuple(float, float)
    The function returns a tuple containing the accuracy and loss of the model on the given dataset.

`fit(X, y, X_val=None, y_val=None, random_seed=None)`

The fit function trains a logistic regression model using mini-batch gradient descent with early stopping and learning rate scheduling.

Parameters

X : np.ndarray
    The parameter X represents the input features or data that you want to use to train the model

y : np.ndarray
    The parameter `y` represents the target variable or the dependent variable in the supervised
    learning problem. It is a np.ndarray containing the true values of the target variable for the
    corresponding samples in the input data `X`

X_val : np.ndarray, optional
    The parameter `X_val` represents the input features or data that you want to use to validate the
    model. It is a np.ndarray containing the features of the validation set

y_val : np.ndarray, optional
    The parameter `y_val` represents the target variable or the dependent variable in the supervised
    learning problem. It is a np.ndarray containing the true values of the target variable for the
    corresponding samples in the input data `X_val`

random_seed : int, optional
    The parameter `random_seed` is used to set the random seed for reproducibility. By setting a
    specific value for `random_seed`, you can ensure that the random initialization of the model's
    weights and any other random operations are consistent across different runs of the code.
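For orientation, mini-batch gradient descent with early stopping on a validation set can be sketched in NumPy as follows. This is a simplified illustration, not the class's TensorFlow implementation: it omits learning rate scheduling and regularization, and the function name, default values, and the choice to append the bias column last are assumptions.

```python
import numpy as np

def fit_logistic(X, y, X_val, y_val, learning_rate=0.1, num_epochs=200,
                 batch_size=4, patience=5, tolerance=1e-4, random_seed=0):
    """Train binary logistic regression; stop when validation loss stalls."""
    rng = np.random.default_rng(random_seed)             # seeded for reproducibility
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])        # append bias column
    Xb_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
    weights = np.zeros(Xb.shape[1])
    best_loss, best_weights, stale_epochs = np.inf, weights.copy(), 0

    def loss(Xm, ym, w):
        p = np.clip(1.0 / (1.0 + np.exp(-(Xm @ w))), 1e-12, 1 - 1e-12)
        return float(-np.mean(ym * np.log(p) + (1 - ym) * np.log(1 - p)))

    for epoch in range(num_epochs):
        order = rng.permutation(Xb.shape[0])             # reshuffle every epoch
        for start in range(0, Xb.shape[0], batch_size):
            idx = order[start:start + batch_size]
            p = 1.0 / (1.0 + np.exp(-(Xb[idx] @ weights)))
            gradient = Xb[idx].T @ (p - y[idx]) / len(idx)
            weights -= learning_rate * gradient
        val_loss = loss(Xb_val, y_val, weights)
        if val_loss < best_loss - tolerance:             # meaningful improvement
            best_loss, best_weights, stale_epochs = val_loss, weights.copy(), 0
        else:
            stale_epochs += 1
            if stale_epochs >= patience:                 # early stopping
                break
    return best_weights, best_loss

X = np.array([[-2.0], [-1.0], [-1.5], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_val = np.array([[-1.2], [1.2]])
y_val = np.array([0, 1])
weights, best_loss = fit_logistic(X, y, X_val, y_val)
```

Returning the weights from the best validation epoch, rather than the last one, is a common design choice with early stopping.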

`get_params()`

The function get_params returns a dictionary containing the values of various parameters.

Returns

dict
    A dictionary containing the values of the learning_rate, num_epochs, reg_strength, batch_size,
    early_stopping_patience, regularization, and tolerance attributes.

`predict(X)`

The predict function takes in a set of input data X and returns the predicted class labels based on the highest probability from the predict_proba function.

Parameters

X : np.ndarray
    The parameter `X` represents the input data for which you want to make predictions. It could be
    a single data point or a collection of data points. The number of features in `X` should match
    the number of features in the training data used to train the model.

Returns

np.ndarray
    The `predict` function returns a np.ndarray containing the predicted class labels for each data
    point in `X`.
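Turning probabilities into class labels can be sketched as below. How the class handles the binary versus multi-class case is not stated in this reference, so both branches, the `threshold` parameter, and the helper name `predict_from_proba` are assumptions.

```python
import numpy as np

def predict_from_proba(probabilities, threshold=0.5):
    """Map predicted probabilities to class labels (illustrative helper)."""
    probabilities = np.asarray(probabilities)
    if probabilities.ndim == 1:
        # Binary case: one probability of the positive class per sample
        return (probabilities >= threshold).astype(int)
    # Multi-class case: pick the column with the highest probability
    return np.argmax(probabilities, axis=1)

binary_labels = predict_from_proba([0.1, 0.7, 0.5])
multiclass_labels = predict_from_proba([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]])
```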

`predict_proba(X)`

The predict_proba function takes in a set of features X, scales the features, adds a column of ones to the scaled features, performs matrix multiplication with the coefficients, applies the sigmoid function to the logits, and returns the probabilities.

Parameters

X : np.ndarray
    The parameter `X` represents the input data for which you want to predict probabilities. It
    could be a single data point or a collection of data points. The number of features in `X`
    should match the number of features in the training data used to train the model.

Returns

np.ndarray
    The `predict_proba` function returns a np.ndarray containing the predicted probabilities for
    each class.
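The steps listed for `predict_proba` (scale, add a column of ones, multiply by the coefficients, apply the sigmoid) can be sketched in NumPy for the binary case. The function name, the standardization by `mean` and `std`, and placing the ones column first are assumptions made for this illustration.

```python
import numpy as np

def predict_proba_sketch(X, mean, std, coefficients):
    """Binary logistic regression probabilities, step by step."""
    X_scaled = (X - mean) / std                                  # 1. scale features
    ones = np.ones((X_scaled.shape[0], 1))
    X_design = np.hstack([ones, X_scaled])                       # 2. add column of ones
    logits = X_design @ coefficients                             # 3. matrix multiplication
    return 1.0 / (1.0 + np.exp(-logits))                         # 4. sigmoid

X = np.array([[1.0], [3.0]])
mean, std = np.array([2.0]), np.array([1.0])
coefficients = np.array([0.0, 2.0])    # hypothetical [intercept, slope]
probabilities = predict_proba_sketch(X, mean, std, coefficients)
```

With these values the logits are -2 and 2, so the two probabilities are symmetric around 0.5.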

`score(X, y)`

The function calculates the accuracy and loss of a binary classification model using TensorFlow's binary cross-entropy loss function.

Parameters

X : np.ndarray
    The parameter X represents the input data or features that will be used to make predictions or
    classifications. It could be a matrix or an array-like object

y : np.ndarray
    The parameter `y` represents the target variable or the dependent variable. It is the variable
    that we are trying to predict or model. In the context of machine learning, `y` typically
    represents the labels or classes of the data

Returns

tuple(float, float)
    The function returns a tuple containing the accuracy and loss of the model on the given dataset.
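As a reference point for the returned tuple, binary cross-entropy and accuracy can be computed in plain NumPy as sketched here. The class itself is described as using TensorFlow's loss; the helper name `score_sketch`, the clipping constant `eps`, and the 0.5 decision threshold are assumptions.

```python
import numpy as np

def score_sketch(y_true, y_prob, eps=1e-7):
    """Return (accuracy, binary cross-entropy loss) for predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so that log(0) never occurs
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    loss = float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))
    accuracy = float(np.mean((y_prob >= 0.5) == (y_true == 1)))
    return accuracy, loss

accuracy, loss = score_sketch([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1])
```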

`set_params(params)`

The function sets the parameters for a machine learning model, including learning rate, number of epochs, regularization strength, batch size, early stopping patience, regularization method, and tolerance.

Parameters

params : dict
    A dictionary containing the values of the learning_rate, num_epochs, reg_strength, batch_size,
    early_stopping_patience, regularization, and tolerance attributes.

`sigmoid(z)`

The sigmoid function returns the value of 1 divided by 1 plus the exponential of the negative input value.

The sigmoid function is a mathematical function that maps any real value to a value between 0 and 1. It is a non-linear function used in binary classification to turn a logit into a probability.

Parameters

z : tf.Tensor
    The parameter `z` is a tensor representing the input to the sigmoid function. It can be a 1D or
    2D tensor.

Returns

tf.Tensor
    The sigmoid of `z`, computed element-wise, with the same shape as `z`.
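For illustration, a numerically stable NumPy version of the sigmoid is sketched below. The documented function takes a `tf.Tensor`; this NumPy variant and its branch on the sign of the input are assumptions chosen to avoid overflow in `exp` for large-magnitude inputs.

```python
import numpy as np

def sigmoid(z):
    """Element-wise 1 / (1 + exp(-z)), evaluated without overflow."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    positive = z >= 0
    # For z >= 0, exp(-z) is at most 1, so the direct formula is safe
    out[positive] = 1.0 / (1.0 + np.exp(-z[positive]))
    # For z < 0, use the equivalent form exp(z) / (1 + exp(z))
    exp_z = np.exp(z[~positive])
    out[~positive] = exp_z / (1.0 + exp_z)
    return out

values = sigmoid(np.array([-1000.0, 0.0, 1000.0]))
```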

`softmax(z)`

The softmax function takes in a vector of values and returns a vector of probabilities that sum up to 1.

The softmax function is a mathematical function that takes a vector of real numbers and normalizes it into a probability distribution in which each probability is proportional to the exponential of the corresponding input.

Parameters

z : tf.Tensor
    The parameter `z` is a tensor representing the input to the softmax function. It can be a 1D or
    2D tensor.

Returns

tf.Tensor
    The softmax of `z`: a tensor of probabilities with the same shape as `z`.
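A NumPy sketch of softmax follows. The documented function takes a `tf.Tensor`; this version, including the choice to normalize along the last axis and to subtract the row maximum for numerical stability, is an assumption for illustration.

```python
import numpy as np

def softmax(z):
    """Normalize each row (last axis) into probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    # Subtracting the maximum leaves the result unchanged but prevents overflow
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

probs = softmax(np.array([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]]))
```

The second row shows why the shift matters: without it, `exp(1000.0)` would overflow.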

`train_test_split(X, y, test_size=0.2, random_state=None)`

The function train_test_split splits the input data X and target variable y into training and testing sets based on the specified test size and random state.

Parameters

X : np.ndarray
    The parameter X represents the input features or data that you want to split into training and
    testing sets

y : np.ndarray
    The parameter `y` represents the target variable or the dependent variable in the supervised
    learning problem. It is a np.ndarray containing the true values of the target variable for the
    corresponding samples in the input data `X`

test_size : float, optional
    The parameter `test_size` represents the proportion of the dataset that should be allocated to
    the test set. The default value is 0.2, which means that 20% of the data will be used for testing

random_state : int, optional
    The parameter `random_state` is used to set the random seed for reproducibility. By setting a
    specific value for `random_state`, you can ensure that the data is split into the same
    training and testing sets across different runs of the code.

Returns

np.ndarray
    The function returns four np.ndarrays: X_train, X_test, y_train, and y_test. X_train and y_train
    hold the training set, and X_test and y_test hold the test set.
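A shuffled split of this kind can be sketched in NumPy as below. The module's actual implementation is not shown in this reference, so the function name `train_test_split_sketch`, the use of a permutation, and the rounding of the test set size are assumptions.

```python
import numpy as np

def train_test_split_sketch(X, y, test_size=0.2, random_state=None):
    """Shuffle sample indices, then split X and y with the same indices."""
    rng = np.random.default_rng(random_state)     # seeded for reproducible splits
    indices = rng.permutation(X.shape[0])
    num_test = int(round(X.shape[0] * test_size))
    test_idx, train_idx = indices[:num_test], indices[num_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split_sketch(
    X, y, test_size=0.2, random_state=42
)
```

Indexing `X` and `y` with the same index arrays keeps each feature row aligned with its label.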

`classification.decision_tree`

`DecisionTree`

Defines a class that implements a machine learning model using TensorFlow Decision Forests (TF-DF) for classification or regression tasks, including methods for loading datasets, training the model, making predictions, and evaluating the model's performance.

TensorFlow Decision Forests (TF-DF) is an open-source library for training, running, and interpreting decision forest models, such as Random Forests and Gradient Boosted Trees, in TensorFlow. It supports classification, regression, ranking, and uplifting tasks.

Decision trees (DT) are a type of supervised learning algorithm that can be used for both classification and regression tasks. They are a popular choice for many machine learning problems because they are easy to understand and interpret, and they can be used to solve a wide variety of problems.

CART (Classification and Regression Trees) is a decision tree learning algorithm that, for classification, uses the Gini impurity measure to determine the best split at each node.
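Gini impurity, and how it scores a candidate split, can be sketched as follows. The helper names are illustrative; TF-DF computes splits internally and does not expose these functions.

```python
import numpy as np

def gini_impurity(labels):
    """1 - sum of squared class proportions; 0.0 means the node is pure."""
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return float(1.0 - np.sum(proportions ** 2))

def weighted_split_impurity(left_labels, right_labels):
    """Impurity of a split: child impurities weighted by child size."""
    n_left, n_right = len(left_labels), len(right_labels)
    total = n_left + n_right
    return (n_left / total) * gini_impurity(left_labels) + (
        n_right / total
    ) * gini_impurity(right_labels)

pure = gini_impurity([1, 1, 1, 1])
mixed = gini_impurity([0, 0, 1, 1])
split = weighted_split_impurity([0, 0, 0], [1, 1, 0])
```

At each node, the split with the lowest weighted impurity among the candidates is the one chosen.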

Random forests (RFs) are an ensemble learning method that trains multiple decision trees, each on a random sample of the data, and aggregates their predictions to create a more powerful model than any single tree.

Gradient Boosted Trees (GBTs) are an ensemble learning method that builds decision trees sequentially, with each new tree trained to correct the errors of the trees before it. Like single decision trees, they can be used for both classification and regression tasks.

`evaluate()`

The evaluate function evaluates a machine learning model on a test dataset and returns the evaluation metrics.

Returns

The evaluation results of the model on the test dataset. It returns a dictionary with the evaluation
metrics as keys and their corresponding values.

`fit(early_stopping_patience=5, learning_rate=0.001, momentum=0.9, _metrics=['accuracy'])`

The fit function trains a model using stochastic gradient descent optimizer with early stopping and specified hyperparameters.

Parameters

early_stopping_patience : int, optional
    The `early_stopping_patience` parameter is the number of epochs to wait for the validation loss
    to improve. If the validation loss does not improve for this many consecutive epochs, training
    is stopped early. Defaults to 5.

learning_rate : float, optional
    The learning rate determines the step size at each iteration while training the model. It
    controls how much the model's weights are updated based on the calculated gradients. A higher
    learning rate can result in faster convergence but may also cause the model to overshoot the
    optimal solution. A lower learning rate converges more slowly but is usually more stable.
    Defaults to 0.001.

momentum : float, optional
    Momentum is a hyperparameter used in optimization algorithms, such as Stochastic Gradient
    Descent (SGD), to accelerate convergence and help escape local minima. It determines the
    contribution of the previous update to the current update of the model's weights. Defaults to
    0.9.

_metrics : list, optional
    `_metrics` is a list of metrics used to evaluate the model's performance during training and
    validation. These metrics can include accuracy, precision, recall, F1 score, etc. Defaults to
    ["accuracy"].
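The roles of `momentum` and `early_stopping_patience` can be sketched in plain Python. Both helpers below are illustrations under assumptions: `sgd_momentum_step` follows the common velocity form of SGD with momentum, and `early_stopping_epoch` assumes that only a strictly lower validation loss counts as an improvement.

```python
def sgd_momentum_step(weight, velocity, gradient, learning_rate=0.001, momentum=0.9):
    """One SGD-with-momentum update: the previous update feeds into the new one."""
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity

def early_stopping_epoch(val_losses, patience=5):
    """Index of the epoch at which training would stop for the given loss history."""
    best_loss, epochs_without_improvement = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, epochs_without_improvement = loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1     # never triggered: ran to the last epoch

# Minimize f(w) = w**2, whose gradient is 2 * w
weight, velocity = 5.0, 0.0
for _ in range(2000):
    weight, velocity = sgd_momentum_step(weight, velocity, 2.0 * weight, learning_rate=0.01)

# The best loss occurs at epoch 2; five epochs later without improvement, training stops
stop_epoch = early_stopping_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.7, 0.73, 0.74, 0.6])
```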

`get_params()`

The function get_params returns the configuration of the model.

Returns

The configuration of the model.

`info()`

The info function returns a summary of the model.

Returns

A summary of the model.

`load_dataset(dataset_df, label, test_ratio=0.2)`

The function load_dataset takes a dataset dataframe, splits it into train, validation, and test sets, converts them into TensorFlow datasets, and calculates class weights.

Parameters

dataset_df : pandas.DataFrame
    The dataset_df parameter is a pandas DataFrame that contains the dataset you want to load. It
    should have the features as columns and the corresponding labels as a separate column

label : str
    The "label" parameter is the column name of the target variable in the dataset. It is the
    variable that you want to predict or classify

test_ratio : float, optional
    The `test_ratio` parameter is the ratio of the dataset that should be used for testing. It
    determines the proportion of the dataset that will be split into the test set. The remaining
    portion of the dataset will be used for training and validation, by default 0.2
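This reference does not say how the class weights are calculated. One common scheme, shown here purely as an assumption, is "balanced" weighting, in which each class weight is inversely proportional to the class frequency. The helper name `balanced_class_weights` is hypothetical.

```python
import numpy as np

def balanced_class_weights(labels):
    """weight(c) = num_samples / (num_classes * count(c)); rarer classes weigh more."""
    classes, counts = np.unique(labels, return_counts=True)
    weights = counts.sum() / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Class 0 has three samples, class 1 has one
class_weights = balanced_class_weights(np.array([0, 0, 0, 1]))
```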

`predict(length=3, split='test')`

The predict function takes in a length and split parameter, and returns the predictions made by the model on the specified dataset split.

Parameters

length : int, optional
    The `length` parameter specifies the number of samples to take from the dataset for
    prediction. Defaults to 3.

split : str, optional
    The `split` parameter determines whether to use the test or train dataset for prediction. If
    `split` is set to "test", the function uses the test dataset; if it is set to "train", the
    function uses the train dataset. Defaults to "test".

Returns

The predictions made by the model on the specified dataset (either the test dataset or the train
dataset).