Classification
classification.naive_bayes.gaussian
GaussianNaiveBayes
Defines a Gaussian Naive Bayes classifier in Python using TensorFlow.
Gaussian Naive Bayes is a variant of the Naive Bayes classification algorithm that assumes each feature follows a normal (Gaussian) distribution within each class. It is called "naive" because it assumes that the features are conditionally independent given the class, which is rarely true in real-world applications. Despite this assumption, Gaussian Naive Bayes performs surprisingly well in many cases.
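The scoring rule behind this description can be sketched in a few lines. This is a minimal illustration in plain Python, not the code of `GaussianNaiveBayes`; the helper names and the per-class priors, means, and variances in `params` are hypothetical values chosen for the example.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a normal distribution N(mean, var) evaluated at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def class_log_score(sample, log_prior, means, variances):
    """log P(class) plus the sum of per-feature log densities.

    Summing the per-feature terms is exactly the "naive" independence assumption.
    """
    return log_prior + sum(
        gaussian_log_pdf(x, m, v) for x, m, v in zip(sample, means, variances)
    )

# Hypothetical fitted parameters: two classes, two features per sample.
params = {
    0: {"log_prior": math.log(0.5), "means": [0.0, 0.0], "variances": [1.0, 1.0]},
    1: {"log_prior": math.log(0.5), "means": [3.0, 3.0], "variances": [1.0, 1.0]},
}

def predict_one(sample):
    """Return the class with the highest log score for one sample."""
    scores = {c: class_log_score(sample, **p) for c, p in params.items()}
    return max(scores, key=scores.get)

print(predict_one([0.2, -0.1]))  # 0
print(predict_one([2.9, 3.2]))   # 1
```

Working in log space avoids multiplying many small densities together, which would underflow for samples with many features.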
evaluate(features, labels)
The evaluate function calculates the accuracy of the predictions made by the model.
Parameters
features : np.ndarray
The input features to evaluate the model on, with shape `(num_samples, num_features)`, where
`num_samples` is the number of samples and `num_features` is the number of features for each
sample.
labels : np.ndarray
The true class labels for the data points in the `features` array. Each element of `labels`
is the class label of the corresponding data point in `features`.
Returns
accuracy : float
The accuracy of the model's predictions on `features`, measured against `labels`.
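As a rough illustration of what an accuracy value means, the sketch below computes the fraction of predictions that match the true labels. It is plain Python and an assumption about the metric's definition, not the body of `evaluate`; the function name `accuracy` is hypothetical.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that equal the corresponding true label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))  # 0.75
```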
fit(features, labels, epochs=10)
The `fit` function trains a Gaussian Naive Bayes classifier using TensorFlow by calculating class
priors and feature parameters, and optimizing them using gradient descent.
Parameters
features : np.ndarray
The `features` parameter is a np.ndarray that represents the input features for training the
model. It has a shape of `(num_samples, num_features)`, where `num_samples` is the number of
training samples and `num_features` is the number of features for each sample
labels : np.ndarray
The `labels` parameter is a np.ndarray that contains the class labels for each data point in the
`features` array. Each element in the `labels` array corresponds to the class label of the
corresponding data point in the `features` array.
epochs : int
The `epochs` parameter is an integer that represents the number of epochs to train the model for.
An epoch is one iteration over the entire training dataset. For example, if the training dataset
has 1000 samples and the batch size is 100, then it will take 10 iterations to complete 1 epoch.
The default value is 10.
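The arithmetic in the `epochs` description can be checked with a short sketch. Note that `batch_size` is not an argument of this `fit` method; it appears here only as a hypothetical quantity, as it does in the description's own example.

```python
import math

def iterations_per_epoch(num_samples, batch_size):
    """Number of batches needed to visit every sample once."""
    return math.ceil(num_samples / batch_size)

print(iterations_per_epoch(1000, 100))       # 10
print(10 * iterations_per_epoch(1000, 100))  # 100 iterations for epochs=10
```

When the dataset size is not a multiple of the batch size, the last batch is smaller, which is why the sketch rounds up.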
predict(features)
The `predict` function takes a set of features and returns the predicted class labels using a
Gaussian Naive Bayes classifier.
Parameters
features : np.ndarray
The input features to classify, with shape `(num_samples, num_features)`, where `num_samples`
is the number of samples and `num_features` is the number of features for each sample.
Returns
predictions : np.ndarray
The predicted class labels for each data point in the `features` array. Each element of
`predictions` is the predicted class label of the corresponding data point in `features`.
classification.naive_bayes.bernoulli
BernoulliNaiveBayes
Defines a class for a probabilistic classifier that can be trained on features and labels, and used to make predictions and evaluate the accuracy of the predictions.
Bernoulli Naive Bayes is a probabilistic classifier that assumes that the features are binary (0 or 1). It is based on Bayes' theorem: the posterior probability of a hypothesis (class) given the data (features) is equal to the likelihood of the data given the hypothesis, multiplied by the prior probability of the hypothesis, divided by the marginal likelihood of the data.
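A small worked example may make the statement of Bayes' theorem concrete. The sketch below is plain Python and is not the code of `BernoulliNaiveBayes`; the function names, the class priors, and the per-feature probabilities are hypothetical values chosen for illustration.

```python
import math

def bernoulli_log_likelihood(x, probs):
    """log P(x | class) for binary features x, given P(x_j = 1 | class) per feature."""
    return sum(
        math.log(p) if xj == 1 else math.log(1.0 - p)
        for xj, p in zip(x, probs)
    )

def posterior(x, priors, feature_probs):
    """P(class | x) for every class: likelihood * prior / marginal likelihood."""
    joint = {
        c: priors[c] * math.exp(bernoulli_log_likelihood(x, feature_probs[c]))
        for c in priors
    }
    evidence = sum(joint.values())  # marginal likelihood P(x)
    return {c: j / evidence for c, j in joint.items()}

# Hypothetical model: two classes, two binary features.
priors = {0: 0.6, 1: 0.4}
feature_probs = {0: [0.2, 0.5], 1: [0.8, 0.1]}

post = posterior([1, 0], priors, feature_probs)
print(max(post, key=post.get))  # 1
```

For the sample `[1, 0]`, class 1 has joint probability 0.4 * 0.8 * 0.9 = 0.288 and class 0 has 0.6 * 0.2 * 0.5 = 0.06, so the posterior favors class 1 even though its prior is smaller.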
evaluate(features, labels)
The evaluate function calculates the accuracy of the predictions made by the model.
Parameters
features : np.ndarray
The input features to evaluate the model on, with shape `(num_samples, num_features)`, where
`num_samples` is the number of samples and `num_features` is the number of features for each
sample.
labels : np.ndarray
The true class labels for the data points in the `features` array. Each element of `labels`
is the class label of the corresponding data point in `features`.
Returns
accuracy : float
The accuracy of the model on the given features and labels.
fit(features, labels, epochs=100, learning_rate=0.01, verbose=True, smoothing_factor=0.9)
The `fit` function trains a probabilistic classifier using the given features and labels,
optimizing the feature probabilities using gradient descent with a momentum-like update.
Parameters
features : np.ndarray
The `features` parameter is a np.ndarray that represents the input features for training the
model. It has a shape of `(num_samples, num_features)`, where `num_samples` is the number of
training samples and `num_features` is the number of features for each sample
labels : np.ndarray
The `labels` parameter is a np.ndarray that contains the class labels for each data point in the
`features` array. Each element in the `labels` array corresponds to the class label of the
corresponding data point in the `features` array.
epochs : int
The `epochs` parameter is an integer that represents the number of epochs to train the model for.
An epoch is one iteration over the entire training dataset.
learning_rate : float
The `learning_rate` parameter is a float that controls the size of the gradient descent step.
verbose : bool
The `verbose` parameter is a boolean that controls whether or not to print the training accuracy
for each epoch.
smoothing_factor : float
The `smoothing_factor` parameter is a float that controls the amount of smoothing to apply to the
feature probabilities. It is used to prevent the probabilities from becoming too extreme.
predict(features)
The `predict` function takes a set of features, converts them to binary values, calculates the
log probabilities for each class, and returns the predicted class for each sample.
Parameters
features : np.ndarray
The input features to classify, with shape `(num_samples, num_features)`, where `num_samples`
is the number of samples and `num_features` is the number of features for each sample.
classification.logistic_regression
LogisticRegression
Defines a class that implements a logistic regression model with various parameters and methods for training, testing, and evaluating the model.
Logistic regression is a classification algorithm used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
evaluate(X, y)
The evaluate function returns the score of a model on a given dataset.
Parameters
X : np.ndarray
The parameter X represents the input data or features that will be used to make predictions or
classifications. It could be a matrix or an array-like object
y : np.ndarray
The parameter `y` represents the target variable or the dependent variable. It is the variable
that we are trying to predict or model. In the context of machine learning, `y` typically
represents the labels or classes of the data
Returns
tuple(float, float)
The function returns a tuple containing the accuracy and loss of the model on the given dataset.
fit(X, y, X_val=None, y_val=None, random_seed=None)
The `fit` function trains a logistic regression model using mini-batch gradient descent with early
stopping and learning rate scheduling.
Parameters
X : np.ndarray
The parameter X represents the input features or data that you want to use to train the model
y : np.ndarray
The parameter `y` represents the target variable or the dependent variable in the supervised
learning problem. It is a np.ndarray containing the true values of the target variable for the
corresponding samples in the input data `X`
X_val : np.ndarray, optional
The parameter `X_val` represents the input features or data that you want to use to validate the
model. It is a np.ndarray containing the features of the validation set
y_val : np.ndarray, optional
The parameter `y_val` represents the target variable or the dependent variable in the supervised
learning problem. It is a np.ndarray containing the true values of the target variable for the
corresponding samples in the input data `X_val`
random_seed : int, optional
The parameter `random_seed` is used to set the random seed for reproducibility. By setting a
specific value for `random_seed`, you can ensure that the random initialization of the model's
weights and any other random operations are consistent across different runs of the code.
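The early stopping mentioned for `fit` can be illustrated with a short sketch. It is plain Python and is not the library's implementation; the helper name `epochs_until_stop` and its `patience` and `tolerance` arguments are hypothetical, chosen to mirror the `early_stopping_patience` and `tolerance` attributes listed for `set_params`.

```python
def epochs_until_stop(val_losses, patience=5, tolerance=0.0):
    """Return the 1-based epoch at which training would stop.

    Training stops once the validation loss has failed to improve on the best
    value seen so far by more than `tolerance` for `patience` epochs in a row.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - tolerance:
            best = loss
            waited = 0  # improvement: reset the patience counter
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses)  # never triggered: all epochs were run

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73]
print(epochs_until_stop(losses, patience=2))  # 5
```

In the example, the loss last improves at epoch 3, so with a patience of 2 the loop gives up at epoch 5.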
get_params()
The `get_params` function returns a dictionary containing the values of various parameters.
Returns
dict
A dictionary containing the values of the learning_rate, num_epochs, reg_strength, batch_size,
predict(X)
The `predict` function takes a set of input data `X` and returns the predicted class labels based
on the highest probability from the `predict_proba` function.
Parameters
X : np.ndarray
The parameter X represents the input data for which you want to make predictions. It could be a
single data point or a collection of data points. The shape of X should match the shape of the
training data used to train the model
Returns
np.ndarray
The `predict` function returns a np.ndarray containing the predicted class labels for each data
point in `X`.
predict_proba(X)
The `predict_proba` function takes a set of features `X`, scales the features, adds a column of
ones to the scaled features, performs matrix multiplication with the coefficients, applies the
sigmoid function to the logits, and returns the probabilities.
Parameters
X : np.ndarray
The parameter X represents the input data for which you want to make predictions. It could be a
single data point or a collection of data points. The shape of X should match the shape of the
training data used to train the model
Returns
np.ndarray
The `predict_proba` function returns a np.ndarray containing the predicted probabilities for
each class.
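The steps listed for `predict_proba` can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the method's actual code: it assumes the scaling is standardization with stored per-feature means and standard deviations, that the column of ones is prepended so that `coefficients[0]` acts as the bias, and that a single probability P(Y=1) is returned per sample. The function name and all argument names are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_sketch(X, means, stds, coefficients):
    """Return P(Y=1) for each row of X."""
    probabilities = []
    for row in X:
        # 1. Scale the features (assumed here: standardization).
        scaled = [(x - m) / s for x, m, s in zip(row, means, stds)]
        # 2. Add the column of ones (assumed here: prepended, for the bias term).
        with_bias = [1.0] + scaled
        # 3. Multiply by the coefficients to get the logit.
        logit = sum(w * v for w, v in zip(coefficients, with_bias))
        # 4. Apply the sigmoid to turn the logit into a probability.
        probabilities.append(sigmoid(logit))
    return probabilities

probs = predict_proba_sketch(
    X=[[1.0, 2.0], [3.0, 4.0]],
    means=[2.0, 3.0],
    stds=[1.0, 1.0],
    coefficients=[0.0, 1.0, 1.0],
)
print([round(p, 4) for p in probs])  # [0.1192, 0.8808]
```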
score(X, y)
The function calculates the accuracy and loss of a binary classification model using TensorFlow's binary cross-entropy loss function.
Parameters
X : np.ndarray
The parameter X represents the input data or features that will be used to make predictions or
classifications. It could be a matrix or an array-like object
y : np.ndarray
The parameter `y` represents the target variable or the dependent variable. It is the variable
that we are trying to predict or model. In the context of machine learning, `y` typically
represents the labels or classes of the data
Returns
tuple(float, float)
The function returns a tuple containing the accuracy and loss of the model on the given dataset.
set_params(params)
The function sets the parameters for a machine learning model, including learning rate, number of epochs, regularization strength, batch size, early stopping patience, regularization method, and tolerance.
Parameters
params : dict
A dictionary containing the values of the learning_rate, num_epochs, reg_strength, batch_size,
early_stopping_patience, regularization, and tolerance attributes.
sigmoid(z)
The sigmoid function returns 1 / (1 + exp(-z)) for the input value z.
The sigmoid function is a mathematical function that maps any real value to a value between 0 and 1. It is a non-linear function whose output is used as a probability in binary classification tasks.
Parameters
z : tf.Tensor
The parameter `z` is a tensor representing the input to the sigmoid function. It can be a 1D or
2D tensor.
Returns
tf.Tensor
The result of applying the sigmoid function to `z`.
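For a single scalar input, the sigmoid can be sketched in plain Python as below. This is an illustration only: the documented method operates on `tf.Tensor` values, and the branch on the sign of `z` is an addition made here for numerical stability, not something the documentation states.

```python
import math

def sigmoid_scalar(z):
    """1 / (1 + exp(-z)), computed without overflowing for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so use the equivalent exp(z) / (1 + exp(z)).
    ez = math.exp(z)
    return ez / (1.0 + ez)

print(sigmoid_scalar(0.0))      # 0.5
print(sigmoid_scalar(-1000.0))  # 0.0
print(sigmoid_scalar(1000.0))   # 1.0
```

The naive formula would raise `OverflowError` for `z = -1000.0`, because `math.exp(1000.0)` does not fit in a float.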
softmax(z)
The softmax function takes a vector of values and returns a vector of probabilities that sum to 1.
The softmax function is a mathematical function that normalizes a vector of real numbers into a probability distribution in which each probability is proportional to the exponential of the corresponding input number.
Parameters
z : tf.Tensor
The parameter `z` is a tensor representing the input to the softmax function. It can be a 1D or
2D tensor.
Returns
tf.Tensor
The result of applying the softmax function to `z`.
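For a single 1D input, the softmax can be sketched in plain Python as below. This is an illustration only: the documented method operates on `tf.Tensor` values, and subtracting the maximum before exponentiating is a common stability trick assumed here, not something the documentation states. The result is unchanged by that shift because the common factor cancels in the ratio.

```python
import math

def softmax_vector(z):
    """Normalize a list of real numbers into probabilities that sum to 1."""
    shift = max(z)  # subtracting the maximum keeps exp() from overflowing
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

p = softmax_vector([1.0, 2.0, 3.0])
print([round(v, 4) for v in p])          # [0.09, 0.2447, 0.6652]
print(softmax_vector([1000.0, 1000.0]))  # [0.5, 0.5]
```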
train_test_split(X, y, test_size=0.2, random_state=None)
The `train_test_split` function splits the input data `X` and target variable `y` into training and
testing sets based on the specified test size and random state.
Parameters
X : np.ndarray
The parameter X represents the input features or data that you want to split into training and
testing sets
y : np.ndarray
The parameter `y` represents the target variable or the dependent variable in the supervised
learning problem. It is a np.ndarray containing the true values of the target variable for the
corresponding samples in the input data `X`
test_size : float, optional
The parameter `test_size` represents the proportion of the dataset that should be allocated to
the test set. The default value is 0.2, which means that 20% of the data will be used for testing
random_state : int, optional
The parameter `random_state` is used to set the random seed for reproducibility. By setting a
specific value for `random_state`, you can ensure that the random operations used to split the
data are consistent across different runs of the code.
Returns
np.ndarray
The function returns four np.ndarrays: X_train, X_test, y_train, and y_test. X_train and y_train
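A seeded split of this kind can be sketched in plain Python as below. This is an assumption about the general technique (shuffle the sample indices, then cut them once), not the library's implementation; the name `train_test_split_sketch` is hypothetical, and it works on lists rather than np.ndarrays so that it runs without NumPy.

```python
import random

def train_test_split_sketch(X, y, test_size=0.2, random_state=None):
    """Shuffle sample indices, then split X and y at the same positions."""
    if len(X) != len(y):
        raise ValueError("X and y must contain the same number of samples")
    indices = list(range(len(X)))
    # A dedicated Random instance makes the shuffle reproducible for a given seed.
    random.Random(random_state).shuffle(indices)
    num_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:num_test], indices[num_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

X = [[i, i * 10] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = train_test_split_sketch(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2
```

Using the same index order for `X` and `y` keeps every sample paired with its own label after the shuffle.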
classification.decision_tree
DecisionTree
Defines a class that implements a machine learning model using TensorFlow Decision Forests (TF-DF) for classification or regression tasks, including methods for loading datasets, training the model, making predictions, and evaluating the model's performance.
TensorFlow Decision Forests (TF-DF) is an open-source library for training, serving, and interpreting decision forest models, such as Random Forests and Gradient Boosted Trees, in TensorFlow. It supports classification, regression, ranking, and uplifting tasks.
Decision trees (DT) are a type of supervised learning algorithm that can be used for both classification and regression tasks. They are a popular choice for many machine learning problems because they are easy to understand and interpret, and they can be used to solve a wide variety of problems.
CART (Classification and Regression Trees) is a decision tree learning algorithm that, for classification, uses the Gini impurity measure to determine the best split at each node. A lower impurity after a split means that the resulting child nodes are closer to containing a single class.
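The Gini impurity used by CART for classification can be sketched in plain Python. The function names below are hypothetical and the code is an illustration of the measure, not part of the `DecisionTree` class or of TF-DF.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0.0 means a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Size-weighted average impurity of the two children of a candidate split."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

print(gini_impurity([0, 0, 1, 1]))       # 0.5
print(split_impurity([0, 0], [1, 1]))    # 0.0, a perfect split
print(split_impurity([0, 1], [0, 1]))    # 0.5, no better than the parent
```

At each node, the candidate split with the lowest weighted impurity is the one chosen.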
Random forests (RFs) are an ensemble learning method that combines multiple decision trees, each trained on a random sample of the data, to create a more powerful model. They are a popular choice for many machine learning problems because they can be used to solve a wide variety of problems, although they are harder to interpret than a single tree.
Gradient Boosted Trees (GBTs) are an ensemble learning method that can be used for both classification and regression tasks. The trees are added one at a time, and each new tree is trained to correct the errors of the trees before it.
evaluate()
The `evaluate` function evaluates a machine learning model on a test dataset and returns the
evaluation metrics.
Returns
The evaluation results of the model on the test dataset. It returns a dictionary with the evaluation
metrics as keys and their corresponding values.
fit(early_stopping_patience=5, learning_rate=0.001, momentum=0.9, _metrics=['accuracy'])
The `fit` function trains a model using a stochastic gradient descent optimizer with early stopping
and the specified hyperparameters.
Parameters
early_stopping_patience : int, optional
The number of epochs to wait before stopping the training process if the validation loss does
not improve. If the validation loss does not improve for the specified number of epochs,
training is stopped early. Defaults to 5.
learning_rate : float, optional
The step size at each iteration while training the model. It controls how much the model's
weights are updated based on the calculated gradients. A higher learning rate can result in
faster convergence but may cause the model to overshoot the optimal solution. A lower learning
rate can result in slower convergence but a more stable model. Defaults to 0.001.
momentum : float, optional
A hyperparameter used in optimization algorithms such as Stochastic Gradient Descent (SGD) to
accelerate convergence and help escape local minima. It determines the contribution of the
previous update to the current update of the model's weights. Defaults to 0.9.
_metrics : list, optional
A list of metrics used to evaluate the model's performance during training and validation,
such as accuracy, precision, recall, or F1 score. Defaults to `["accuracy"]`.
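How `learning_rate` and `momentum` interact in an SGD update can be sketched in plain Python. The sketch uses the classical momentum rule (velocity = momentum * velocity - learning_rate * gradient, then weight = weight + velocity); whether `fit` applies exactly this rule is an assumption, and the function name `sgd_momentum_step` is hypothetical.

```python
def sgd_momentum_step(weights, gradients, velocity, learning_rate=0.001, momentum=0.9):
    """One SGD update with classical momentum; returns (new_weights, new_velocity)."""
    new_velocity = [
        momentum * v - learning_rate * g for v, g in zip(velocity, gradients)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

# Minimize f(w) = w^2, whose gradient is 2w, starting from w = 1.0.
weights, velocity = [1.0], [0.0]
for _ in range(200):
    gradients = [2.0 * w for w in weights]
    weights, velocity = sgd_momentum_step(
        weights, gradients, velocity, learning_rate=0.1, momentum=0.9
    )
print(abs(weights[0]) < 1e-2)  # True
```

With `momentum=0.0` the velocity term reduces to a plain gradient step, which is a quick way to see what the previous-update contribution adds.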
get_params()
The `get_params` function returns the configuration of the model.
Returns
The configuration of the model.
info()
The `info` function returns a summary of the model.
Returns
A summary of the model.
load_dataset(dataset_df, label, test_ratio=0.2)
The `load_dataset` function takes a dataset DataFrame, splits it into train, validation, and test
sets, converts them into TensorFlow datasets, and calculates class weights.
Parameters
dataset_df : pandas.DataFrame
The dataset to load, as a pandas DataFrame. It should have the features as columns and the
corresponding labels as a separate column.
label : str
The column name of the target variable in the dataset, that is, the variable that you want to
predict or classify.
test_ratio : float, optional
The proportion of the dataset that is split off into the test set. The remaining portion of the
dataset is used for training and validation. Defaults to 0.2.
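The documentation does not say how `load_dataset` calculates class weights. One common convention, shown here purely as an assumption, is the "balanced" weighting `num_samples / (num_classes * class_count)`, which gives rarer classes a larger weight. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency in `labels`."""
    counts = Counter(labels)
    num_samples = len(labels)
    num_classes = len(counts)
    return {c: num_samples / (num_classes * count) for c, count in counts.items()}

weights = balanced_class_weights([0, 0, 0, 1])
print(weights[1])  # 2.0
```

With this convention a perfectly balanced dataset yields a weight of 1.0 for every class.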
predict(length=3, split='test')
The `predict` function takes a `length` and a `split` parameter, and returns the predictions made
by the model on the specified dataset split.
Parameters
length : int, optional
The number of samples taken from the dataset and used for prediction. Defaults to 3.
split : str, optional
Which dataset split to use for prediction. If `split` is set to `"test"`, the function uses the
test dataset, and if it is set to `"train"`, the function uses the train dataset. Defaults to
`"test"`.
Returns
The predictions made by the model on the specified dataset (either the test dataset or the train
dataset).