Classification

 

Introduction

Classification is a machine learning technique that builds a model to predict a discrete value for each row in your data. These models predict outcomes such as yes/no, true/false, or other binary values. Trying several classification algorithms and comparing their performance is recommended.

Overview

In Lumenore’s “Do You Know”, the Classification process involves two steps. Initially, you build a model by specifying input and target columns from an uploaded dataset. Once the model is created, it can be applied to another dataset that shares the same input columns, predicting the target column values.

Note: Ensure the schema contains at least two variables for classification analysis.

Below are the steps to perform classification analysis:

Step 1: After accessing “Do You Know”, select “Classification”.

Step 2: Now, click on “Create New Insight-Trend”.

Step 3: Choose a Schema and proceed by clicking on “Next”.

Note: The schema is the dataset used for the analysis. If none exists, create one, ensuring the prerequisites (a KPI, a Date, and an Attribute) are met.

Step 4: Select the following:

  • Select insight approach: Two options are available: “Build a new machine learning model” or “Reuse an existing model.” To create a new model, select the first option. If you have previously built a classification model, choose the second option to generate predictions using that model.

For now, we select “Build a new machine learning model”.

  • Select model: This field is disabled for “Build a new machine learning model” and enabled for “Reuse an existing model.”
  • Select output variable: Choose the output variable for the classification model. Ensure it contains exactly two distinct values (a binary variable).
  • Select unique identifiers: The chosen variable(s) uniquely identify rows and are not used to build the model.
  • Select input variables: Choose the input variable(s) used to build the classification model.
  • Algorithm: Choose a classification algorithm to build your model. Do You Know offers widely used ML algorithms such as Logistic Regression, Decision Tree, Random Forest, and XG Boost; a short code sketch comparing them appears after this list.
    • Logistic Regression: Logistic regression is a statistical technique used for binary classification, where the aim is to predict one of two possible outcomes based on input features. This supervised machine learning algorithm estimates the probability that an input belongs to a particular class. The output is a probability value between 0 and 1, where values near 0 represent low probability and those near 1 represent high probability. Typically, a threshold, often set at 0.5, classifies data points based on predicted probabilities. Logistic regression is known for its simplicity, interpretability, and scalability to handle large datasets. It can also be extended to solve multi-class classification problems through methods like one-vs-rest or multinomial logistic regression.
    • Decision Tree: The decision tree algorithm is employed for both classification and regression tasks. It constructs a tree-like structure by recursively dividing the data based on input feature values to make decisions or forecasts. Decision trees are advantageous for their interpretability and ability to handle categorical and continuous features. However, they can be susceptible to overfitting, especially as the tree complexity increases.
    • Random Forest: Random Forest is an ensemble learning technique used for classification and regression tasks. It creates an ensemble of decision trees by training each tree on random subsets of both the training data and features. This approach introduces randomness into the model, mitigating overfitting, which is a common problem with individual decision trees.
    • XG Boost: XGBoost, short for eXtreme Gradient Boosting, is a robust gradient boosting framework popular for its performance in machine learning competitions and real-world applications. It employs an optimized implementation of gradient boosting, sequentially constructing decision tree ensembles. Each subsequent tree aims to rectify the mistakes of its predecessor. XGBoost focuses on optimized tree construction and regularization techniques to prevent overfitting.
  • Split ratio: The train-test split is used to assess a model’s performance. The data is divided into two sets: a training set, which trains the model, and a testing set, which evaluates its performance (see the sketch after this list).

Note: A 70:30 split means that 70% of the data is used for training and the remaining 30% is reserved for testing. The split ratio is typically chosen based on the dataset’s size and the model’s complexity. Training and evaluating on distinct datasets helps reveal overfitting, a scenario where a model fits the training data too closely and performs poorly on new, unseen data.

  • Do you want to add filters?: Optionally add filters to limit the data used in the analysis.
  • Advanced settings (Optional): A high-cardinality column contains a very large number of distinct values, which is unfavourable for building classification models; likewise, low-variance (near-constant) columns add little information. These settings let you exclude such columns from the dataset (a short column-screening sketch also appears after this list).
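
The options above map onto a standard machine-learning workflow. The sketch below is a minimal, illustrative comparison written with scikit-learn and the xgboost package, not Lumenore’s internal implementation: it uses a built-in sample dataset as a stand-in for an uploaded schema with a binary output variable, applies a 70:30 train-test split, and reports test accuracy for each of the four algorithms.

```python
# Minimal sketch (not Lumenore's internal code): a 70:30 train-test split
# followed by fitting and comparing the four algorithm families offered
# by Do You Know. Assumes scikit-learn and xgboost are installed; the
# breast-cancer dataset stands in for a schema with a binary target.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

# Binary output variable: every row is labelled 0 or 1.
X, y = load_breast_cancer(return_X_y=True)

# 70:30 split ratio: 70% of the rows train the model, 30% evaluate it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=5000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "XGBoost": XGBClassifier(n_estimators=200, eval_metric="logloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)                               # learn from the training set
    accuracy = accuracy_score(y_test, model.predict(X_test))  # score on unseen rows
    print(f"{name}: test accuracy = {accuracy:.3f}")

# Logistic regression outputs probabilities between 0 and 1; a 0.5
# threshold turns them into class labels, as described above.
probabilities = models["Logistic Regression"].predict_proba(X_test)[:, 1]
predicted_labels = (probabilities >= 0.5).astype(int)
```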
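
For the column screening described under the advanced settings, the sketch below shows one way such a screen might work using pandas; the thresholds and the helper name screen_columns are illustrative assumptions, not the values or names Lumenore uses.

```python
# Hedged sketch: drop very-high-cardinality and near-constant columns
# before model building. The thresholds are illustrative assumptions.
import pandas as pd

def screen_columns(df: pd.DataFrame,
                   max_cardinality_ratio: float = 0.9,
                   min_variance: float = 1e-4) -> pd.DataFrame:
    keep = []
    for col in df.columns:
        distinct_ratio = df[col].nunique() / len(df)
        if distinct_ratio > max_cardinality_ratio:
            continue  # almost every row has its own value -> high cardinality
        if pd.api.types.is_numeric_dtype(df[col]) and df[col].var() < min_variance:
            continue  # the column is (nearly) constant -> low variance
        keep.append(col)
    return df[keep]

# Example: "id" is dropped for high cardinality, "constant" for low variance.
frame = pd.DataFrame({
    "id": range(1000),
    "constant": [1.0] * 1000,
    "feature": [i % 7 for i in range(1000)],
})
print(screen_columns(frame).columns.tolist())  # ['feature']
```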

Click “Next”. 

Step 5: Users can tailor the insight narrative, which outlines all the variables used to create the insight. Then, click on “Save”.

Step 6: Name the insight for future access (default suggestion provided) and save it.

Step 7: A new window appears; click “Execute Now” to generate insights.

 

Output (Insights)

To aid understanding, we built a classification model with Department as the output variable, gender as a unique identifier, education as the input variable, and logistic regression as the algorithm. Using an 80:20 train-test split, the model classifies each record into a department based on education.

 

Once the classification analysis has been run with the newly created model, users can reuse the same model to generate predictions for similar data.
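
Conceptually, reusing a model means persisting the trained model and applying it to a new dataset that shares the same input columns. The sketch below illustrates the idea with scikit-learn and joblib; the file name and workflow are illustrative assumptions, not how Lumenore stores models internally.

```python
# Conceptual sketch of model reuse (illustrative only). The file name
# "classifier.joblib" is an assumption made for this example.
from joblib import dump, load
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# First run: build the model and persist it for later reuse.
model = LogisticRegression(max_iter=5000).fit(X, y)
dump(model, "classifier.joblib")

# Later run: load the saved model and predict the target column for a
# dataset that shares the same input columns (here, the same X).
reused_model = load("classifier.joblib")
predictions = reused_model.predict(X)
print(predictions[:10])
```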

Reuse an existing model

Step 1: After choosing a schema, select “Reuse an existing model” as the insight approach, then select the previously created classification model.

Click “Next”.

Step 2: Users can tailor the insight narrative, which outlines all the variables used to create the insight. Then, click on “Save”.

Step 3: Name the insight for future access (default suggestion provided) and save it.

Step 4: A new window appears; click “Execute Now” to generate insights.

 

Output (Insights)

The classification insight is displayed, showing the predicted values for the classified variable.

Click on the three dots located at the top right corner, then select “Convert to grid”. This action will display the classified data along with the predicted values.