Regression is a machine learning technique that predicts the value of a numerical target variable from one or more input variables (columns). In essence, it makes predictions about new cases based on historical data.

For instance, suppose you have a dataset of houses that includes their sizes and corresponding prices. Using regression, you can build a model that takes the size of a house as input and estimates its price as output.

The model can then forecast prices for other houses from their sizes. In effect, it is a tool that predicts new outcomes from patterns in past data. Regression can also use multiple input variables to predict the target variable.
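The house example can be sketched in a few lines of Python. This is a minimal illustration, not how Do You Know builds its models: it assumes a straight-line relationship between size and price, and the function names and the sizes and prices are hypothetical, made up for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of a straight line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def predict(m, b, x):
    """Estimate the output for a new input using the fitted line."""
    return m * x + b


# Hypothetical historical data: house size (sq ft) and sale price.
sizes = [1000, 1500, 2000, 2500]
prices = [200000, 290000, 410000, 500000]

m, b = fit_line(sizes, prices)      # learn from past data
estimate = predict(m, b, 1800)      # forecast for a house not in the data
print(f"slope={m:.2f}, intercept={b:.2f}, estimate={estimate:.2f}")
```

Here the historical rows play the role of training data, and the 1,800 sq ft house is a new case whose price the model estimates.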

Steps to apply regression:

Step 1: After accessing “Do You Know”, select “Regression”.

Step 2: Now, click on “Create New Insight-Trend”.

Step 3: Choose a Schema and proceed by clicking on “Next”.

Note: The schema is the dataset to be analyzed. If none exists, create one, making sure the prerequisites (a KPI, a Date, and an Attribute) are met.

Step 4: The user must make the following selections:

    • Choose the insight approach: There are two options available: “Build a new machine learning model” and “Reuse already available model.” Choose the first option to construct a model. If a model has already been developed in Regression, select the second option to generate forecasts using that model.
    • Select the model: This feature is disabled for “Build a new machine learning model” but becomes enabled for “Reuse already available model.”
    • Output variable: Choose the output variable for the regression model.
    • Unique Identifier(s): These selected variable(s) are not utilized for model development but serve to uniquely identify rows.
    • Choose input variable(s): Specify the input variable(s) used to construct the regression model.
    • Algorithm Selection: Pick a regression algorithm to develop your model. Do You Know offers several widely used ML algorithms, such as:
      • Linear Regression: Predicts a continuous numerical output variable from one or more input features by finding the best-fitting linear relationship between them. The equation for a simple linear regression model with one input feature is y = mx + b, where y is the output variable, x is the input feature, m is the slope of the line, and b is the y-intercept.
      • Decision Tree: A supervised learning algorithm for classification and regression tasks, which constructs a tree-like structure by recursively splitting data based on input feature values to make decisions or forecasts. It’s interpretable and handles categorical and continuous features but can be prone to overfitting with complex trees.
      • Random Forest: An ensemble learning algorithm used for both classification and regression. It builds multiple decision trees and combines their outputs into a single forecast, and it mitigates overfitting by training each tree on a random subset of the data and features.
      • XG Boost: A powerful gradient boosting framework known for high performance. It builds decision trees sequentially, each new tree correcting the errors of the trees before it, and applies regularization to help prevent overfitting.
    • Split Ratio (Train: Test) Selection: The train/test split is a technique for evaluating a model’s performance. The data is divided into a training set, used to train the model, and a testing set, used to evaluate it; the split ratio sets the share of rows that goes to each.

Click on “Save”. 
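To make the Split Ratio (Train: Test) setting concrete, the sketch below shows one common way to divide rows into training and testing sets. It is an assumption for illustration only: the function name, the seeded shuffle, and the stand-in rows are hypothetical, and the source does not describe how Do You Know performs the split internally.

```python
import random


def train_test_split(rows, train_ratio=0.8, seed=42):
    """Shuffle rows reproducibly, then cut them into train and test sets."""
    if not 0 < train_ratio < 1:
        raise ValueError("train_ratio must be between 0 and 1")
    shuffled = list(rows)                   # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)   # seeded for repeatable results
    cut = int(len(shuffled) * train_ratio)  # number of training rows
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten dataset rows
train, test = train_test_split(rows, train_ratio=0.8)
print(len(train), "training rows,", len(test), "testing rows")
```

Every row lands in exactly one of the two sets, so the model is always evaluated on rows it did not see during training.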

Step 5: Customize the insight narrative, which outlines all the variables used to build the insight. Then, click on “Save”.

Step 6: Name the insight for future access (default suggestion provided) and save it.

Step 7: A new window appears; click “Execute Now” to generate insights.


Output (Insights)

For illustration, we built a regression model with profit as the output variable, customer ID as the unique identifier, and sales as the input variable. The model uses linear regression as the algorithm and an 80% training and 20% testing split ratio (Train: Test) to analyze the relationship between profit and sales.
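The same configuration can be approximated outside the tool as a rough sketch. Everything below is an assumption for illustration: the rows are invented, the variable and function names are hypothetical, and root-mean-square error (RMSE) is simply one common way to score a regression model on the testing set, not necessarily the metric Do You Know reports.

```python
import math
import random

# Hypothetical rows: (customer_id, sales, profit). Values are illustrative only.
rows = [
    ("C001", 100, 12), ("C002", 150, 20), ("C003", 200, 31),
    ("C004", 250, 38), ("C005", 300, 52), ("C006", 350, 58),
    ("C007", 400, 71), ("C008", 450, 79), ("C009", 500, 92),
    ("C010", 550, 98),
]

# 80:20 split (Train: Test) after a seeded shuffle.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
cut = int(len(shuffled) * 0.8)
train, test = shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Least-squares slope and intercept for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


# The customer ID identifies rows but is deliberately left out of the fit.
m, b = fit_line([(sales, profit) for _, sales, profit in train])

# Score the model on the held-out testing rows only.
rmse = math.sqrt(
    sum((profit - (m * sales + b)) ** 2 for _, sales, profit in test) / len(test)
)

# Predictions keyed by customer ID, so each forecast maps back to its row.
predictions = {cid: m * sales + b for cid, sales, _ in test}
print(f"profit = {m:.3f} * sales + {b:.3f}, test RMSE = {rmse:.3f}")
```

The customer ID appears only as the key of the predictions, which mirrors the role of a unique identifier: it labels rows without influencing the model.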