Generate Simulated Dataset for Regression Problems


Imagine that you want to replicate a dataset that follows a specific behavior. You already know that the data should follow a quadratic equation with its minimum at x = 500 and only positive values. Is there any way to simulate such a scenario with Python?

This tutorial shows how to create a sample dataset for regression problems. I will cover linear regression and non-linear regression equations (polynomial, exponential…). I will continue updating this tutorial with new regression problems in the future. You can find the notebook for this tutorial on my GitHub account.

Linear Regression

Simple Regression

As you might know, linear regression is based on linear equations with the following form:

y = ax + b

where a is the slope and b is the y-intercept (the point where the line crosses the y-axis).
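Before turning to a library helper, the equation itself can be sampled directly. The following is a minimal sketch, assuming illustrative values for a, b, and the noise level (none of them come from this tutorial); the names x_lin and y_lin are hypothetical.

```python
import numpy as np

# Hypothetical values, chosen only for illustration.
a, b = 2.5, 10.0                      # slope and y-intercept
rng = np.random.default_rng(12345)    # seeded generator for reproducibility

x_lin = np.linspace(0, 100, 200)                       # 200 evenly spaced x values
noise = rng.normal(loc=0.0, scale=5.0, size=x_lin.shape)  # Gaussian noise, sd = 5
y_lin = a * x_lin + b + noise                          # noisy samples around y = ax + b

# A least-squares line through the samples should land close to a and b.
slope, intercept = np.polyfit(x_lin, y_lin, 1)
print(slope, intercept)
```

Fitting a line back to the samples is a quick way to check that the simulated data really follows the intended equation.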

To build a linear equation, one option is to use the make_regression() function from the scikit-learn library to create samples of X and Y. The main parameters you can pass to this function are:

  • n_samples: number of samples
  • n_features: number of features (input variables)
  • n_informative: number of informative features, that is, the features used to build the linear model that generates the output
  • n_targets: number of regression targets
  • noise: standard deviation of the Gaussian noise applied to the output
  • random_state: the seed to control the randomness of the output

You can take a look at the rest of the parameters in the scikit-learn documentation.

First of all, let’s import the libraries:

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression

Now, let’s apply the make_regression() function:

x, y = datasets.make_regression(n_samples = 200, n_features = 1,
                                n_informative = 1, n_targets = 1,
                                noise = 20, random_state=12345, effective_rank=None)
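If you also want to know the slope that was used to generate the data, make_regression() accepts coef=True and then returns the underlying coefficient as a third value. A short sketch, using hypothetical variable names (x_demo, y_demo, true_coef) so it does not overwrite the x and y above:

```python
from sklearn.datasets import make_regression

# Same settings as above, plus coef=True to also get the generating coefficient.
x_demo, y_demo, true_coef = make_regression(
    n_samples=200, n_features=1, n_informative=1, n_targets=1,
    noise=20, random_state=12345, coef=True,
)

print(x_demo.shape)      # (200, 1): one column per feature
print(y_demo.shape)      # (200,): one target value per sample
print(float(true_coef))  # slope of the noise-free linear model
```

Note that x comes back as a two-dimensional array, even with a single feature.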

If you want to control the range of the X and Y values, you can use the NumPy function np.interp(), specifying the new minimum and maximum for each one:

x = np.interp(x, (x.min(), x.max()), (3876, 15678))
y = np.interp(y, (y.min(), y.max()), (1678, 5435))
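To see what np.interp() does here, it helps to try it on a tiny array. The helper below is a hypothetical wrapper (rescale is not a NumPy function) that maps the smallest value to new_min, the largest to new_max, and everything else linearly in between:

```python
import numpy as np

def rescale(values, new_min, new_max):
    """Linearly map values from [values.min(), values.max()] to [new_min, new_max]."""
    values = np.asarray(values, dtype=float)
    return np.interp(values, (values.min(), values.max()), (new_min, new_max))

small = np.array([0.0, 5.0, 10.0])
print(rescale(small, 100, 200))  # [100. 150. 200.]
```

Because the mapping is linear, the shape of the relationship between X and Y is preserved; only the units change.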

Let’s plot the function to see what it looks like:

[Figure: Linear Equation]
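The plotting code is not shown in this post. A minimal sketch that should produce a similar scatter plot, assuming Matplotlib and the same generation and rescaling steps as above (x_plot, y_plot, fig, and ax are hypothetical names):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression

# Recreate the data so this block runs on its own.
x_plot, y_plot = make_regression(n_samples=200, n_features=1, n_informative=1,
                                 n_targets=1, noise=20, random_state=12345)
x_plot = np.interp(x_plot, (x_plot.min(), x_plot.max()), (3876, 15678))
y_plot = np.interp(y_plot, (y_plot.min(), y_plot.max()), (1678, 5435))

# Scatter plot of the simulated samples.
fig, ax = plt.subplots()
ax.scatter(x_plot.ravel(), y_plot, s=10)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Linear Equation")
```

In a notebook the figure is displayed automatically; in a script, call plt.show() afterwards.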

Non-linear Regression

Polynomial Regression

In the case of polynomial regression, we need a more involved approach. For this example, I will build a cubic equation. So, we are looking for an equation that has the following form:

y = ax^3 + bx^2 + cx + d

First, let’s choose the x coordinates of the local maximum and the local minimum of the function we want. These two values are the roots of its first derivative.

To build my function for this specific exercise, I will use max_x = 3000 and min_x = 5000.

Then, write the first derivative in factored form based on those values. In this case, it will be the following one:

f'(x) = (x – 3000)(x – 5000)

After expanding the product, we have the following equation:

f'(x) = x^2 – 8000x + 1.5·10^7
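The expansion can be double-checked with SymPy. This is only a quick verification sketch; x_sym is a hypothetical name used to avoid clashing with other variables in the notebook.

```python
from sympy import expand, symbols

x_sym = symbols('x')

# Expand the factored first derivative built from the chosen roots.
derivative = expand((x_sym - 3000) * (x_sym - 5000))
print(derivative)  # x**2 - 8000*x + 15000000
```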

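As I said, this is the first derivative. To get our cubic equation, we need to integrate it, which can be done symbolically with the SymPy library. Let’s import what we need:

# Import only the SymPy names used below, instead of "from sympy import *"
from sympy import Symbol, init_printing, integrate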

Now, I will create the symbol for X as it will be our unknown factor:

init_printing(use_unicode=False, wrap_line=False)
x = Symbol('x')

In the next step, I will integrate the first derivative to get the cubic equation:

# An integer constant keeps the coefficients exact (1/3 instead of 0.333...)
integrate(x**2 - 8000*x + 15000000, x)

This is the result of the integral that will be the base to build our cubic equation:

y = (1/3)x^3 − 4000x^2 + 15000000x
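As a sanity check, the cubic can be differentiated again to confirm that its critical points sit at the chosen x values, and the sign of the second derivative tells which one is the maximum and which the minimum. A sketch with SymPy, using hypothetical names (x_sym, cubic, critical_points):

```python
from sympy import Rational, Symbol, diff, solve

x_sym = Symbol('x')
cubic = Rational(1, 3) * x_sym**3 - 4000 * x_sym**2 + 15000000 * x_sym

first = diff(cubic, x_sym)    # x**2 - 8000*x + 15000000
second = diff(first, x_sym)   # 2*x - 8000

# Roots of the first derivative are the critical points.
critical_points = sorted(solve(first, x_sym))

for point in critical_points:
    # Negative second derivative: local maximum; positive: local minimum.
    curvature = second.subs(x_sym, point)
    kind = "maximum" if curvature < 0 else "minimum"
    print(point, kind)
# 3000 maximum
# 5000 minimum
```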

This equation can be transformed in two ways without changing the x positions of its maximum and minimum:

  • The first one is changing the d value (the constant of integration, which is 0 in the result above). This will move the function up and down along the y-axis.
  • The second one is to multiply or divide the whole function by a constant to stretch or flatten it.
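Both transformations leave the x positions of the maximum and minimum where they are. A small numerical sketch of that idea, assuming arbitrary illustrative values for the scale factor and the shift (cubic, x_grid, scale, and shift are hypothetical names):

```python
import numpy as np

def cubic(x, a, b, c, d):
    """Evaluate a*x^3 + b*x^2 + c*x + d."""
    return a * x**3 + b * x**2 + c * x + d

x_grid = np.arange(0, 7001, 1.0)
base = cubic(x_grid, 1/3, -4000, 15000000, 0)

# Illustrative transformation: flatten by a constant, then shift up.
scale, shift = 1_000_000, 100
transformed = base / scale + shift

# Look for the local maximum inside a window around x = 3000.
window = x_grid <= 4000
print(x_grid[window][np.argmax(base[window])])         # 3000.0
print(x_grid[window][np.argmax(transformed[window])])  # 3000.0
```

The y values change, but the peak stays at the same x.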

Let’s plot the resultant equation to see what it looks like by defining a function:

def plot_me(a, b, c, d):
    # Evaluate y = a*x^3 + b*x^2 + c*x + d on a dense grid of x values
    x = np.arange(0, 7000, 0.05)
    y = a*x**3 + b*x**2 + c*x + d  # vectorized with NumPy
    plt.plot(x, y, label='cubic', linestyle='-')

plot_me(1/3, -4000, 15000000, 0)

[Figure: Cubic Equation]

Let’s apply some transformations to the equation so that the Y values fall between 0 and 5000. To do so, I will divide the whole equation by 15000000. Also, I will move it 3000 units up (this means d = 3000).

plot_me((1/3)/15000000, -4000/15000000, 15000000/15000000, 3000)

[Figure: Cubic Equation Modeled]
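The curve so far is noise-free. To turn it into a sample dataset for a regression problem, one option is to draw random x values and add Gaussian noise to the corresponding y values. This is a sketch under assumptions: the sample size, the noise level, and the variable names are illustrative and not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(12345)

# Coefficients of the modeled cubic (divided by 15000000, shifted up by 3000).
scale = 15000000
a, b, c, d = (1/3) / scale, -4000 / scale, 15000000 / scale, 3000

# Assumed sampling choices: 200 points, uniform over the plotted x range.
x_data = rng.uniform(0, 7000, size=200)
y_clean = a * x_data**3 + b * x_data**2 + c * x_data + d

# Assumed noise level: standard deviation of 50 units on y.
y_data = y_clean + rng.normal(loc=0.0, scale=50.0, size=x_data.shape)

print(y_clean.min(), y_clean.max())
```

On this x range the noise-free values stay between 3000 and 5000, so the noisy samples remain close to the target range.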

In the next post, I will share how to apply a similar approach to exponential and logarithmic equations.
