Imagine that you want to replicate a dataset that follows a specific behavior. You already know that the data should follow a quadratic equation with positive values and a minimum at x = 500. Is there a way to simulate such a scenario with Python?
This tutorial shows how to create a sample dataset for regression problems. I will cover linear regression and non-linear regression equations (polynomial, exponential…). I will keep updating this tutorial with new regression problems in the future. You can find the notebook for this tutorial on my GitHub account.
As you might know, linear regression is based on linear equations with the following form:
y = ax + b
where a is the slope and b is the y-intercept (the value of y where the line crosses the y-axis).
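As a quick illustration of this form, the sketch below builds noisy points around a line by hand with NumPy and then recovers the slope and intercept with np.polyfit(). The values a = 2.5 and b = 10, the noise level, and the variable names are arbitrary assumptions chosen for the example.

```python
import numpy as np

# Hypothetical values chosen only for illustration
a, b = 2.5, 10.0

rng = np.random.default_rng(0)           # fixed seed for reproducibility
x_line = np.linspace(0, 100, 200)        # 200 evenly spaced x values
noise = rng.normal(0, 5.0, x_line.size)  # Gaussian noise with std = 5
y_line = a * x_line + b + noise

# Fit a degree-1 polynomial; polyfit returns [slope, intercept]
slope, intercept = np.polyfit(x_line, y_line, 1)
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitted values should land close to the a and b used to generate the points, which is a simple way to sanity-check any simulated dataset.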
To build a linear equation, one option is to use the make_regression() function from the scikit-learn library to create samples of X and Y. The main parameters you can pass to this function are:
- n_samples: number of samples
- n_features: number of variables
- n_informative: number of informative variables to create the output
- n_targets: number of regression targets
- noise: standard deviation of the Gaussian noise applied to the output
- random_state: the seed to control the randomness of the output
You can take a look at the rest of the parameters in the scikit-learn documentation.
First of all, let’s import the libraries:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
Now, let’s apply the make_regression() function:
x, y = datasets.make_regression(n_samples = 200, n_features = 1, n_informative = 1, n_targets = 1, noise = 20, random_state=12345, effective_rank=None)
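To see how make_regression() relates to y = ax + b, one option is to ask it for the underlying coefficient with coef=True. The sketch below sets noise to 0 (an assumption made only so the relationship is exact) and checks that y is the feature multiplied by the returned coefficient. The variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the coefficient of the underlying linear model
X_demo, y_demo, coef = make_regression(
    n_samples=200,
    n_features=1,
    n_informative=1,
    n_targets=1,
    noise=0.0,          # no noise, so the points fall exactly on the line
    random_state=12345,
    coef=True,
)

slope_demo = float(np.ravel(coef)[0])  # single feature, so a single slope

print(X_demo.shape, y_demo.shape)  # X is 2-D (samples, features), y is 1-D
print("slope:", slope_demo)

# With the default bias of 0.0, y equals slope * x
print(np.allclose(y_demo, X_demo[:, 0] * slope_demo))
```

Raising noise back to 20, as in the call above, scatters the points around that same line.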
If you want to control the range of the X and Y values, you can use a NumPy function called np.interp(), specifying the new minimum and maximum for each one:
x = np.interp(x, (x.min(), x.max()), (3876, 15678))
y = np.interp(y, (y.min(), y.max()), (1678, 5435))
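When it is given only two reference points, np.interp() acts as a linear min-max rescaling. The sketch below, on a small made-up array, shows that it matches the usual rescaling formula. The array values and the helper name rescale are assumptions for illustration.

```python
import numpy as np

def rescale(values, new_min, new_max):
    """Linearly map values from [values.min(), values.max()] to [new_min, new_max]."""
    old_min, old_max = values.min(), values.max()
    return new_min + (values - old_min) * (new_max - new_min) / (old_max - old_min)

raw = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])  # made-up sample values

via_interp = np.interp(raw, (raw.min(), raw.max()), (3876, 15678))
via_formula = rescale(raw, 3876, 15678)

print(via_interp)
print(np.allclose(via_interp, via_formula))
```

Because the mapping is linear, the shape of the relationship between x and y is preserved and only the units change.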
Let’s plot the function to see what it looks like:
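A minimal sketch of such a plot, assuming a simple Matplotlib scatter is enough. It regenerates the dataset with the same settings so that it runs on its own; the names x_lin and y_lin and the axis labels are my own choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_regression

# Recreate the dataset with the same settings used above
x_lin, y_lin = make_regression(n_samples=200, n_features=1, n_informative=1,
                               n_targets=1, noise=20, random_state=12345)

# Rescale both variables to the ranges chosen above
x_lin = np.interp(x_lin, (x_lin.min(), x_lin.max()), (3876, 15678))
y_lin = np.interp(y_lin, (y_lin.min(), y_lin.max()), (1678, 5435))

fig, ax = plt.subplots()
ax.scatter(x_lin.ravel(), y_lin, s=10)  # x is 2-D, so flatten it for plotting
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Simulated linear dataset")
fig.tight_layout()
```

In a notebook the figure renders on its own; in a script, finish with plt.show().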
In the case of polynomial regression, we need a more involved approach. For this example, I will build a cubic equation. So, we are looking for an equation that has the following form:
y = ax^3 + bx^2 + cx + d
First, let’s choose the x coordinates of the maximum and the minimum that we want the curve to have. These are the points where the first derivative of our desired function is zero, so we can use them to build that derivative.
For this specific exercise, I will use max_x = 3000 and min_x = 5000.
Then, write the derivative in factored form based on those values. In this case, it is the following:
f'(x) = (x − 3000)(x − 5000)
After expanding, we have the following equation:
f'(x) = x^2 − 8000x + 1.5·10^7
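The expansion can be double-checked with SymPy, along with which root is the maximum and which is the minimum. This is only a verification sketch; the name x_sym is a hypothetical choice to avoid clashing with other variables called x.

```python
import sympy as sp

x_sym = sp.Symbol("x")

factored = (x_sym - 3000) * (x_sym - 5000)
expanded = sp.expand(factored)
print(expanded)  # x**2 - 8000*x + 15000000

# The sign of the second derivative tells a maximum from a minimum
second = sp.diff(expanded, x_sym)
for point in sp.solve(expanded, x_sym):
    kind = "maximum" if second.subs(x_sym, point) < 0 else "minimum"
    print(point, kind)
```

The second derivative is negative at x = 3000 and positive at x = 5000, so the cubic has its local maximum at 3000 and its local minimum at 5000.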
As I said, this is the first derivative. To get our cubic equation, we need to integrate it, which SymPy can do symbolically. Let’s import the libraries:
import scipy as sp
from sympy import *
Now, I will create the symbol for x, as it will be the variable of our equation:
init_printing(use_unicode=False, wrap_line=False)
# Note: this reuses the name x, replacing the NumPy array from the linear example
x = Symbol('x')
In the next step, I will integrate the first derivative to get the cubic equation:
integrate(x**2 - 8000*x + 1.5*(10**7), x)
This is the result of the integral that will be the base to build our cubic equation:
y = (1/3)x^3 − 4000x^2 + 15000000x
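One way to check this result is to differentiate it again and confirm that we get the derivative back. The sketch below uses the integer 15000000 instead of 1.5*(10**7), an assumption made so that SymPy keeps exact fractions rather than floats. The name x_sym is hypothetical.

```python
import sympy as sp

x_sym = sp.Symbol("x")

derivative = x_sym**2 - 8000 * x_sym + 15_000_000  # exact integer coefficients
cubic = sp.integrate(derivative, x_sym)
print(cubic)  # x**3/3 - 4000*x**2 + 15000000*x

# Differentiating the result should give back the derivative we started from
print(sp.simplify(sp.diff(cubic, x_sym) - derivative) == 0)

# Heights of the local maximum (x = 3000) and the local minimum (x = 5000)
print(cubic.subs(x_sym, 3000), cubic.subs(x_sym, 5000))
```

SymPy omits the constant of integration, which is why the result has no d term yet (d = 0).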
We can apply two types of transformation to this equation without moving the x positions of its maximum and minimum:
- The first one is changing the d value (the constant term). This moves the function up or down along the y-axis.
- The second one is multiplying or dividing the whole function by a constant to stretch or flatten it.
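Both transformations can be sketched on the coefficient array with NumPy. The helper names transform and turning_points, and the example scale and shift values, are assumptions for illustration. The point of the sketch is that neither transformation moves the maximum or the minimum along the x-axis.

```python
import numpy as np

# Coefficients of (1/3)x^3 - 4000x^2 + 15000000x + 0, highest power first
base = np.array([1 / 3, -4000.0, 15_000_000.0, 0.0])

def transform(coeffs, scale=1.0, shift=0.0):
    """Divide the whole polynomial by `scale`, then add `shift` to the constant term."""
    out = coeffs / scale  # new array, the input is left untouched
    out[-1] += shift
    return out

def turning_points(coeffs):
    """x positions where the derivative of the polynomial is zero."""
    return np.sort(np.roots(np.polyder(coeffs)).real)

changed = transform(base, scale=1000.0, shift=250.0)  # arbitrary example values

print(turning_points(base))
print(turning_points(changed))
```

Both calls should report turning points at approximately x = 3000 and x = 5000; only the y values change.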
Let’s plot the resultant equation to see what it looks like by defining a function:
def plot_me(a, b, c, d):
    # Evaluate y = a*x^3 + b*x^2 + c*x + d on a fine grid of x values
    x = np.arange(0, 7000, 0.05)
    y = [(a*i**3 + b*i**2 + c*i + d) for i in x]
    plt.plot(x, y, label='cubic', linestyle='-')
    plt.grid(True)
    # Show the figure for 10 seconds, then close it
    plt.show(block=False)
    plt.pause(10)
    plt.close()

plot_me(1/3, -4000, 15000000, 0)
Let’s apply some transformations to the equation so that the Y values stay between 0 and 5000. To do so, I will divide the whole equation by 15,000,000. I will also move it 3000 units up (this means d = 3000).
plot_me((1/3)/15000000, -4000/15000000, 15000000/15000000, 3000)
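To turn this curve into a simulated dataset, one option is to sample x values and add Gaussian noise to the curve. This is a sketch under my own assumptions: the sample size of 500, the noise standard deviation of 25, and the variable names are not part of the steps above.

```python
import numpy as np

# Scaled cubic from the step above: divide by 15,000,000 and shift up by 3000
scale, d = 15_000_000, 3000
coeffs = np.array([(1 / 3) / scale, -4000 / scale, 15_000_000 / scale, d])

rng = np.random.default_rng(42)               # fixed seed for reproducibility
x_cubic = np.sort(rng.uniform(0, 7000, 500))  # 500 random x positions
y_clean = np.polyval(coeffs, x_cubic)         # exact values on the curve
y_noisy = y_clean + rng.normal(0, 25, x_cubic.size)  # assumed noise std = 25

print(f"clean y range: {y_clean.min():.1f} to {y_clean.max():.1f}")
print(f"noisy y range: {y_noisy.min():.1f} to {y_noisy.max():.1f}")
```

On the interval from 0 to 7000, the noise-free curve stays between 3000 and roughly 4556, so the values remain inside the 0 to 5000 range even after adding this amount of noise.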
In the next post, I will share with you how to do a similar approach with exponential and logarithmic equations.