Imagine that you want to **replicate a dataset** that follows a specific behavior. You already know that the data should follow a quadratic equation with positive values and a minimum at x = 500. Is there a way to simulate such a scenario with Python?

Summary

This tutorial shows how to create a **sample dataset** for regression problems. I will cover linear regression and non-linear regression equations (polynomial, exponential…), and I will keep updating this tutorial with new regression problems in the future. You can find the notebook for this tutorial **on my GitHub account**.

# Linear Regression

## Simple Regression

As you might know, linear regression is based on **linear equations** with the following form:

y = ax + b

where a is the **slope** and b is the **y-intercept** (the point where the line crosses the y-axis).
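As a quick illustration of this form, the sketch below builds a noisy line directly with NumPy and then recovers the slope and intercept with a least-squares fit. The values of `a`, `b`, the noise level, and the variable names are arbitrary choices for this example, not values used elsewhere in the tutorial.

```python
import numpy as np

# Hypothetical parameters chosen only for illustration
a, b = 2.5, 10.0                      # slope and y-intercept
rng = np.random.default_rng(12345)    # seeded generator for reproducibility

x_line = np.linspace(0, 100, 200)                         # 200 evenly spaced x values
noise = rng.normal(loc=0.0, scale=5.0, size=x_line.shape)
y_line = a * x_line + b + noise                           # y = ax + b plus Gaussian noise

# A degree-1 least-squares fit should return values close to a and b
a_hat, b_hat = np.polyfit(x_line, y_line, deg=1)
print(a_hat, b_hat)
```

The fitted values will not match `a` and `b` exactly, because the noise shifts each point away from the line.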

To build a linear equation, one option is to use the function **make_regression()** from the **scikit-learn (sklearn) library** to create samples of X and Y. The main parameters you can pass to this function are:

- **n_samples**: number of samples
- **n_features**: number of variables (features)
- **n_informative**: number of informative variables used to create the output
- **n_targets**: number of regression targets
- **noise**: standard deviation of the Gaussian noise applied to the output
- **random_state**: the seed that controls the randomness of the output

You can take a look at the rest of the parameters on the Sklearn documentation.

First of all, let’s import the libraries:

```
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
```

Now, let’s apply the make_regression() function:

```
x, y = datasets.make_regression(n_samples=200, n_features=1,
                                n_informative=1, n_targets=1,
                                noise=20, random_state=12345,
                                effective_rank=None)
```
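If you want to know which slope the generator used, `make_regression()` also accepts `coef=True`, which returns the coefficients of the underlying linear model along with the samples. The following is a minimal sketch with the same settings as the call above; the names `x_demo`, `y_demo`, and `true_coef` are only illustrative.

```python
from sklearn.datasets import make_regression

# Same settings as in the tutorial, plus coef=True to get the true slope back
x_demo, y_demo, true_coef = make_regression(
    n_samples=200, n_features=1, n_informative=1, n_targets=1,
    noise=20, random_state=12345, coef=True,
)

print(x_demo.shape)   # (200, 1): one row per sample, one column per feature
print(y_demo.shape)   # (200,)
print(true_coef)      # coefficient of the noise-free linear model
```

Note that X comes back as a two-dimensional array even with a single feature, which is the shape scikit-learn estimators expect.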

If you want to control the range of the X and Y values, you can use the NumPy function **np.interp()**, specifying the new minimum and maximum for each one:

```
x = np.interp(x, (x.min(), x.max()), (3876, 15678))
y = np.interp(y, (y.min(), y.max()), (1678, 5435))
```
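Used this way, `np.interp()` performs a linear rescaling: the original minimum maps to the new lower bound, the original maximum maps to the new upper bound, and everything in between keeps its relative position. A small sketch on a made-up array (the values and bounds are hypothetical) makes this visible:

```python
import numpy as np

values = np.array([-2.0, 0.0, 1.0, 2.0])   # hypothetical sample values

# Map the interval [min, max] = [-2, 2] linearly onto [100, 500]
scaled = np.interp(values, (values.min(), values.max()), (100, 500))
print(scaled)   # [100. 300. 400. 500.]
```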

Let’s plot the function to see what it looks like:

```
plt.ion()
plt.plot(x,y,'.')
```

# Non-linear Regression

## Polynomial Regression

In the case of polynomial regression, we need a more **complex methodology**. For this example, I will build a **cubic equation**. So, we are looking for an equation that has the following form:

y = ax^3 + bx^2 + cx + d

First, let’s choose the X coordinates of the local **maximum** and the local **minimum** of the desired function. The first derivative is zero at those points, so we can use them to build it.

For this specific exercise, I will use max_x = 3000 and min_x = 5000.

Then, write the first derivative in **factored form**, with those two values as its roots. In this case, it will be the following one:

f'(x) = (x – 3000)(x – 5000)

After developing, we have the following equation:

f'(x) = x^2 – 8000x + 1.5·10^7
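If you prefer not to expand the product by hand, SymPy can do it. This is a small check of the step above; the variable name `x_sym` is only illustrative.

```python
from sympy import Symbol, expand

x_sym = Symbol('x')

# Expand the factored derivative (x - 3000)(x - 5000)
derivative = expand((x_sym - 3000) * (x_sym - 5000))
print(derivative)   # x**2 - 8000*x + 15000000
```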

As I said, this is the **first derivative**. To get our cubic equation, we need to **integrate** it, which can be done symbolically with the SymPy library. Let’s import what we need:

```
# Only SymPy is needed for the symbolic integration
from sympy import Symbol, integrate, init_printing
```

Now, I will create the symbol x, which is the variable of our equation. Note that this reuses the name x, so it replaces the array from the linear example:

```
init_printing(use_unicode=False, wrap_line=False)
x = Symbol('x')
```

In the next step, I will integrate the first derivative to get the cubic equation:

`integrate(x**2 - 8000*x + 1.5*(10**7), x)`

This is the result of the integral that will be the base to build our cubic equation:

y = (1/3)x^3 − 4000x^2 + 15000000x
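A quick way to confirm that this cubic has the intended turning points is to differentiate it again and look at the sign of the second derivative. The sketch below uses `Rational(1, 3)` to keep the arithmetic exact; the names `x_sym`, `cubic`, `first`, and `second` are illustrative.

```python
from sympy import Symbol, Rational, diff, solve

x_sym = Symbol('x')
cubic = Rational(1, 3) * x_sym**3 - 4000 * x_sym**2 + 15000000 * x_sym

first = diff(cubic, x_sym)       # first derivative: x**2 - 8000*x + 15000000
second = diff(cubic, x_sym, 2)   # second derivative: 2*x - 8000

stationary = sorted(solve(first, x_sym))
print(stationary)                  # [3000, 5000]
print(second.subs(x_sym, 3000))    # -2000, negative: local maximum
print(second.subs(x_sym, 5000))    # 2000, positive: local minimum
```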

We can apply two kinds of transformations to this equation:

- The first one is changing the constant term d to **move the function up and down** along the y-axis.
- The second one is to multiply or divide the whole function by a constant to **stretch or flatten it**.
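Both transformations change the Y values but leave the X position of the turning points untouched, which can be checked numerically. The sketch below is only an illustration: the helper `cubic()`, the grid around the local minimum, and the scaling factor `k` are assumptions made for this example.

```python
import numpy as np

def cubic(x, a=1/3, b=-4000, c=15000000, d=0):
    """Evaluate a*x^3 + b*x^2 + c*x + d."""
    return a * x**3 + b * x**2 + c * x + d

xs = np.arange(4000.0, 6001.0)   # grid of x values around the local minimum
base = cubic(xs)
shifted = cubic(xs, d=3000)      # vertical shift: only d changes
k = 15000000                     # illustrative positive scaling factor
scaled = base / k                # flatten the curve by dividing by a constant

# The x position of the minimum on the grid, for each version of the curve
x_min_base = xs[base.argmin()]
x_min_shifted = xs[shifted.argmin()]
x_min_scaled = xs[scaled.argmin()]
print(x_min_base, x_min_shifted, x_min_scaled)   # each is 5000.0
```

Dividing by a negative constant would flip the curve instead, turning the minimum into a maximum.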

Let’s plot the resultant equation to see what it looks like by defining a function:

```
def plot_me(a, b, c, d):
    # Evaluate y = a*x^3 + b*x^2 + c*x + d on a fine grid of x values
    x = np.arange(0, 7000, 0.05)
    y = [(a*i**3 + b*i**2 + c*i + d) for i in x]
    plt.plot(x, y, label='cubic', linestyle='-')
    plt.grid(True)
    plt.show(block=False)
    plt.pause(10)  # keep the window open for 10 seconds
    plt.close()

plot_me(1/3, -4000, 15000000, 0)
```

Let’s apply some transformations to the equation so that the Y values stay between 0 and 5000. To do so, I will divide the whole equation by 15000000. I will also move it 3000 units up (that is, d = 3000).

`plot_me(0.3333333/15000000, -4000/15000000, 15000000/15000000, 3000)`
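To turn this curve into an actual sample dataset, one possible approach is to draw random X values in the plotted range, evaluate the transformed cubic, and add Gaussian noise. This is a minimal sketch under assumptions of my own: the sample size, the noise standard deviation of 50, and the variable names are not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(12345)   # seeded generator for reproducibility

# Coefficients of the transformed cubic used in the last plot
k = 15000000
a, b, c, d = (1/3) / k, -4000 / k, 15000000 / k, 3000

n_samples = 200                                       # assumed sample size
x_s = rng.uniform(0, 7000, size=n_samples)            # random x positions in the plotted range
y_clean = a * x_s**3 + b * x_s**2 + c * x_s + d       # points exactly on the curve
y_s = y_clean + rng.normal(0, 50, size=n_samples)     # add Gaussian noise (sd = 50, assumed)

print(x_s.shape, y_s.shape)   # (200,) (200,)
```

You can then plot the samples with `plt.plot(x_s, y_s, '.')`, as in the linear example.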

In the next post, I will show you how to apply a similar approach to **exponential** and **logarithmic** equations.