IMPLEMENTATION OF LINEAR REGRESSION IN PYTHON

Hello everyone, in this blog I will explain how to implement Linear Regression using Python.

The prerequisite for this blog is the mathematical implementation of Linear Regression, which I explained in one of my previous blogs. I have pasted the link below.

Click here to know the mathematical implementation of Linear Regression

Get to know the mathematical implementation first, or else you may find this blog difficult to follow.

So let's start implementing Linear Regression in Python.

NOTE:

Lines highlighted in yellow are code.

STEP 1:

Import the libraries required for numerical operations, loading data into data frames, and plotting graphs.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

STEP 2:

Read the data
Data usually comes in formats such as CSV or Excel. There are other formats too, but I mention only these two.
The dataset I'm using is tiny: it has only two columns, SAT and GPA, and we will apply Linear Regression to it.
This dataset is in CSV format.
So, let's read our dataset.
The code to read the dataset is

data=pd.read_csv('/content/sat.csv')

Here, I'm loading the dataset into the variable data.
If I type data and run it, the dataset is displayed.

data
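If you don't have sat.csv at hand, here is a small sketch of the same read step. The rows below are made up purely for illustration (they are not the real dataset), and io.StringIO stands in for the file path so the snippet runs anywhere.

```python
import io

import pandas as pd

# Hypothetical stand-in for sat.csv: made-up rows, NOT the real dataset.
csv_text = """SAT,GPA
1700,3.2
1650,3.0
1800,3.5
1750,3.4
1600,2.9
"""

# pd.read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first few rows
print(data.shape)    # (number of rows, number of columns)
print(data.dtypes)   # data type of each column
```

Checking the shape and dtypes right after loading is a quick way to confirm that the columns were parsed as numbers.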

STEP 3:

Split the data into input and output, that is, X and Y.
Here SAT is the input (X) and GPA is the output (Y).
We are predicting the GPA based on the SAT score.
So, let's split.

x=data[['SAT']]
y=data[['GPA']]

So, now our dataset has been split into X and Y.
Let's see how it looks.
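The double brackets in the split matter, so here is a small sketch of the difference. The values are made up for illustration, and x_series is just a name I'm using for the single-bracket version.

```python
import pandas as pd

# Made-up values, only to show the shapes.
data = pd.DataFrame({"SAT": [1700, 1650, 1800], "GPA": [3.2, 3.0, 3.5]})

x = data[["SAT"]]        # double brackets: a DataFrame with one column (2D)
x_series = data["SAT"]   # single brackets: a Series (1D)

print(type(x).__name__, x.shape)
print(type(x_series).__name__, x_series.shape)
```

scikit-learn expects the input X to be two-dimensional (rows by features), which is why the double-bracket form is used for x.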

STEP 4:

Do EDA (Exploratory Data Analysis)
Let's visualize the data.
A scatter plot shows how X and Y are distributed.

plt.scatter(x,y) # draws a scatter plot of X against Y
plt.xlabel('SAT') # sets the label of the X axis
plt.ylabel('GPA') # sets the label of the Y axis
plt.title("sat score") # sets the title of the graph

Our plot will look like this

STEP 5:

Apply the Linear Regression algorithm
We can import LinearRegression from the sklearn package.
So, let's import it.

from sklearn.linear_model import LinearRegression

Once the import is done, we initialize the model.

model=LinearRegression()

Now, model holds a LinearRegression object that has not been trained yet.
Once initialization is done, we have to fit our model with the independent variable (X) and the dependent variable (Y).

model.fit(x,y)

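This phase is also known as the training phase of the model.
Once the training phase is done, we use the model to make predictions.

y_predict=model.predict(x)

This is the prediction step. Note that we are predicting on the same x that the model was trained on, so strictly speaking this is not a testing phase: a real test uses data the model has not seen.
So, training and prediction are done.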
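Since model.predict(x) above uses the same x the model was fitted on, a held-out test set gives a fairer check. Here is a minimal sketch using train_test_split from sklearn.model_selection. This is a technique I'm adding, not something done in the steps above, and the data is synthetic (generated with a made-up slope and noise), so the numbers it prints say nothing about the real SAT dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only (NOT the real sat.csv).
rng = np.random.default_rng(0)
sat = rng.integers(1400, 2000, size=50)
gpa = 0.3 + 0.0017 * sat + rng.normal(0, 0.2, size=50)
data = pd.DataFrame({"SAT": sat, "GPA": gpa})

x = data[["SAT"]]
y = data[["GPA"]]

# Keep 20% of the rows aside; the model never sees them during fit.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(x_train, y_train)              # training phase
y_test_predict = model.predict(x_test)   # testing phase on unseen rows

print(x_train.shape, x_test.shape)
print(model.score(x_test, y_test))       # R^2 on the held-out rows
```

With a dataset as tiny as ours the split leaves very few test rows, so treat the test score as a rough indication only.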
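Now we have to calculate Yhat to draw the regression line.
To calculate Yhat we need the intercept and the coefficient.
Let's get the intercept. Note the trailing underscore: scikit-learn stores fitted values in attributes such as intercept_ and coef_.

model.intercept_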

Our output will be an array containing approximately 0.2750403.

Let's get the coefficient

model.coef_

Our output will be an array containing approximately 0.00165569.

So, as we know

Yhat=model.intercept_+model.coef_*x
Yhat=0.2750403 +0.00165569 *x

If I type Yhat and run it, I will get the predicted values.

Yhat
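The Yhat computed by hand should match what model.predict returns, because predict applies the same intercept and coefficient. Here is a small sketch that checks this. The data is made up for illustration, and intercept and slope are just names I'm using for the unpacked values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up values, for illustration only (NOT the real sat.csv).
data = pd.DataFrame({
    "SAT": [1600, 1650, 1700, 1750, 1800],
    "GPA": [2.9, 3.0, 3.2, 3.4, 3.5],
})
x = data[["SAT"]]
y = data[["GPA"]]

model = LinearRegression().fit(x, y)

# Because y is 2D here, intercept_ has shape (1,) and coef_ has shape (1, 1).
intercept = model.intercept_[0]
slope = model.coef_[0][0]

yhat = intercept + slope * x["SAT"].to_numpy()   # the line equation by hand
y_predict = model.predict(x).ravel()             # the same thing via sklearn

print(np.allclose(yhat, y_predict))
```

So in practice y_predict can be plotted directly; computing Yhat by hand is mainly useful for seeing what the model does internally.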

STEP 6:

Plot the data again, this time together with the regression line.

plt.scatter(x, y)
plt.plot(x,Yhat,c='r')
plt.xlabel('SAT')
plt.ylabel('GPA')
plt.title("sat score")
plt.show()
Here the red line is the regression line.
The vertical distance from this line to an actual value is called the residual. The smaller the residuals, the better the line fits the data.
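To make the residual concrete, here is a small sketch that fits a straight line and subtracts the fitted values from the actual ones. The values are made up for illustration, and I'm using np.polyfit here instead of sklearn only to keep the snippet short; it also fits a least-squares line.

```python
import numpy as np

# Made-up values, for illustration only (NOT the real sat.csv).
sat = np.array([1600, 1650, 1700, 1750, 1800], dtype=float)
gpa = np.array([2.9, 3.0, 3.2, 3.4, 3.5])

# Degree-1 least-squares fit; polyfit returns the highest power first.
slope, intercept = np.polyfit(sat, gpa, 1)
fitted = intercept + slope * sat

residuals = gpa - fitted   # actual value minus value on the line

print(residuals)
print(np.abs(residuals).mean())   # average distance from the line
```

For a least-squares line with an intercept, the positive and negative residuals cancel out, which is why the metrics in the next step use squared or absolute values.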

STEP 7:

Let's calculate mean_squared_error, mean_absolute_error, and r2_score.
These are called evaluation metrics.
Don't know what evaluation metrics are? Click here to learn about the evaluation metrics of regression
The code to import the metrics is

from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score

So, here are the snippets of metrics

mean_squared_error
mean_squared_error(y,y_predict)
mean_absolute_error
mean_absolute_error(y,y_predict)
r2_score
r2_score(y,y_predict)
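To see what these three functions compute, here is a sketch that works them out by hand with NumPy and compares the results with sklearn. The y_true and y_pred values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only.
y_true = np.array([2.9, 3.0, 3.2, 3.4, 3.5])
y_pred = np.array([2.8, 3.1, 3.2, 3.3, 3.6])

errors = y_true - y_pred

mse = np.mean(errors ** 2)        # mean of squared errors
mae = np.mean(np.abs(errors))     # mean of absolute errors
# R^2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

Note the argument order in the sklearn functions: actual values first, predictions second. It does not matter for the two error metrics, but it does for r2_score.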

As I said in my previous blog, a common rule-of-thumb threshold for r2_score is 0.5.
If the score is below 0.5, the model is not considered a good fit.
If the score is above 0.5, the model is considered a good fit.
So, coming back to our model, we have an r2_score of 0.40600, which indicates that the model is not a good fit.