Hello everyone! In this blog I will explain how to implement linear regression in Python.
The prerequisite is the mathematical implementation of linear regression, which I explained in one of my previous blogs; the link is below.
Click here to know the mathematical implementation of Linear Regression
Go through the mathematical implementation first, or you may find this blog difficult to follow.
So let's start: I will now show how to implement linear regression in Python.
NOTE:
- Lines highlighted in yellow are code.
STEP 1:
- Import the libraries needed for numerical operations, loading data frames, and plotting graphs.
- import numpy as np
- import pandas as pd
- import matplotlib.pyplot as plt
STEP 2:
- Read the data
- Data usually comes in formats such as CSV or Excel; there are others, but I mention only these two.
- The dataset I'm using is tiny: it has only two columns, SAT and GPA, and we will apply linear regression to it.
- This dataset is in CSV format.
- So, let's read our dataset.
- The code to read the dataset is
- data=pd.read_csv('/content/sat.csv')
- Here, I'm loading the dataset into the variable data.
- If I type data in a notebook cell or at the Python prompt, the dataset is displayed.
- data
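A quick way to check what was loaded is sketched below. The rows are made-up values for illustration, not the contents of sat.csv, and an in-memory string stands in for the file so the snippet runs on its own.

```python
import io

import pandas as pd

# Made-up rows for illustration only; these are NOT the author's sat.csv.
csv_text = """SAT,GPA
1500,2.6
1600,2.9
1700,3.0
1800,3.3
1900,3.4
"""

# pd.read_csv accepts a file path or a file-like object,
# so StringIO stands in for '/content/sat.csv' here.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())   # first few rows
print(data.shape)    # (number of rows, number of columns)
print(data.dtypes)   # data type of each column
```

With the real file, only the argument to pd.read_csv changes; the inspection calls stay the same.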
STEP 3:
- Split the data into input (X) and output (Y).
- Here SAT is the input, which is X, and GPA is the output, which is Y.
- Based on the SAT scores, we are predicting the GPA.
- So, let's split.
- x=data[['SAT']]
- y=data[['GPA']]
- So, now our dataset is split into X and Y.
- Let's see how they look.
- x
- y
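The double brackets in the split matter: they return a two-dimensional DataFrame, which is the shape scikit-learn expects for the input. A small sketch of the difference, using made-up values rather than the real dataset (the name y_series is only for illustration):

```python
import pandas as pd

# Made-up values for illustration only.
data = pd.DataFrame({
    "SAT": [1500, 1600, 1700, 1800, 1900],
    "GPA": [2.6, 2.9, 3.0, 3.3, 3.4],
})

x = data[["SAT"]]        # double brackets: DataFrame with one column (2-D)
y = data[["GPA"]]        # double brackets: DataFrame with one column (2-D)
y_series = data["GPA"]   # single brackets: Series (1-D)

print(x.shape)           # (5, 1)
print(y.shape)           # (5, 1)
print(y_series.shape)    # (5,)
```

The input has to be 2-D; the output can be either 1-D or 2-D, but a 2-D output makes the fitted coefficient and intercept come back wrapped in arrays.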
STEP 4:
- Do EDA (exploratory data analysis)
- Let's visualize the data.
- A scatter plot shows how X and Y are distributed.
- plt.scatter(x,y) # draws a scatter plot of X against Y
- plt.xlabel('SAT') # plt.xlabel sets the label of the X axis
- plt.ylabel('GPA') # plt.ylabel sets the label of the Y axis
- plt.title("sat score") # plt.title sets the title of the graph
- Our plot will look like this
STEP 5:
- Apply the linear regression algorithm
- We can import LinearRegression from the sklearn package.
- So, let's import it.
- from sklearn.linear_model import LinearRegression
- Once the import is done, we initialize the model.
- model=LinearRegression()
- Now model holds an untrained LinearRegression object.
- Once initialization is done, we fit the model with the independent variable (X) and the dependent variable (Y).
- model.fit(x,y)
- This is known as the training phase of the model.
- Once the training phase is done, we ask the model for predictions. This small example has no separate test set, so we predict on the same x we trained on.
- y_predict=model.predict(x)
- This is known as the testing phase, although a proper test uses data the model has not seen during training.
- So, the training and testing phases are done.
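For a testing phase on data the model has not seen, one common approach is to hold out part of the dataset with train_test_split. This is a sketch of that alternative, not what the walkthrough above does; the data is synthetic and the generating slope, intercept, and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (arbitrary slope, intercept, noise).
rng = np.random.default_rng(0)
sat = rng.uniform(1400, 2000, size=60).reshape(-1, 1)            # 2-D input
gpa = 0.3 + 0.0016 * sat[:, 0] + rng.normal(0, 0.2, size=60)     # 1-D output

# Hold out 25% of the rows; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    sat, gpa, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(x_train, y_train)              # training phase: uses only the training rows

y_test_pred = model.predict(x_test)      # testing phase: rows the model never saw
print(model.score(x_test, y_test))       # R^2 on the held-out rows
```

With a dataset as small as the one in this blog, a held-out score can vary a lot from split to split, so treat it as a rough check.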
- Now we have to calculate Yhat to draw the regression line.
- To calculate Yhat we need the intercept and the coefficient.
- Let's get the intercept. Note the trailing underscore in the attribute name.
- model.intercept_
- Our output will be the intercept, approximately 0.2750403.
- Let's get the coefficient.
- model.coef_
- Our output will be the coefficient, approximately 0.00165569.
- So, as we know,
- Yhat=model.intercept_+model.coef_*x
- Yhat=0.2750403 +0.00165569 *x
- If I type Yhat, I will get the predicted values.
- Yhat
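Yhat computed from the intercept and coefficient should match what model.predict returns, since that is the same line equation. A minimal sketch of that check, on made-up values rather than the real dataset (so the numbers differ from the ones above):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up values for illustration only.
data = pd.DataFrame({
    "SAT": [1500, 1600, 1700, 1800, 1900],
    "GPA": [2.6, 2.9, 3.0, 3.3, 3.4],
})
x = data[["SAT"]]
y = data[["GPA"]]

model = LinearRegression().fit(x, y)

# Fitted attributes end with an underscore: intercept_ and coef_.
# Because y is 2-D, intercept_ has shape (1,) and coef_ has shape (1, 1).
intercept = model.intercept_[0]
slope = model.coef_[0, 0]

# Line equation applied by hand.
yhat = intercept + slope * x["SAT"].to_numpy()

# Same predictions from the model, flattened to 1-D for comparison.
y_predict = model.predict(x).ravel()

print(np.allclose(yhat, y_predict))
```

Writing model.intercept without the underscore raises an AttributeError, which is an easy mistake to make.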
STEP 6:
- Do exploratory data analysis again, this time with the regression line.
- plt.scatter(x, y)
- plt.plot(x,Yhat,c='r')
- plt.xlabel('SAT')
- plt.ylabel('GPA')
- plt.title("sat score")
- plt.show()
- Here the red line is the regression line.
- The distance from this line to an actual value is called the residual. The smaller the residuals, the better the line fits the data.
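The residuals can also be computed directly instead of only read off the plot. A small sketch on made-up values (not the real dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up values for illustration only.
sat = np.array([1500, 1600, 1700, 1800, 1900], dtype=float).reshape(-1, 1)
gpa = np.array([2.6, 2.9, 3.0, 3.3, 3.4])

model = LinearRegression().fit(sat, gpa)
yhat = model.predict(sat)

# Residual = actual value minus predicted value, one per data point.
residuals = gpa - yhat

print(residuals)
print(residuals.sum())          # close to 0 for a least-squares line with an intercept
print((residuals ** 2).sum())   # the quantity least squares makes as small as possible
```

The residuals summing to roughly zero is a property of least squares with an intercept, so it does not by itself say the fit is good; the sum of squared residuals is the number to watch.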
STEP 7:
- Let's calculate mean_squared_error, mean_absolute_error, and r2_score.
- These are called evaluation metrics.
- Don't know what evaluation metrics are? Click here to know the Evaluation metrics of Regression
- The code to import the metrics is
- from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
- So, here are the snippets for each metric.
- mean_squared_error
- mean_squared_error(y,y_predict)
- mean_absolute_error
- mean_absolute_error(y,y_predict)
- r2_score
- r2_score(y,y_predict)
- As I said in my previous blog, a common rule-of-thumb threshold for r2_score is 0.5.
- If the score is below 0.5, the model is not considered a good fit.
- If the score is above 0.5, the model is considered a good fit.
- So, coming back to our model, we have an r2_score of 0.40600, which is below 0.5 and indicates that the model is not a good fit.
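To see what the three metrics compute, they can be written out by hand with NumPy and compared with the sklearn functions. The values below are made up for illustration, so the scores are unrelated to the 0.40600 reported above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values for illustration only.
y_true = np.array([2.6, 2.9, 3.0, 3.3, 3.4])
y_pred = np.array([2.64, 2.84, 3.04, 3.24, 3.44])

errors = y_true - y_pred

# Mean squared error: average of the squared errors.
mse = np.mean(errors ** 2)

# Mean absolute error: average of the absolute errors.
mae = np.mean(np.abs(errors))

# R^2: 1 minus (residual sum of squares / total sum of squares).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# The hand-written versions agree with sklearn's.
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

The R^2 formula shows why a low score means a poor fit: when the residual sum of squares is nearly as large as the total sum of squares, the line explains little more than simply predicting the mean GPA would.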






