A quick and easy guide on statistical regression for machine learning
What you’ll learn
- Python Basics, Statistics and Regression behind Machine Learning in Python and also using Manual Calculations
Hello and welcome to my new course Basic Statistics and Regression for Machine Learning
You know.. there are mainly two kinds of ML enthusiasts.
The first type fantasize about Machine Learning and Artificial Intelligence. Thinking that its a magical voodoo thing. Even if they are into coding, they will just import the library, use the class and its functions. And will rely on the function to do the magic in the background.
The second kind are curious people. They are interested to learn what’s actually happening behind the scenes of these functions of the class. Even though they don’t want to go deep with all those mathematical complexities, they are still interested to learn what’s going on behind the scenes at least in a shallow Layman’s perspective way.
In this course, we are focusing mainly on the second kind of learners.
That’s why this is a special kind of course. Here we discuss the basics of Machine learning and the Mathematics of Statistical Regression which powers almost all of the the Machine Learning Algorithms.
We will have exercises for regression in both manual plain mathematical calculations and then compare the results with the ones we got using ready-made python functions.
Here is the list of contents that are included in this course.
In the first session, we will set-up the computer for doing the basic machine learning python exercises in your computer. We will install anaconda, the python framework. Then we will discuss about the components included in it. For manual method, a spreadsheet program like MS Excel is enough.
Before we proceed for those who are new to python, we have included few sessions in which we will learn the very basics of python programming language. We will learn Assignment, Flow control, Lists and Tuples, Dictionaries and Functions in python. We will also have a quick peek of the Python library called Numpy which is used for doing matrix calculations which is very useful for machine learning and also we will have an overview of Matplotlib which is a plotting library in python used for drawing graphs.
In the third session, we will discuss the basics of machine learning and different types of data.
In the next session we will learn a statistics technique called Central Tendency Analysis which finds out a most suitable single central value that attempts to describe a set of data and its behaviour. In statistics, the three common measures of central tendency are the mean, median, and mode. We will find mean, median, and mode using both manual calculation method and also using python functions.
After that we will try the statistics techniques called variance and standard deviation. Variance of a dataset measures how far a set of numbers is spread out from their average or central value. The Standard Deviation is a measure of how much these spread out numbers are. We will at first try the variance and standard deviation manually using plain mathematical calculations. After that, we will implement a python program to find both these values for the same dataset and we will verify the results.
Then comes a simple yet very useful technique called percentile. In statistics, a percentile is a score below which a given percentage of scores in a distribution falls. For easy understanding, we will try an example with manual calculation of percentile using raw data set at first and later we will do it with the help of python functions. We will then double-check the results
After that we will learn about distributions. It describes the grouping or the density of the samples in a dataset. There are two types. Normal Distribution where probability of x is highest at centre and lowest in the ends whereas in Uniform Distribution probability of x is constant. We will try both these distributions using visualization of data. We will do the calculation using manual calculation methods and also using python language.
There is a value called z score or standard score in statistics which helps us to determine where the value lies in the distribution. For z score also at first we will try calculation using python functions. Later the z score will be calculated with manual methods and will compare the results.
Those were the case of a single valued dataset. That is the dataset containing only a single column of data. For multi-variable dataset, we have to calculate the regression or the relation between the columns of data. At first we will visualize the data, analyse its form and structure using a scatter plot graph.
Then as the first type of regression analysis we will start with an introduction to simple Linear Regression. At first we will manually find the co-efficient of correlation using manual calculation and will store the results. After that we will find the slope equation using the obtained results. And then using the slope equation, we will predict future values. This prediction is the basic and important feature of all Machine Learning Systems. Where we give the input variable and the system will predict the output variable value.
Then we will repeat all these using Python Numpy library Methods and will do the future value prediction and later will compare the results. We will also discuss the scenarios which we can consider as a strong Linear Regression or weak Linear Regression.
Then we will see another type of regression analysis technique called as Polynomial Linear Regression which is best suited for finding the relation between the independent variable x and the dependent variable y.
The regression line in the graph will be a straight line with slope for Simple Linear Regression and for Polynomial Linear Regression, it will be a curve.