Thursday, September 4, 2014

Regression concepts simplified

Regression is a modelling technique that is widely used in analytics and is perhaps the easiest one to understand. In this post I share what I have learned about the concept in simple words.

What is Simple Linear Regression?

Simple linear regression lets you determine the functional dependency between two sets of numbers. For example, we can use regression to determine the relation between ice cream sales and average temperature.

Since we are talking about the functional dependency between two variables, we need one independent variable and one dependent variable. In the example above, if a change in temperature leads to a change in ice cream sales, then temperature is the independent variable and sales is the dependent variable.

The dependent variable is also called the criterion, response variable or label. It is denoted by Y.

The independent variable is also referred to as the covariate, predictor or feature. It is denoted by X.



How does it work?

Most modelling techniques, including regression, make certain assumptions about the data they deal with. The principal assumption in linear regression is "linearity and additivity". Let's focus on the linearity part for now. It means the dependent variable is a linear function of the independent variable.

Now if we consider a two-dimensional space, then each pair of observations (Xi, Yi) is a point.



Based on the linearity assumption discussed above, the model that fits the given data points should be a straight line.

So in simple linear regression we have a bunch of points and we are trying to find the straight line with the 'best fit' to accommodate them.

The equation of a straight line is,

Y = c + mX

So the regression (line) equation can be written as,

Y = β0 + β1 X

β0 is the intercept and β1 is the slope.

Both are called model coefficients or model parameters.

β1 represents the amount by which Y should change if we change X by one unit.

So basically, building a simple linear regression model means solving this equation for the values of the model coefficients.

Now looking at the scatter plot above, it is impossible to fit a straight line through all data points. So we modify the equation slightly.

Y = β0 + β1 X + e  

Where "e" is the error while calculating Y.

Now let's get back to how we decide the best fit. The regression line that minimizes the error component in the equation above is considered the best fit.

Consider the following image to understand the error components.


The distance between the predicted point (on the true regression line) and the observed point is an error. It is also called a disturbance.

There is a similar concept called the residual. The important thing to note here is that the term residual is NOT the same as error. The distance between the predicted point (on the estimated regression line) and the observed point is a residual.
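To make the distinction concrete, here is a minimal Python sketch. It assumes a made-up "true" line (β0 = 2, β1 = 0.5) and simulated noise, neither of which comes from this post, and it estimates the line with the least squares formulas discussed further down. In real problems the true line is unknown, which is why errors cannot be observed while residuals can.

```python
import random

random.seed(0)

# Hypothetical "true" model, chosen only for illustration.
TRUE_B0, TRUE_B1 = 2.0, 0.5

xs = [float(x) for x in range(1, 21)]
# Observed Y = point on the true line + a random disturbance.
ys = [TRUE_B0 + TRUE_B1 * x + random.gauss(0, 1) for x in xs]

# Estimate the line from the sample (least squares formulas).
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

# Error: observed Y minus the point on the TRUE line.
errors = [y - (TRUE_B0 + TRUE_B1 * x) for x, y in zip(xs, ys)]
# Residual: observed Y minus the point on the ESTIMATED line.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

print("sum of errors:   ", sum(errors))
print("sum of residuals:", sum(residuals))
```

With an intercept in the model, the residuals always sum to (numerically) zero, while the errors generally do not. That is one quick way to see that they are different quantities.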

It bothered me for a while why the error is not the shortest distance between the point and the regression line, that is, why it is measured parallel to the Y axis and not perpendicular to the regression line.


It is because the regression equation we have created says e is the error in determining Y. So we should measure it only in terms of Y, as (Yi' - Yi), where Yi' is the predicted value and Yi is the observed value.
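The two ways of measuring the distance can be compared with a small Python sketch. The line (β0 = 1, β1 = 1) and the point (2, 5) are arbitrary values picked for illustration, and the function names are my own.

```python
import math


def vertical_distance(b0, b1, x, y):
    """Distance measured parallel to the Y axis: |observed Y - predicted Y|."""
    return abs(y - (b0 + b1 * x))


def perpendicular_distance(b0, b1, x, y):
    """Shortest distance from the point to the line b1*x - y + b0 = 0."""
    return abs(b1 * x - y + b0) / math.sqrt(b1 ** 2 + 1)


b0, b1 = 1.0, 1.0   # arbitrary example line Y = 1 + X
point = (2.0, 5.0)  # arbitrary observed point

v = vertical_distance(b0, b1, *point)
p = perpendicular_distance(b0, b1, *point)
print(v, round(p, 4))  # prints: 2.0 1.4142
```

The perpendicular distance is never larger than the vertical one, but only the vertical one is expressed purely in units of Y, which is what the regression equation asks for.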

Let's talk about the math behind minimizing the error.

Based on the previous discussion, the intuitive idea is that the sum of the absolute errors, ∑|Yi − Yi'|, should be minimum.

However, the actual criterion is somewhat different; it is called least squares. According to it, ∑(Yi − Yi')² should be minimum. The reason is that absolute values are difficult to work with mathematically (especially in calculus). [1]
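The calculus convenience shows up in a short derivation. This is the standard textbook argument rather than something taken from the references above; S is a name I introduce here for the sum of squares, and X̄ and Ȳ denote the sample means.

```latex
\begin{align*}
S(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right)^2 \\
\frac{\partial S}{\partial \beta_0} &= -2 \sum_{i=1}^{n} \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0 \\
\frac{\partial S}{\partial \beta_1} &= -2 \sum_{i=1}^{n} X_i \left( Y_i - \beta_0 - \beta_1 X_i \right) = 0 \\
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad
\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}
\end{align*}
```

Because the square is differentiable everywhere, setting the two derivatives to zero gives a closed-form answer. The absolute value has no derivative at zero, so the same trick does not work for ∑|Yi − Yi'|.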

Many people do not find that reason convincing enough. So let's look at this passage from the book Robust Statistics:
There was a dispute between Eddington (1914, p.147) and Fisher (1920, footnote on p. 762) about the relative merits of dn and sn.

dn = (1/n) ∑|xi − x̄|
sn = [ (1/n) ∑(xi − x̄)² ]^(1/2)

Fisher seemingly settled the matter by pointing out that for normal observations sn is about 12% more efficient than dn.

A more efficient estimator is one that needs fewer samples than a less efficient one to achieve a given accuracy.[2]
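Fisher's point can be checked roughly with a simulation. The sketch below is my own illustration, not from the book: it draws many normal samples, computes dn and sn on each, and compares how much each estimator scatters relative to its own mean. The sample size, the number of trials and the function names are arbitrary choices.

```python
import math
import random

random.seed(42)


def d_n(xs):
    """Mean absolute deviation from the sample mean."""
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def s_n(xs):
    """Root mean squared deviation from the sample mean."""
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def standardized_variance(values):
    """Variance divided by squared mean, so estimators on different scales compare fairly."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return var / m ** 2


N_OBS, N_TRIALS = 50, 4000  # arbitrary simulation sizes
d_vals, s_vals = [], []
for _ in range(N_TRIALS):
    sample = [random.gauss(0, 1) for _ in range(N_OBS)]
    d_vals.append(d_n(sample))
    s_vals.append(s_n(sample))

# Efficiency of dn relative to sn; a value below 1 means sn is the more efficient one.
relative_efficiency = standardized_variance(s_vals) / standardized_variance(d_vals)
print(round(relative_efficiency, 3))
```

For normal data the large-sample value of this ratio is about 0.88, which is where the "about 12%" comes from. A single simulation run will only land somewhere near that figure.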

To summarize the discussion, we implement simple linear regression using the least squares fitting criterion in order to determine the functional dependency between two sets of numbers.
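Putting the pieces together, a minimal Python sketch of simple linear regression with least squares could look like this. The temperature and sales numbers are invented for the ice cream example and the function names are my own.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) that minimize the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept
    return b0, b1


def sum_squared_residuals(xs, ys, b0, b1):
    """The quantity that least squares minimizes."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: average temperature and ice cream sales.
temperature = [14.0, 16.0, 20.0, 23.0, 26.0, 30.0]
sales = [215.0, 325.0, 410.0, 520.0, 590.0, 700.0]

b0, b1 = fit_simple_linear_regression(temperature, sales)
best = sum_squared_residuals(temperature, sales, b0, b1)
# Any other line, for example a slightly steeper one, has a larger sum of squares.
steeper = sum_squared_residuals(temperature, sales, b0, b1 + 1.0)
print(b0, b1, best < steeper)
```

The slope b1 reads as the change in sales per one unit change in temperature, which is the interpretation of β1 given earlier.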

Beyond simple linear regression:

1. In real-world problems it is often useful to have more than one independent variable. In that case, the model is extended as,

    Y = β0  +  β1 X1  +  β2 X2  +  ... +  βn Xn 

This is called multiple linear regression. In this case Y is the dependent variable and X is a set of independent variables X1, X2, ..., Xn.
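As a rough sketch of how the coefficients can be found, the Python below solves the least squares normal equations (XᵀX)β = XᵀY with plain Gaussian elimination. The helper names and the data are made up for illustration; real projects would normally use a numerical library instead.

```python
def solve_linear_system(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting.

    Assumes the system has a unique solution (no zero pivot).
    """
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_multiple_linear_regression(rows, ys):
    """rows: list of [X1, ..., Xn] observations. Returns [b0, b1, ..., bn]."""
    design = [[1.0] + [float(v) for v in row] for row in rows]  # leading 1 for b0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from Y = 1 + 2*X1 + 3*X2.
rows = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
ys = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]

coefficients = fit_multiple_linear_regression(rows, ys)
print(coefficients)
```

Since the data lie exactly on a plane, the fit should recover coefficients very close to 1, 2 and 3, up to floating point rounding.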


2. If we measure the error in terms of both X and Y, instead of in Y alone, then it is called Deming regression.
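A minimal Python sketch of the Deming fit is given below. It uses the usual closed-form slope, which needs an assumed ratio delta between the error variance in Y and the error variance in X; that parameter and the function name are not from this post. With delta = 1 the fit minimizes perpendicular distances (orthogonal regression).

```python
import math


def fit_deming(xs, ys, delta=1.0):
    """Return (b0, b1) for Deming regression.

    delta is the assumed ratio: (error variance in Y) / (error variance in X).
    Requires a non-zero sample covariance between xs and ys.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
    s_yy = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    diff = s_yy - delta * s_xx
    b1 = (diff + math.sqrt(diff ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy)
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up points that lie exactly on Y = 1 + 2X, so any sensible fit returns that line.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
b0, b1 = fit_deming(xs, ys)
print(b0, b1)
```

On noisy data the Deming slope is generally steeper in magnitude than the ordinary least squares slope, because part of the scatter is attributed to X.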


3. Sometimes the relationship between the dependent variable and the independent variable is not linear. We can extend linear regression to such data by adding powers of X, like,

     Y = β0  +  β1 X  +  β2 X²

In this case it is called polynomial regression. Note: Polynomial regression is one of the ways to deal with a non-linear relationship.
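Although the curve is not a straight line, the model is still linear in the coefficients, so least squares works unchanged once X and X² are treated as two separate predictors. Here is a small Python sketch under that view; the helper names and the data are invented for illustration.

```python
def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting; assumes a unique solution."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_polynomial(xs, ys, degree=2):
    """Least squares fit of Y = b0 + b1*X + ... + bd*X^d. Returns [b0, ..., bd]."""
    # Each row holds the powers of one observation: [1, x, x^2, ...].
    design = [[float(x) ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from Y = 1 - 2X + 0.5X^2.
xs = [-2, -1, 0, 1, 2, 3]
ys = [1 - 2 * x + 0.5 * x ** 2 for x in xs]

coefficients = fit_polynomial(xs, ys, degree=2)
print(coefficients)
```

The fit should return values very close to 1, -2 and 0.5, up to floating point rounding.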


4. If we have a binary dependent variable, then we can use logistic regression (also called logit regression).
In that case it is called binomial logistic regression. Note: Logistic regression is usually treated as a classifier.
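As a rough illustration, the Python sketch below fits a one-variable logistic regression by plain gradient descent on the log loss. The data (hours studied versus pass or fail), the learning rate and the number of steps are all made up, and gradient descent is just one simple way to fit the model.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, learning_rate=0.1, steps=5000):
    """Return (b0, b1) for P(Y = 1) = sigmoid(b0 + b1 * X)."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad0 = 0.0
        grad1 = 0.0
        for x, y in zip(xs, labels):
            p = sigmoid(b0 + b1 * x)
            grad0 += p - y          # derivative of the log loss w.r.t. b0
            grad1 += (p - y) * x    # derivative of the log loss w.r.t. b1
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


# Hypothetical data: hours studied and whether the student passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(hours, passed)
p_low = sigmoid(b0 + b1 * 0.5)   # predicted probability of passing after 0.5 hours
p_high = sigmoid(b0 + b1 * 5.0)  # predicted probability of passing after 5 hours
print(b0, b1, p_low, p_high)
```

The model outputs a probability; turning it into a class label (for example, "pass" when the probability is above 0.5) is what makes it usable as a classifier.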


5. Binomial logistic regression is generalized to multinomial logistic regression, also called multinomial regression. It can deal with a dependent variable that has three or more categories.
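One way to see the generalization is through the softmax function, which turns one score per category into probabilities. The sketch below only shows that step, not a full model fit, and the scores are arbitrary numbers. With two categories it reduces to the sigmoid used in the binomial case.

```python
import math


def softmax(scores):
    """Turn one score per category into probabilities that sum to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Arbitrary scores for three categories.
three_class = softmax([2.0, 1.0, 0.1])
# Two categories with scores z and 0 give the same probability as sigmoid(z).
two_class = softmax([1.5, 0.0])

print(three_class, sum(three_class))
print(two_class[0], sigmoid(1.5))
```

In a multinomial model each category gets its own linear score (its own set of coefficients), and softmax converts those scores into category probabilities.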


Summary:
  • Linear regression is used when a dependent variable has a linear relation with the independent variable(s).
  • Least squares is the technique we use for data fitting in linear regression.
  • Linear regression can be simple linear regression, with only one independent variable, or multiple linear regression, with more than one independent variable.
  • Error in linear regression is measured only in terms of Y.
  • Residuals are not the same as errors.

Feel free to suggest additions in the comment section.

3 comments:

  1. Thank you so much for this post. The level of detail you gave helped to take this subject to the next level. So, I have a question. Well, the line equation is defined by y = ax + b where a = (y2-y1)/(x2-x1). In linear regression we have the same structure y' = B1x + B0. However, B1 is defined as: sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))^2). Why does X need to show up in the numerator of B1? Why not sum((y-mean(y))^2) / sum((x-mean(x))^2), which would be closer to the linear equation? I know the analytic answer, I did the RSS calculus, but I don't have the intuition behind it... Thank you so much!

  2. scores in math of 10 students
    30, 20, 54, 10, 80, 90, 75, 66, 17, 95 (mean = 53.7)
    Please tell meaning of gap between mean score and actual score and what it implies?
    Prof P.K.Pattnaik
