# Regression

## Description

Regression is a method used to describe the relationship between two variables and to predict one variable from another (if you know one variable, then how well can you predict a second variable?).

Whereas for correlation the two variables need to have a Normal distribution, this is not a requirement for regression analysis. The variable X does not need to be a random sample with a Normal distribution (the values for X can be chosen by the experimenter). However, the variability of Y should be the same at each level of X.

## Required input

- Select the variables of interest.
- Optionally select a filter to include a subset of cases.

## Regression equation

SciStat.com offers a choice of 5 different regression equations (x represents the independent variable and y the dependent variable):

| Equation | Curve type |
|---|---|
| y = a + b x | straight line |
| y = a + b log(x) | logarithmic curve |
| log(y) = a + b x | exponential curve |
| log(y) = a + b log(x) | geometric curve |
| y = a + b x + c x² | quadratic regression (parabola) |
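
The first four equations are all straight-line fits after a log transform of x, y, or both. The following is a minimal sketch of that idea, not SciStat.com's implementation: the function names, the curve labels used as dictionary keys, and the choice of the natural logarithm are assumptions made for illustration. The quadratic fit is left out because it needs a three-coefficient solve.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def identity(v):
    return v


# Hypothetical labels: each entry is (transform for x, transform for y).
# The base of the logarithm is an assumption (natural log here).
TRANSFORMS = {
    "straight line": (identity, identity),
    "logarithmic": (math.log, identity),
    "exponential": (identity, math.log),
    "geometric": (math.log, math.log),
}


def fit_curve(kind, xs, ys):
    """Fit one of the four transformable curves; returns (a, b)."""
    tx, ty = TRANSFORMS[kind]
    return fit_line([tx(x) for x in xs], [ty(y) for y in ys])


# Made-up data that follows log(y) = 1 + 0.5 x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [math.exp(1.0 + 0.5 * x) for x in xs]
a, b = fit_curve("exponential", xs, ys)
print(a, b)
```

Because the logarithmic, exponential and geometric curves require log(x) or log(y), they can only be fitted when the transformed variable is strictly positive.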

### Options

**Residuals**: you can select a test for Normal distribution of the residuals.

## Graph options

- **Scatter diagram**: show a scatter diagram with the regression line.
- **Draw line of equality**: draw the line of equality (y = x) in the graph.
- **Heat map**: display a heat map, in which background color coding indicates the density of points, suggesting clusters of observations.

## Results

The results for Regression include:

**Sample size**: the number of (selected) data pairs

**Coefficient of determination R²**: this is the proportion of the variation in the dependent variable explained by the regression model. It can range from 0 to 1 and is a measure of the goodness of fit of the model.

Note: SciStat.com does not report the coefficient of determination for regression through the origin, because it does not have a good interpretation in that model (see Eisenhauer, 2003).

**Residual standard deviation**: the standard deviation of the residuals (residuals = differences between observed and predicted values)
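
As an illustration of these two statistics, here is a small sketch for the straight-line model. The data are made up, and dividing the residual sum of squares by n − 2 (two fitted coefficients) is the usual convention, assumed here rather than taken from SciStat.com.

```python
import math

# Made-up example data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

# Least-squares straight line y = a + b*x.
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
b = sxy / sxx
a = mean_y - b * mean_x

# Residuals: observed minus predicted values.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

ss_res = sum(r ** 2 for r in residuals)      # unexplained variation
ss_tot = sum((y - mean_y) ** 2 for y in ys)  # total variation

r_squared = 1 - ss_res / ss_tot
residual_sd = math.sqrt(ss_res / (n - 2))    # assumes n - 2 degrees of freedom

print(round(r_squared, 3), round(residual_sd, 3))
```

For these five points the fitted line is y = 2.2 + 0.6 x, which explains 60% of the variation in y.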

**The regression equation**: the selected equation with the calculated values for intercept *a* and slope *b* (and for a parabola a third coefficient *c*). E.g.

y = *a* + *b* x

The standard errors are given for the intercept *a* and the slope *b*, followed by the t-value and the P-value for the hypothesis that the coefficient is equal to 0. If a P-value is low (e.g. less than 0.05), then you can conclude that the corresponding coefficient is different from 0.
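
The standard errors and t-values can be sketched with the textbook formulas for a straight-line fit. This is an illustration on made-up data, not SciStat.com's code. The P-value is omitted because it requires the t distribution with n − 2 degrees of freedom, which the Python standard library does not provide.

```python
import math

# Made-up example data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
b = sxy / sxx            # slope
a = mean_y - b * mean_x  # intercept

# Residual standard deviation with n - 2 degrees of freedom.
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(ss_res / (n - 2))

# Textbook standard errors for slope and intercept.
se_b = s / math.sqrt(sxx)
se_a = s * math.sqrt(1 / n + mean_x ** 2 / sxx)

# t-values for the hypotheses a = 0 and b = 0.
t_a = a / se_a
t_b = b / se_b

print(round(se_a, 4), round(t_a, 4))
print(round(se_b, 4), round(t_b, 4))
```

Each t-value is then compared with a t distribution on n − 2 degrees of freedom to obtain the two-sided P-value.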

Note that when you use the regression equation for prediction, you may only apply it to values in the range of the actual observations. E.g. when you have calculated the regression equation for height and weight for school children, this equation cannot be applied to adults.
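
One way to enforce this restriction in your own calculations is to refuse predictions outside the observed range of x. The helper below is a hypothetical sketch; its name and behavior are assumptions and are not part of SciStat.com.

```python
def predict_within_range(a, b, x, observed_xs):
    """Predict y = a + b*x, but only for x inside the observed range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if not lowest <= x <= highest:
        raise ValueError(
            f"x = {x} is outside the observed range [{lowest}, {highest}]"
        )
    return a + b * x


# Made-up coefficients and observations.
observed_xs = [1, 2, 3, 4, 5]
y_inside = predict_within_range(2.2, 0.6, 3, observed_xs)

try:
    predict_within_range(2.2, 0.6, 10, observed_xs)
    extrapolation_blocked = False
except ValueError:
    extrapolation_blocked = True

print(y_inside, extrapolation_blocked)
```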

**Analysis of variance table**, with F-ratio and P-value.

The analysis of variance table divides the total variation in the dependent variable into two components, one which can be attributed to the regression model (labeled Regression) and one which cannot (labeled Residual). If the P-value for the F-test is small (less than 0.05), then the hypothesis that there is no (linear) relationship can be rejected.
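
The decomposition can be sketched for a straight-line fit as follows. The data are made up, and the degrees of freedom (1 for Regression, n − 2 for Residual) are the standard values for this model, assumed here. The P-value is not computed because it requires the F distribution.

```python
# Made-up example data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
b = sxy / sxx
a = mean_y - b * mean_x

# Total variation split into Regression and Residual parts.
ss_total = sum((y - mean_y) ** 2 for y in ys)
ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
ss_regression = ss_total - ss_residual

df_regression = 1      # one slope coefficient
df_residual = n - 2    # n observations minus two fitted coefficients

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

print(round(ss_regression, 3), round(ss_residual, 3), round(f_ratio, 3))
```

For a straight line with a single independent variable, the F-ratio equals the square of the t-value of the slope, so both tests lead to the same conclusion.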

### Analysis of residuals
