# Correlation

**Correlation** is a measure of the relationship between two variables. Variables can be related in several ways, for example linearly or non-linearly, and the strength of the relationship can vary. **Pearson's correlation coefficient**, or PCC, is the most common measure of the degree of linear correlation between two variables. The PCC between two given variables, denoted $r$, is a number between $-1$ and $+1$ inclusive. It indicates how strongly the variables are positively correlated (close to $+1$), negatively correlated (close to $-1$), or not correlated (close to $0$). A value of $+1$ means a **perfect positive** relation between the variables, that is, they are related by an increasing linear function. A value of $-1$ means a **perfect negative** relation between the variables, that is, they are related by a decreasing linear function. Although possible, perfect correlations are extremely rare in real data. A value of $0$ means that neither variable has any linear predictive strength for the other.

Jim is not only an accomplished baseball player but also a talented tenth-grade mathematics student. His interest in analyzing performance began when his physics teacher explained the optimal way to hit a ball for maximum distance. Since each baseball player has unique skills, he wants to find out whether he is better suited to be a hitter or a slugger. As a wise mathematician/baseball player would, he keeps data on each of his batting turns from last season, recording the approximate batting strength and the outcome: **ground hit**, **home run**, or **out**. He knows that consistent hits require more direction than strength, whereas home runs require more strength, so he builds three data sets, each with a graph and a PCC: strength vs. home runs, strength vs. hits, and strength vs. outs.

If his decision requires only one of these three comparisons, which one do you think will give him the best measure of his abilities?

## Definitions

**Pearson's correlation coefficient: general definition.** The Pearson's correlation coefficient between two random variables $X$ and $Y$ with respective means $\mu_X$ and $\mu_Y$ and standard deviations $\sigma_X$ and $\sigma_Y$ is

$r_{XY} = \frac{E\big[(X-\mu_X)(Y-\mu_Y)\big]}{\sigma_X \sigma_Y},$

where $E[ \cdot ]$ is the expected value function.

**Pearson's correlation coefficient: sample definition.** The Pearson's correlation coefficient between two samples of size $n$ with values $x_i$ and $y_i\, (i=1, 2, \ldots ,n)$ is

$r_{xy} = \frac{\sum{(x_i - \bar{x})(y_i - \bar{y})}}{\sqrt{\sum{(x_i - \bar{x})^2}} \sqrt{\sum{(y_i - \bar{y})^2}}},$

where $\bar{x}$ is the average of the sample values $x_i$ and similarly for $y$.
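The sample definition translates almost term by term into code. The following is a minimal Python sketch using only the standard library; the function name `pearson_r` and the example data are illustrative assumptions, not part of the definition.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two samples of equal size n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the sample means.
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator: product of the square roots of the summed squared deviations.
    den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
           * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    if den == 0:
        raise ValueError("PCC is undefined when either sample is constant")
    return num / den

xs = [1, 2, 3, 4, 5]
r_increasing = pearson_r(xs, [2 * x + 1 for x in xs])    # points on an increasing line
r_decreasing = pearson_r(xs, [-3 * x + 10 for x in xs])  # points on a decreasing line
print(r_increasing, r_decreasing)  # equal to +1 and -1 up to floating-point rounding
```

Points that lie exactly on an increasing or decreasing line give the perfect correlations $+1$ and $-1$, up to rounding in floating-point arithmetic.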

The basic difference between these definitions is how much is known about the data. The general definition relates two random variables whose means and standard deviations are known. In the study of statistics, knowledge of the distribution of the values is often assumed. When quantifying qualitative variables, it is important to know certain statistical measures (parameters) in order to build models to fit the given data. The sample definition comes into play when relationships among variables are estimated from observed data.

## Interpretation

A correlation coefficient that is close to $+1$ indicates a strong **positive correlation** between the variables. Roughly, when one variable increases, the other tends to increase at a nearly constant rate. A strong positive relation measured by the PCC does not imply cause and effect from one variable to the other, but it gives reason to further analyze a hypothesized statement. For example, if a PCC close to $+1$ is found in a study of sugar intake (one variable) compared to body fat percentage (the other variable) taken from a group of one hundred randomly selected people, then there is a positive correlation. The PCC relates these variables but does not imply that increases in body fat percentage are due to increases in sugar intake alone. Further, and stronger, tests would be needed to draw specific and actionable conclusions.

A correlation coefficient that is close to $-1$ indicates a strong **negative correlation**. This does not mean the variables are not correlated; on the contrary, the only difference from the positive case is the sign of the trend: when one variable increases, the other tends to decrease at a nearly constant rate. For example, if a PCC close to $-1$ is found in a study of physical activity (one variable) compared to stress levels (the other variable) taken from a group of one hundred randomly selected people, then there is a negative correlation. The PCC relates these variables but does not imply that decreases in stress levels are due to increases in physical activity alone. Further, and stronger, tests would be needed to draw specific and actionable conclusions.

A correlation coefficient that is close to $0$ indicates that the variables are **not correlated**, meaning there is no linear trend by which one variable predicts the other. For instance, if the data scatter around a horizontal line, increasing one variable leaves the other virtually unchanged on average. PCC is not a measure of independence: independent variables have a PCC of zero, but a PCC of zero does not guarantee independence, since variables related in a non-linear way (for example, $Y=X^2$ with $X$ distributed symmetrically about $0$) can also have zero correlation. When variables exhibit erratic patterns, their PCC will be close to zero.
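A PCC of zero does not by itself mean that the variables are unrelated. As a minimal sketch, the Python snippet below applies the sample definition to points on the parabola $y=x^2$, chosen symmetrically about $0$. The data set and the helper name `pearson_r` are illustrative assumptions.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (sample definition)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
           * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return num / den

# y is completely determined by x, yet the relation is not linear.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_parabola = pearson_r(xs, ys)
print(r_parabola)  # the positive and negative halves cancel, so r is zero
```

Here $y$ is a deterministic function of $x$, so the variables are as dependent as they can be, but the contributions from negative and positive $x$ cancel in the numerator and the coefficient vanishes.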

## Real-life Examples

Examples of positive correlation in real life:

- The harder a person studies, the better their grade.
- The higher the temperature, the higher the ice cream sales.
- The lower the volume of my earphones, the lower my risk of eardrum damage.
- The more water flows into a full bathtub, the more water ends up on the bathroom floor.

Examples of negative correlation in real life:

- The more stressed I feel, the less fun I have.
- The less unhealthy the food that I eat, the higher my good cholesterol levels.
- The more air I suck out of a straw, the lower the pressure inside the straw.
- The less money I spend, the more money I have left in the long run.

## Correlation vs Causation

It is important to stress that strong correlation does not imply causation. The fact that two variables (observations) are strongly correlated according to the PCC does not mean that a change in one variable will cause a change in the other; the association can be purely coincidental, or both variables can depend on something else. As an example, suppose that, in a city, the number of pigeons at a given location is strongly correlated with the number of people at that location at noon. If the number of people at that location dramatically decreased, the number of pigeons could be expected to decrease as well; this analysis makes perfect sense. However, one way to explain this phenomenon might be the failure of the city to clean its streets and the places where people meet to have lunch. Another explanation could be that a high percentage of people feed the pigeons at noon with their leftover bread. These explanations depend on other, more complicated assumptions and not solely on the number of people at this location.

## Hypothesis Testing

Data **correlation** can be used to test whether a certain hypothesis is true or not.

For example, we may want to show that there is a direct relationship between the height of customers in the store and whether they buy beans.

Note, however, that, as mentioned above, **correlation** does not imply **causation**. For example, suppose we show that there is a correlation between the two, i.e. the tallest customers bought the most beans. This doesn't imply that they bought the most beans because they were tall (e.g. the beans were on the top shelf) or that they were tall because they bought the most beans (e.g. beans were found to have a super growth ingredient in them). There could also be a hidden parameter affecting both (e.g. basketball players are tall, and a basketball coach down the street encourages players to eat beans). Further analysis would need to be done to determine causality.
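The effect of a hidden parameter can be illustrated with a small simulation. The sketch below is a hypothetical setup, not data from any study: a hidden variable `z` drives both `x` and `y`, while neither `x` nor `y` influences the other. All names, the noise level, and the seed are assumptions made for illustration.

```python
import math
import random

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (sample definition)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
           * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return num / den

random.seed(0)  # fixed seed so the run is repeatable
n = 2000

# Hidden variable that affects both observed variables.
z = [random.gauss(0, 1) for _ in range(n)]
# x and y each depend on z plus their own independent noise;
# x never enters the formula for y, and y never enters the formula for x.
x = [zi + random.gauss(0, 0.5) for zi in z]
y = [zi + random.gauss(0, 0.5) for zi in z]

r_xy = pearson_r(x, y)
print(r_xy)  # strongly positive even though x does not cause y
```

Under these assumptions the theoretical correlation between `x` and `y` is $1/1.25 = 0.8$, so the sample value is strongly positive although the only link between the two is the shared hidden variable.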

Further, in order for the data to be meaningful, you would need to collect a statistically sufficient sampling of the data. For example, if you saw two customers, and one bought beans and one didn't (and one was tall and the other wasn't), this clearly wouldn't be statistically sufficient to be able to make any meaningful conclusion.
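A short calculation with the sample definition shows how uninformative two observations are. Take any two points $(x_1, y_1)$ and $(x_2, y_2)$ with $x_1 \neq x_2$ and $y_1 \neq y_2$, and write $d = x_1 - x_2$ and $e = y_1 - y_2$ (symbols introduced here only for this illustration). Then $x_1 - \bar{x} = \frac{d}{2}$, $x_2 - \bar{x} = -\frac{d}{2}$, and similarly for $y$, so

$r_{xy} = \frac{\frac{d}{2}\cdot\frac{e}{2} + \left(-\frac{d}{2}\right)\left(-\frac{e}{2}\right)}{\sqrt{\frac{d^2}{4}+\frac{d^2}{4}}\,\sqrt{\frac{e^2}{4}+\frac{e^2}{4}}} = \frac{de/2}{|d|\,|e|/2} = \frac{de}{|de|} = \pm 1.$

Two observations therefore always produce a perfect correlation, whatever the true relationship is, which is why such a sample supports no meaningful conclusion.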

Finally, if we say there is a strong correlation, strong is a relative term. The higher the magnitude of the correlation coefficient $($always between $-1$ and $1$ inclusive$),$ the stronger the correlation. A correlation coefficient close to $-1$ indicates a strong negative correlation, and a number close to $1$ indicates a strong positive correlation.

Consider the following example, where we'd like to determine whether our hypothesis, $H_1,$ is true.

The null hypothesis $($denoted $H_0)$ is a statement that is assumed to be true. If the null hypothesis is rejected, then there is enough evidence (statistical significance) to support the alternative hypothesis $($denoted $H_1).$ Before doing any test for significance, both hypotheses must be clearly stated and must not overlap, i.e. they must be mutually exclusive statements.

We would like to find evidence that there is a linear relationship between Miami's annual average temperature $(X)$ and the average number of Atlantic Basin hurricanes $(Y).$ The table below is a sample of $n=14$ observations taken from the years 1980 to 1999.

| Average Annual Temperature in Miami | Average Number of Annual Hurricanes |
|---|---|
| 69.4 | 5 |
| 69.8 | 5 |
| 69.9 | 9 |
| 70.1 | 7 |
| 70.2 | 3 |
| 70.4 | 4 |
| 70.5 | 4 |
| 70.9 | 9 |
| 71 | 7 |
| 71.2 | 6 |
| 71.7 | 5 |
| 71.9 | 4 |
| 72.5 | 10 |
| 72.6 | 8 |

Using this sample, a hypothesis test will be performed at significance level $\alpha=0.05$ using the $t$-distribution, with the following hypotheses:

- $H_0:$ There is no linear relation between $X$ and $Y,$ that is, the population coefficient $r_{XY}=0$.
- $H_1:$ There is a linear relation between $X$ and $Y,$ that is, the population coefficient $r_{XY} \neq 0$.

Compute the test statistic $t^*=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$, where $r$ is the sample PCC and $n$ is the sample size. Using the sample PCC of $r=0.3407$ and $n=14,$ we get $t^*=\frac{0.3407\sqrt{12}}{\sqrt{1-0.3407^2}} \approx 1.255$.

Proceed to perform the hypothesis test using the $t$-distribution with $n-2=12$ degrees of freedom and $t$-score 1.255. We must obtain the $p$-value and compare it to the significance level $\alpha=0.05$. The $p$-value is the probability, assuming $H_0$ is true, of obtaining a test statistic at least as extreme as the one observed. Using a $t$-distribution table, we get that the probability of getting a test statistic smaller than 1.255 is approximately 0.883. Therefore, the probability of getting a test statistic greater than 1.255 is approximately 0.117; since the alternative hypothesis is two-sided, we multiply by 2 to get a $p$-value of approximately 0.233. Since the $p$-value is greater than 0.05, we fail to reject the null hypothesis $H_0$.

Conclusion: There is not enough evidence from this sample to support the claim that there is a linear relationship between Miami's annual average temperature and the average number of Atlantic Basin hurricanes.
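The test can also be reproduced numerically from the table. The following is a minimal Python sketch using only the standard library. The helper names (`pearson_r`, `t_pdf`, `t_upper_tail`) are illustrative assumptions, and the tail probability is approximated with Simpson's rule instead of a $t$-distribution table, so the last digits may differ slightly from table values.

```python
import math

temps = [69.4, 69.8, 69.9, 70.1, 70.2, 70.4, 70.5,
         70.9, 71, 71.2, 71.7, 71.9, 72.5, 72.6]
hurricanes = [5, 5, 9, 7, 3, 4, 4, 9, 7, 6, 5, 4, 10, 8]

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient (sample definition)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = (math.sqrt(sum((x - x_bar) ** 2 for x in xs))
           * math.sqrt(sum((y - y_bar) ** 2 for y in ys)))
    return num / den

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t].

    By symmetry P(T > 0) = 0.5, so the tail is 0.5 minus the area on [0, t].
    `steps` must be even.
    """
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

n = len(temps)
r = pearson_r(temps, hurricanes)
t_star = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)   # test statistic
p_value = 2 * t_upper_tail(abs(t_star), n - 2)         # two-tailed p-value
alpha = 0.05
reject_h0 = p_value < alpha

print(f"r = {r:.4f}, t* = {t_star:.4f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

The script evaluates the test statistic exactly as written in the formula, with the square root in the denominator, and compares the resulting two-tailed $p$-value to $\alpha$.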