"Correlation is not causation" is a frequently cited principle when discussing the caution that must be taken when drawing causal inferences from observational data. Conflating correlation with causation is one of the easier fallacies to spot, and if we ever catch ourselves making that fallacy ourselves, there are a host of (silly) examples that demonstrate the flaw in our thinking.
Many problems also arise from relying on insufficient data. When only a few data points are used, trends that appear supported by the data may not pan out. These two limitations are well known; however, there is much more to be wary of.
Below are fallacious inferences (made from complete and accurate data) which may seem obviously true (almost tautological), but turn out to be incorrect. This will underscore the importance of rigorously defining our models and asking questions in a precise way.
The mistaken belief: Let's say I'm analyzing a population that can be divided into a handful of different subpopulations; maybe I'm dividing voters by state, or cancer patients by cancer type. If a correlation holds in every single subpopulation, then it must also hold in the population writ large.
Why it's wrong: A correlation can hold in every subpopulation and yet reverse in the overall population. This phenomenon is called Simpson's Paradox, and the Wikipedia page includes many insightful examples and explanations that I recommend you read. The reversal arises when the subpopulations differ in their typical \(X\) and \(Y\) values, so that the differences between groups outweigh the trend within each group. Consider the example below. The thick line is the regression line for the data shown. The green dots show a negative correlation between the quantities plotted on the \(X\) and \(Y\) axes for the third subpopulation. Click on the radio buttons for each of the other subpopulations; each subpopulation is denoted with a different color. You will find that within each subpopulation the correlation is negative. However, when all subpopulations are pooled, the correlation is positive.
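To make the reversal concrete, here is a minimal simulation sketch. The number of groups, the group centers, the slopes, and the noise levels are all made-up values chosen for illustration; they are not taken from the interactive example. Each group has a downward trend internally, but the group centers climb along an upward diagonal.

```python
import numpy as np

def simulate_simpson(n_groups=4, n_per_group=200, seed=0):
    """Hypothetical data: negative trend inside each group, rising group centers."""
    rng = np.random.default_rng(seed)
    xs, ys, labels = [], [], []
    for g in range(n_groups):
        cx, cy = 3.0 * g, 3.0 * g  # assumed centers, moving up and to the right
        x = cx + rng.normal(0.0, 1.0, n_per_group)
        # Within the group, y falls as x rises (slope -0.8 is an arbitrary choice).
        y = cy - 0.8 * (x - cx) + rng.normal(0.0, 0.5, n_per_group)
        xs.append(x)
        ys.append(y)
        labels.append(np.full(n_per_group, g))
    return np.concatenate(xs), np.concatenate(ys), np.concatenate(labels)

x, y, group = simulate_simpson()

# Correlation inside each subpopulation, then over the pooled data.
within = [np.corrcoef(x[group == g], y[group == g])[0, 1] for g in np.unique(group)]
pooled = np.corrcoef(x, y)[0, 1]

print("within-group correlations:", np.round(within, 2))
print("pooled correlation:", round(pooled, 2))
```

With these settings every within-group correlation comes out negative while the pooled correlation is positive, because the spread between group centers dominates the spread inside each group.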
The mistaken belief: Say I've collected many features, call them \(X_1, X_2, X_3,\) etc., about each individual in a population. I'm trying to predict some outcome, \(Y\), which for each individual in the population can either be TRUE (i.e. the outcome described by \(Y\) did happen) or FALSE (the outcome did not happen). Say I observe that \(X_1\) alone is a statistically good predictor of \(Y\), i.e. when \(X_1\) is very large, \(Y\) is almost certain, and conversely, when \(X_1\) is small, \(Y\) is rare. I want to improve my predictive accuracy, so I incorporate more features \(X_2, \ldots, X_k\). These other features may turn out to be much better predictors of \(Y\) and supplant \(X_1\). However, I will never discover that \(X_1\) is a predictor of \(Y\) not happening, i.e. that all else equal, higher \(X_1\) actually leads to a lower likelihood of \(Y\).
Why it's wrong: Let's focus on the simplest case where I only add one additional feature (i.e., \(k=2\)). We can formalize the idea of predicting true-or-false outcomes by considering a separating hyperplane. If I make a scatter plot for all the individuals in my population, I can color the data points orange if \(Y\) does occur and red otherwise. I then seek a line, called a separator, which separates the red points from the orange points, i.e. all the data points for which \(Y\) occurs should (mostly) fall on one side of the line, and the data points for which \(Y\) does not occur on the other. In the dataset below, see that the orange points mostly fall on the right side of the separator (thick black line) when just \(X_1\) is used, indicating that high values of \(X_1\) correspond with \(Y\) being more likely. Now see that by adding \(X_2\), it turns out that the orange points mostly fall on the left side of the line, indicating that high values of \(X_1\) correspond with \(Y\) being less likely.
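The sign flip can be reproduced numerically. The sketch below is not the dataset from the plot: the data-generating rule is an assumption made up for illustration, and an ordinary least-squares fit of the 0/1 outcome stands in for the separator. Here \(X_1\) is a noisy copy of \(X_2\), and the outcome depends positively on \(X_2\) but negatively on \(X_1\).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Assumed (hypothetical) data-generating process.
x2 = rng.normal(size=n)
x1 = x2 + 0.5 * rng.normal(size=n)            # x1 mostly tracks x2
score = 2.0 * x2 - 1.0 * x1 + 0.3 * rng.normal(size=n)
y = (score > 0).astype(float)                 # 1.0 if Y occurs, else 0.0

def least_squares_weights(features, target):
    """Fit target ~ intercept + features; return the feature weights only."""
    design = np.column_stack([np.ones(len(target)), *features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]

w_alone = least_squares_weights([x1], y)       # x1 used by itself
w_joint = least_squares_weights([x1, x2], y)   # x1 and x2 together

print("weight on x1 alone:", round(w_alone[0], 3))
print("weights on (x1, x2) jointly:", np.round(w_joint, 3))
```

Used alone, \(X_1\) gets a positive weight because it borrows the predictive power of \(X_2\). Once \(X_2\) is in the model, the weight on \(X_1\) turns negative, which matches the separator tilting the other way in the two-feature plot.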
The mistaken belief: If there is a positive correlation between \(X\) and \(Y\), and also a positive correlation between \(Y\) and \(Z\), then I can conclude that there will be a positive correlation between \(X\) and \(Z\), although the correlation may not be very strong.
Why it's wrong: Actually, \(X\) and \(Z\) may be negatively correlated. There's an analogy to geometry here: pretend each feature is a ray. Then the correlation between two features plays the role of the cosine of the angle between them, so saying that two features are positively correlated means that the angle between them is acute. If the angle between the first and second ray is 60 degrees, and the angle between the second and third is 50 degrees, the angle between the first and third may be as large as 110 degrees. That angle is obtuse, which corresponds to a negative correlation. Formally, this analogy is captured by something called a pre-Hilbert space, using covariance as the inner product.
The specific ways this can happen are elaborated upon in the next two sections. Play around with the interactive below. Notice that plotting \(X\) vs \(Y\) and \(Y\) vs \(Z\) both reveal positive correlations, but that plotting \(X\) vs \(Z\) reveals a negative correlation. This negative correlation isn't due to chance: refresh the page to generate new data, and there will always be a negative correlation here.
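The ray analogy can be turned into a small numerical check. In this sketch (the construction and the helper name `feature_at` are illustrative assumptions, not the code behind the interactive), each feature is a mix of two independent standard normal sources, with mixing weights given by a direction at 0, 60, and 110 degrees. The correlation between two such features is approximately the cosine of the angle between their directions.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
sources = rng.normal(size=(2, n))  # two independent standard normal sources

def feature_at(angle_deg):
    """Hypothetical feature pointing along the given direction in source space."""
    theta = np.deg2rad(angle_deg)
    return np.cos(theta) * sources[0] + np.sin(theta) * sources[1]

x, y, z = feature_at(0), feature_at(60), feature_at(110)

r_xy = np.corrcoef(x, y)[0, 1]  # near cos(60 degrees), positive
r_yz = np.corrcoef(y, z)[0, 1]  # near cos(50 degrees), positive
r_xz = np.corrcoef(x, z)[0, 1]  # near cos(110 degrees), negative

print(round(r_xy, 2), round(r_yz, 2), round(r_xz, 2))
```

Both adjacent pairs are positively correlated, yet the outer pair is negatively correlated, exactly as the 60 + 50 = 110 degree picture suggests.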
The mistaken belief: If I make a scatter plot and observe that the regression line has a positive slope, then if I plot the same exact data on a log-log scatter plot I should also notice a positive slope.
Why it's wrong: This is a special case of the intransitivity of correlation. \(X\) is positively correlated with \(\log (X)\), and \(Y\) is positively correlated with \(\log (Y)\), and \(X\) and \(Y\) are positively correlated with each other, but \(\log (X)\) and \(\log (Y)\) are negatively correlated. The \(\log\) function "compresses" large values and "expands" small values: using natural logarithms, \(\log(100,000)\) and \(\log(1,000,000)\) differ by only about 2.3, while \(\log(0.001)\) and \(\log(0.00001)\) differ by about 4.6. In this example, applying \(\log\) compresses the gaps between those \(Y\) values which suggested a positive trend, and widens the gaps between those \(Y\) values which suggested a negative trend.
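A tiny hand-built dataset shows the same flip. The six points below are invented for illustration and are not the data in the plot: five of them lie on the curve \(Y = 1/X\), and one large point sits far up and to the right.

```python
import numpy as np

# Five points on y = 1/x plus one large outlier at (1000, 1000); values are made up.
x = np.array([0.01, 0.1, 1.0, 10.0, 100.0, 1000.0])
y = np.array([100.0, 10.0, 1.0, 0.1, 0.01, 1000.0])

r_linear = np.corrcoef(x, y)[0, 1]                # raw scale
r_loglog = np.corrcoef(np.log(x), np.log(y))[0, 1]  # log-log scale

print("correlation on the raw scale:", round(r_linear, 3))
print("correlation on the log-log scale:", round(r_loglog, 3))
```

On the raw scale the outlier dominates and the correlation is strongly positive. After taking logs the outlier is pulled in close to the rest, the five points on \(Y = 1/X\) form a perfectly straight downward line, and the overall correlation becomes negative.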
The mistaken belief: Consider the formula CollegeBoard (approximately) uses for calculating students' SAT scores. They start at \(0\), add \(1\) point for each correct answer, subtract \(1/4\) for each incorrect answer, then multiply the result by some constant so that the range is \(600\), then shift all the scores so that the minimum is \(200\). Say there is a positive correlation between SAT scores and \(Y\). If I were to modify the formula for computing scores slightly, in a way that doesn't change the ordering of which students did better than others, then there would still be a positive correlation between SAT scores and \(Y\).
Why it's wrong: This is yet again a special case of intransitivity of correlation. The students' old scores will be correlated with their new ones, and the modification won't swap any of their rankings, but the correlation with \(Y\) still need not carry over. The same principle is at work as before: some gaps are compressed and others expanded. A similar effect can occur for any value with either an arbitrary unit or no unit. Examples of this include...