# Work In Progress

Our Elements guide is still in progress, and therefore lacks
full visual and technical assets. We hope to release them by
summer of 2020. Thanks for reading
*Lingua Franca*!

# Usage

A *correlation* is a quick visual indicator of the relationship between a model’s decision and the data it was trained on. The meaning of ‘relationship’ varies by domain, but in general users want to know which input fields contributed most to the output. Because interpreting relationships is inherently fuzzy, and because correlation cannot by itself establish causation, correlations should be used sparingly for explanatory purposes.

# Theory

Correlation analyses are promising tools in domains where one can assume largely linear relationships, or where the correlations being shown do not carry much risk to the user (i.e. misleading correlations do not have a large impact). Some versions of correlation analysis (e.g. Pearson^{[1]}) tend to overstate the importance of outliers, and generally have less explanatory power when the user is trying to detect a small-probability event.
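Pearson’s sensitivity to outliers is easy to demonstrate. The sketch below (with made-up numbers, not data from any real model) computes the coefficient for two fields with no relationship, then appends a single extreme point and recomputes:

```python
def pearson(xs, ys):
    # Plain Pearson correlation coefficient: covariance divided by
    # the product of the two standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two fields with no real relationship...
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [5, 3, 6, 4, 5, 3, 6, 4]
weak = pearson(xs, ys)            # near zero

# ...plus one extreme point, which dominates the statistic.
strong = pearson(xs + [100], ys + [100])   # near 1.0
```

A single outlier flips the verdict from ‘no relationship’ to ‘almost perfect correlation’, which is exactly the overstatement described above.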

# Outliers

However, all correlation analysis is inherently limited in explaining complex relationships: it simplifies the messiness of the real world. In many cases, a correlation analysis can be ‘gamed’ or deceived simply by the way data is collected and described. For example, say an event has two causes—weather and altitude—but your data records weather as several fields: temperature, condition, chance of rain, and so on. Your correlation analysis will ‘split’ the importance of weather among those fields, so altitude may appear to be the most important factor in most model predictions. If weather were recorded as a single field, the opposite effect might be observed.
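The splitting effect can be simulated with synthetic data. In the sketch below (an illustration with invented variables, not a real dataset), weather actually matters twice as much as altitude, but because weather is recorded as three nearly redundant fields, a least-squares analysis divides its weight three ways and altitude ends up with the single largest coefficient:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

weather = rng.normal(size=n)    # latent cause, twice as influential
altitude = rng.normal(size=n)   # latent cause, half as influential

# Weather is recorded as three nearly redundant fields.
temp = weather + rng.normal(scale=0.05, size=n)
rain = weather + rng.normal(scale=0.05, size=n)
wind = weather + rng.normal(scale=0.05, size=n)

# The event depends on weather (weight 2.0) and altitude (weight 1.0).
outcome = 2.0 * weather + 1.0 * altitude + rng.normal(scale=0.1, size=n)

X = np.column_stack([temp, rain, wind, altitude])
coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
# Weather's weight of 2.0 is split across three fields (~0.67 each),
# so altitude (~1.0) looks like the single most important field.
```

Summing the three weather coefficients recovers weather’s true weight, which is why how the data is described changes which factor appears dominant.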

# Implementation

Traditional statistical tools that predate neural networks already support factor analysis for determining correlations (e.g. information gain for decision trees^{[2]}). Modern attempts at correlation analysis for complex neural networks include LIME^{[3]}, an algorithm that perturbs the input data to determine which input features the model’s output is most sensitive to. As mentioned above, these tools all carry both theoretical and practical caveats.
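The perturbation idea can be sketched in a few lines. This is not the full LIME algorithm (which fits a local surrogate model); it is a minimal sensitivity probe, with a hypothetical stand-in model, that jitters one input field at a time and measures how much the output moves:

```python
import random

def black_box(features):
    # Hypothetical stand-in model: leans heavily on feature 0,
    # weakly on feature 1, and ignores feature 2 entirely.
    return 3.0 * features[0] + 0.5 * features[1]

def perturbation_sensitivity(model, point, scale=0.1, trials=200, seed=0):
    """Estimate how much the model's output moves when each input
    field is jittered on its own, holding the others fixed."""
    rng = random.Random(seed)
    base = model(point)
    sensitivities = []
    for i in range(len(point)):
        total = 0.0
        for _ in range(trials):
            perturbed = list(point)
            perturbed[i] += rng.gauss(0, scale)
            total += abs(model(perturbed) - base)
        sensitivities.append(total / trials)
    return sensitivities

scores = perturbation_sensitivity(black_box, [1.0, 1.0, 1.0])
# scores ranks feature 0 highest and feature 2 at zero.
```

Even this toy probe inherits the caveats above: it reports sensitivity around one specific input point, and says nothing about causation.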

# Further Resources

- *Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing* by Justin Matejka & George Fitzmaurice

# Footnotes

1. Pearson correlation coefficient on Wikipedia ↩︎
2. Information gain in decision trees on Wikipedia ↩︎
3. Local Interpretable Model-Agnostic Explanations (LIME) on O’Reilly ↩︎