# Multivariate statistical analysis and partitioning of sedimentary geochemical data sets: General principles and specific MATLAB scripts

Multivariate statistical treatments of large datasets in sedimentary geochemistry (e.g., SedDB, PetDB, VentDB) are rapidly becoming more popular as analytical and computational capabilities expand. Because geochemical datasets present a unique set of conditions (e.g., the closed array, in which concentrations sum to a constant), applying generic off-the-shelf software is not straightforward and can yield misleading results.

This paper presents annotated MATLAB scripts (and specific guidelines for their use) for Q-mode factor analysis, a constrained least squares multiple linear regression technique, and a total inversion protocol, all based on the well-known approaches taken by Dymond [1981], Leinen and Pisias [1984], Kyte et al. [1993], and their predecessors. Although investigators have used these techniques for decades, their application has been neither consistent nor transparent, because the code has remained in-house or in formats not commonly used by many of today’s researchers (e.g., Fortran). In addition to providing the annotated scripts and instructions for use, this paper discusses general principles to consider when performing multivariate statistical treatments of large geochemical datasets, provides a brief contextual history of each approach, explains their similarities and differences, and includes a sample data set with which users can test their own manipulation of the scripts.

**Figure Caption:** Q-mode varimax rotated factor scores from a fifty-five-sample dataset of sedimentary chemistry from ODP Site 1149 in the northwest Pacific Ocean. These factor scores show the weight of each element on the discrimination of each factor. The “% explained” values indicate how much of the dataset’s variability is explained by the given factor. Results from this Q-mode analysis were subsequently used to identify potential end members contributing volcanic ash to the bulk marine sediment. These end members were then used in a constrained least squares multiple linear regression to quantify the abundance of each end member in each sample. From Scudder et al. (2009, EPSL, 284, 639-648, doi:10.1016/j.epsl.2009.05.037), and using earlier versions of the scripts presented in Pisias et al. (2013, G3).
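As a rough illustration of the regression step described in the caption, the sketch below estimates end-member abundances for a single sample. It is not the published MATLAB script: it is written in Python, it assumes the only constraint is that abundances are non-negative, and it swaps in a simple projected gradient solver. The function name `nonneg_least_squares`, the end-member compositions, and the sample values are all hypothetical and chosen only for demonstration.

```python
def nonneg_least_squares(E, s, iterations=2000):
    """Minimise ||E f - s||^2 subject to f >= 0 by projected gradient descent.

    E: list of rows, one per element, one column per end member (hypothetical data).
    s: measured sample composition, one value per element.
    """
    n_elem, n_end = len(E), len(E[0])
    # Gram matrix G = E^T E and right-hand side b = E^T s
    G = [[sum(E[k][i] * E[k][j] for k in range(n_elem)) for j in range(n_end)]
         for i in range(n_end)]
    b = [sum(E[k][i] * s[k] for k in range(n_elem)) for i in range(n_end)]
    # trace(G) bounds the largest eigenvalue of G, so this step size is safe
    step = 1.0 / sum(G[i][i] for i in range(n_end))
    f = [0.0] * n_end
    for _ in range(iterations):
        grad = [sum(G[i][j] * f[j] for j in range(n_end)) - b[i]
                for i in range(n_end)]
        # gradient step followed by projection onto the non-negative orthant
        f = [max(0.0, f[i] - step * grad[i]) for i in range(n_end)]
    return f


# Invented compositions: rows are three elements, columns are two end members.
E = [[10.0, 2.0],
     [2.0, 8.0],
     [1.0, 4.0]]
# Synthetic sample built as 70% of end member 1 plus 30% of end member 2.
s = [7.6, 3.8, 1.9]

fractions = nonneg_least_squares(E, s)
print([round(x, 4) for x in fractions])
```

Because the synthetic sample is an exact mixture, the recovered abundances should lie very close to 0.7 and 0.3. Real data would not fit exactly, and the published scripts may apply additional or different constraints.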

Nicklas G. Pisias, Richard W. Murray, Rachel P. Scudder (2013), **Multivariate statistical analysis and partitioning of sedimentary geochemical data sets: General principles and specific MATLAB scripts,** *Geochemistry, Geophysics, Geosystems*. DOI: 10.1002/ggge.20247

Access the scripts at the EarthChem Library.