Normalization Example


Analysis Summary

Data analyses were implemented to normalize the LC-MS metabolomics (parts 1 and 2) data set for analytical batch effects. The data set contains 14526 measurements for 443 variables, acquired from 03/22/2013 to 06/28/2015. It includes Study samples, quality control (QC) samples and NIST laboratory reference (NIST) samples (Table 1).


Table 1. Overview of sample types.

Study   QCs    NIST
12986   1382   158

Overview of Normalization Methods

Five data normalization approaches were tested and compared with the original (Raw) data: Cubic Splines, Batch Ratio, Splines + Batch Ratio, LOESS and LOESS + Batch Ratio (Table 2). Normalization performance was evaluated based on the relative standard deviation (RSD) of each variable, within batches and across the complete study, separately for QC, NIST and Study samples.
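
The RSD calculation for a single variable can be sketched as follows. This is an illustrative Python sketch only (the analyses in this report were implemented in R), and it assumes the usual definition of RSD as 100 times the sample standard deviation divided by the mean. The function names and the example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def rsd(values):
    """Relative standard deviation in percent: 100 * sd / |mean|."""
    if len(values) < 2:
        return float("nan")  # sd is undefined for fewer than two values
    m = mean(values)
    if m == 0:
        return float("nan")  # RSD is undefined for a zero mean
    return 100.0 * stdev(values) / abs(m)


def batch_rsd(values, batches):
    """RSD of one variable computed separately within each batch."""
    groups = defaultdict(list)
    for value, batch in zip(values, batches):
        groups[batch].append(value)
    return {batch: rsd(vals) for batch, vals in groups.items()}


# Hypothetical intensities for one variable measured in two batches.
values = [10.0, 12.0, 11.0, 20.0, 22.0, 21.0]
batches = ["b1", "b1", "b1", "b2", "b2", "b2"]

per_batch = batch_rsd(values, batches)  # within-batch RSD
study_wide = rsd(values)                # RSD across the complete study
```

In this example the study-wide RSD is much larger than either within-batch RSD, because the shift between the two batches adds to the overall spread. This is the kind of batch effect the normalizations aim to remove.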


Table 2. Overview of data normalization methods.

Normalization           Description
Raw                     Original data.
Cubic Splines           Cubic splines fit to data quantiles.
Batch Ratio             QC samples used to adjust sample batch median to global study median.
Splines + Batch Ratio   Cubic splines followed by batch ratio normalization.
LOESS                   Normalization based on a locally weighted scatterplot smoothing (LOESS) model fit to QC samples and their acquisition date.
LOESS + Batch Ratio     LOESS followed by batch ratio normalization.
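
One plausible reading of the batch ratio method in Table 2 is sketched below in Python, for illustration only (the analyses in this report were implemented in R). It assumes that each value is multiplied by the ratio of the global QC median to the QC median of its batch, and that all sample types share one batch label. Both are assumptions, and the function name and example data are hypothetical.

```python
from collections import defaultdict
from statistics import median


def batch_ratio_normalize(values, batches, is_qc):
    """Scale every value so that each batch's QC median matches the global QC median.

    Requires at least one QC sample overall.
    """
    qc_by_batch = defaultdict(list)
    for value, batch, qc in zip(values, batches, is_qc):
        if qc:
            qc_by_batch[batch].append(value)
    global_median = median([v for vals in qc_by_batch.values() for v in vals])

    normalized = []
    for value, batch in zip(values, batches):
        qcs = qc_by_batch.get(batch)
        batch_median = median(qcs) if qcs else 0.0
        # Assumption: leave values unchanged when the batch has no usable QC median.
        ratio = global_median / batch_median if batch_median > 0 else 1.0
        normalized.append(value * ratio)
    return normalized


# Hypothetical data: batch B reads twice as high as batch A.
values = [10.0, 10.0, 5.0, 20.0, 20.0, 10.0]
batches = ["A", "A", "A", "B", "B", "B"]
is_qc = [True, True, False, True, True, False]

normalized = batch_ratio_normalize(values, batches, is_qc)
```

After scaling, the QC samples of both batches share the same median, and the two non-QC samples, which differed only by the batch offset, end up with the same value.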

To ensure that each analytical batch contained enough samples of the respective type, batches were defined by Study_batch for Study samples, by acquisition date for QC samples and by acquisition month for NIST samples (Table 3).


Table 3. Overview of study batches.

Sample type   Number of batches   Median samples per batch
Study         347                 38
QCs           243                 7
NIST          21                  8
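
The batch definitions and the overview in Table 3 can be sketched as follows. This Python sketch is illustrative only (the analyses in this report were implemented in R). Apart from Study_batch, the field names ("type", "acquired") and the example records are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import median


def batch_key(sample):
    """Return the batch label of one sample, chosen according to its type."""
    kind = sample["type"]
    if kind == "Study":
        return sample["Study_batch"]
    if kind == "QC":
        return sample["acquired"].isoformat()        # one batch per acquisition date
    if kind == "NIST":
        return sample["acquired"].strftime("%Y-%m")  # one batch per acquisition month
    raise ValueError(f"unknown sample type: {kind!r}")


def batch_overview(samples):
    """Per sample type: (number of batches, median samples per batch)."""
    sizes = defaultdict(Counter)
    for sample in samples:
        sizes[sample["type"]][batch_key(sample)] += 1
    return {kind: (len(counts), median(counts.values()))
            for kind, counts in sizes.items()}


# Hypothetical sample records.
samples = [
    {"type": "Study", "Study_batch": "S1", "acquired": date(2013, 3, 22)},
    {"type": "Study", "Study_batch": "S1", "acquired": date(2013, 3, 23)},
    {"type": "Study", "Study_batch": "S2", "acquired": date(2013, 3, 23)},
    {"type": "QC", "Study_batch": None, "acquired": date(2013, 3, 22)},
    {"type": "QC", "Study_batch": None, "acquired": date(2013, 3, 22)},
    {"type": "QC", "Study_batch": None, "acquired": date(2013, 3, 23)},
    {"type": "NIST", "Study_batch": None, "acquired": date(2013, 3, 22)},
    {"type": "NIST", "Study_batch": None, "acquired": date(2013, 3, 30)},
    {"type": "NIST", "Study_batch": None, "acquired": date(2013, 4, 2)},
]

overview = batch_overview(samples)
```

Grouping the less frequent sample types over longer time windows (days for QCs, months for NIST) keeps enough samples in each batch for a within-batch RSD to be meaningful.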

Normalization performance was evaluated based on batch-wise and study-wide variable RSD.

Figure 1. Analytical batches for NIST samples shown for a single variable (shifted logarithm base 10 transformed).

Figure 2. Trend line (LOESS smoothed) for batch RSD shown as a function of variable mean (shifted logarithm base 10 transformed) for NIST samples.

Figure 3. Histogram of RSD ranges for all variables across all batches.

Figure 4. Histogram of median variable RSD ranges across all batches.


The optimal normalization method was selected based on both minimal batch-wise and minimal study-wide variable RSD. The plots below show study-wide normalization performance for NIST samples.

Figure 5. Trend line (LOESS smoothed) for study RSD shown as a function of variable mean (shifted logarithm base 10 transformed) for NIST samples.

Figure 6. Histogram of RSD ranges for all variables across the complete study.

Figure 7. Histogram of median variable RSD ranges across the complete study.


Normalization performance was similarly evaluated for Study samples. The plots below show normalization performance for Study samples across all batches.

Figure 8. Analytical batches for Study samples shown for a single variable (shifted logarithm base 10 transformed).

Figure 9. Trend line (LOESS smoothed) for batch RSD shown as a function of variable mean (shifted logarithm base 10 transformed) for Study samples.

Figure 10. Histogram of RSD ranges for all variables across all batches.

Figure 11. Histogram of median variable RSD ranges across all batches.


The plots below show normalization performance for Study samples across the complete study.

Figure 12. Trend line (LOESS smoothed) for study RSD shown as a function of variable mean (shifted logarithm base 10 transformed) for Study samples.

Figure 13. Histogram of RSD ranges for all variables across the complete study.

Figure 14. Histogram of median variable RSD ranges across the complete study.


Normalization Summary

Normalization performance was evaluated based on which method resulted in the lowest RSD for each variable and on the median rank of each method, with ranks based on decreasing RSD. For NIST samples, Cubic Splines displayed the lowest RSD for the most variables across batches (303) and tied with LOESS + Batch Ratio for the most variables across the complete study (192 each). Each method's median rank was calculated for batches and across the complete study (Table 4). The combination (median) of batch and study rank was used to identify the best overall normalizations as Cubic Splines and LOESS + Batch Ratio.
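
The two summaries described above can be sketched as follows: the number of variables for which each method gives the lowest RSD, and each method's median rank across variables. This Python sketch is illustrative only (the analyses in this report were implemented in R). It assumes that rank 1 goes to the lowest RSD, that tied methods share the lowest applicable rank, and that ties for the minimum are counted for every tied method. The function names and RSD values are hypothetical.

```python
from statistics import median


def min_rsd_counts(rsd_table):
    """Count, per method, the variables for which it attains the lowest RSD.

    rsd_table maps variable -> {method: RSD}. Ties count for every tied method.
    """
    counts = {}
    for per_method in rsd_table.values():
        best = min(per_method.values())
        for method, value in per_method.items():
            counts[method] = counts.get(method, 0) + (1 if value == best else 0)
    return counts


def median_ranks(rsd_table):
    """Median across variables of each method's within-variable rank (1 = lowest RSD)."""
    ranks = {}
    for per_method in rsd_table.values():
        for method, value in per_method.items():
            # Rank is one plus the number of methods with a strictly lower RSD.
            rank = 1 + sum(other < value for other in per_method.values())
            ranks.setdefault(method, []).append(rank)
    return {method: median(r) for method, r in ranks.items()}


# Hypothetical RSD values (percent) for three variables and three methods.
rsd_table = {
    "var1": {"Raw": 30.0, "LOESS": 10.0, "Cubic Splines": 12.0},
    "var2": {"Raw": 25.0, "LOESS": 8.0, "Cubic Splines": 8.0},
    "var3": {"Raw": 20.0, "LOESS": 15.0, "Cubic Splines": 9.0},
}

counts = min_rsd_counts(rsd_table)
ranks = median_ranks(rsd_table)
```

The count of minima rewards a method that is best on many variables, while the median rank also reflects how a method performs on the variables where it is not the best.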

Table 4. Number of variables displaying minimum batch and study RSD across all normalizations for NIST samples.

Normalization           Batch   Study   Batch rank   Sample rank   Overall rank
Raw                     0       0       5            6             6
Cubic Splines           303     192     1            3             1
Batch Ratio             1       2       6            4             5
Splines + Batch Ratio   4       57      4            3             4
LOESS                   21      20      2            4             3
LOESS + Batch Ratio     131     192     2            2             1
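
The Overall rank column of Table 4 is consistent with ranking the methods by the median of their Batch rank and Sample rank, with tied methods sharing the lowest rank. The Python sketch below is illustrative only (the analyses in this report were implemented in R) and applies this reading to the rank columns of Table 4. The tie handling is an assumption inferred from the table, not a documented rule, and the function name is hypothetical.

```python
from statistics import median


def overall_rank(batch_rank, study_rank):
    """Rank methods by the median of their batch and study ranks (ties share a rank)."""
    combined = {m: median([batch_rank[m], study_rank[m]]) for m in batch_rank}
    # Rank is one plus the number of methods with a strictly smaller combined score.
    return {m: 1 + sum(other < score for other in combined.values())
            for m, score in combined.items()}


# "Batch rank" and "Sample rank" columns of Table 4 (NIST samples).
batch_rank = {"Raw": 5, "Cubic Splines": 1, "Batch Ratio": 6,
              "Splines + Batch Ratio": 4, "LOESS": 2, "LOESS + Batch Ratio": 2}
study_rank = {"Raw": 6, "Cubic Splines": 3, "Batch Ratio": 4,
              "Splines + Batch Ratio": 3, "LOESS": 4, "LOESS + Batch Ratio": 2}

overall = overall_rank(batch_rank, study_rank)
```

Under this reading, Cubic Splines (ranks 1 and 3) and LOESS + Batch Ratio (ranks 2 and 2) both have a combined rank of 2, which is why they share overall rank 1.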

A similar approach was used to identify the optimal normalization based on batch-wise and study-wide RSD for Study samples (Table 5). LOESS displayed the lowest RSD for the most variables across batches, and LOESS + Batch Ratio for the most variables across the complete study. Based on overall batch-wise and study-wide performance, LOESS + Batch Ratio was identified as the best normalization for Study samples.

Table 5. Number of variables displaying minimum batch and study RSD across all normalizations for Study samples.

Normalization           Batch   Study   Batch rank   Sample rank   Overall rank
Raw                     0       32      5            4             5
Cubic Splines           12      0       3            5             4
Batch Ratio             0       0       6            5             6
Splines + Batch Ratio   0       58      3            3             3
LOESS                   368     40      1            2             2
LOESS + Batch Ratio     0       638     1            1             1

The plots below show a comparison of PCA sample scores for the implemented normalizations for NIST and Study samples.


Figure 15. PCA overview of normalized NIST samples with color displaying each sample’s acquisition month.


Figure 16. PCA overview of normalized Study samples with color displaying each sample’s acquisition month.


Results

All normalized data and results can be found in ./normalized data (Table 6).

Table 6. Workbook name and description of normalization results.

Workbook name                         Description
Study__batch_ratio.xlsx               normalized data
Study__LOESS.xlsx                     normalized data
Study__LOESS_batch_ratio.xlsx         normalized data
Study__splines.xlsx                   normalized data
Study__splines_batch_ratio.xlsx       normalized data
variable normalization summary.xlsx   Sample and NIST RSD summary for all variables across batches and the complete study

Notes

The LC-MS metabolomics (parts 1 and 2) data set showed the following:


Appendix

Cubic splines (Workman 2002) and LOESS (Dunn 2011) normalizations are batch-independent methods which implement smoothing across many samples to estimate and remove analytical variance. The LOESS normalization fits a model to QC samples, while the cubic splines method is QC sample independent. Unlike batch-wise normalization methods such as the batch ratio method, both of these normalizations fit a model to all samples across all batches. These methods are useful for removing continuous non-linear trends from the data, but may suffer in cases where there are abrupt changes in measured variable values between groups of samples. The batch ratio normalization is useful for adjusting batch median/mean values to the global study median and can effectively deal with abrupt changes in variable values across sample batches. However, batch-wise methods can suffer when batches contain too few samples to robustly calculate median/mean values or contain outliers. The combination of a smoothing-based and a batch-wise normalization method is suggested to effectively deal with both outlier samples and abrupt changes in variable values.
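
A QC-based LOESS correction of the kind described above can be sketched as follows. This Python sketch is illustrative only (the analyses in this report were implemented in R) and is not necessarily the procedure used here or in Dunn (2011). It assumes a basic local linear LOESS with tricube weights, acquisition time expressed as a number (for example days since the first run), and a correction that divides each value by the fitted QC trend and rescales to the QC median. The function names, the span value and the example data are hypothetical.

```python
import math
from statistics import median


def loess_predict(x_fit, y_fit, x_new, span=0.75):
    """Local linear regression with tricube weights (a basic LOESS)."""
    n = len(x_fit)
    k = min(n, max(2, math.ceil(span * n)))  # neighbours used in each local fit
    predictions = []
    for x0 in x_new:
        distances = sorted(abs(x - x0) for x in x_fit)
        # Bandwidth slightly beyond the k-th neighbour so it keeps a small weight.
        h = max(distances[k - 1] * 1.0001, 1e-12)
        weights = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in x_fit]
        total = sum(weights)
        mean_x = sum(w * x for w, x in zip(weights, x_fit)) / total
        mean_y = sum(w * y for w, y in zip(weights, y_fit)) / total
        sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, x_fit))
        sxy = sum(w * (x - mean_x) * (y - mean_y)
                  for w, x, y in zip(weights, x_fit, y_fit))
        slope = sxy / sxx if sxx > 0 else 0.0
        predictions.append(mean_y + slope * (x0 - mean_x))
    return predictions


def loess_normalize(times, values, is_qc, span=0.75):
    """Divide every value by the QC trend at its time, then rescale to the QC median."""
    qc_times = [t for t, q in zip(times, is_qc) if q]
    qc_values = [v for v, q in zip(values, is_qc) if q]
    trend = loess_predict(qc_times, qc_values, times, span)
    target = median(qc_values)
    # Assumption: leave a value unchanged if the fitted trend is not positive.
    return [v / f * target if f > 0 else v for v, f in zip(values, trend)]


# Hypothetical run: QCs at even time points, study samples at odd time points,
# both with the same relative upward drift over time.
times = list(range(10))
is_qc = [t % 2 == 0 for t in times]
values = [(100.0 + 5.0 * t) if q else (50.0 + 2.5 * t) for t, q in zip(times, is_qc)]

corrected = loess_normalize(times, values, is_qc)
```

Because the trend is smooth across all batches, a correction like this removes gradual drift but cannot follow a sudden jump between two batches, which is the case the batch ratio step is meant to handle.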


Software

All data analyses were implemented in R version 3.2.0 (2015-04-16) (R Development Core Team 2011).


References

Dunn, Warwick et al. 2011. “Procedures for Large-Scale Metabolic Profiling of Serum and Plasma Using Gas Chromatography and Liquid Chromatography Coupled to Mass Spectrometry.” Nat. Protocols 6: 1060–83.

R Development Core Team. 2011. “R: A Language and Environment for Statistical Computing.”

Workman, Christopher et al. 2002. “A New Non-Linear Normalization Method for Reducing Variability in DNA Microarray Experiments.” Genome Biology 3: 119–28.