Normalization Example


Analysis Summary

Data analyses were implemented to normalize LC-MS metabolomics data (parts 1 and 2) for analytical batch effects. The data contain 14526 measurements for 443 variables, acquired from 03/22/2013 to 06/28/2015, and comprise Study samples, quality control (QC) samples and NIST laboratory reference (NIST) samples (Table 1).


Table 1. Overview of sample types.

Study   QCs    NIST
12986   1382   158

Overview of Normalization Methods

Six data treatments were compared: Raw, Cubic Splines, Batch Ratio, Splines + Batch Ratio, LOESS and LOESS + Batch Ratio (Table 2). Normalization performance was evaluated using the relative standard deviation (RSD) of each variable, calculated within batches and across the complete study, separately for QC, NIST and Study samples.
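The report does not state the exact RSD formula. The sketch below assumes the common definition, sample standard deviation divided by the mean, expressed in percent, and computes it for one variable both within batches and across the whole study. The function names and toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def rsd(values):
    """Relative standard deviation in percent: 100 * sd / mean (assumed definition)."""
    if len(values) < 2:
        return float("nan")  # sd is undefined for fewer than two values
    m = mean(values)
    if m == 0:
        return float("nan")  # avoid division by zero
    return 100.0 * stdev(values) / m

def batch_rsd(values, batches):
    """RSD of one variable within each batch, returned as {batch: rsd}."""
    groups = defaultdict(list)
    for v, b in zip(values, batches):
        groups[b].append(v)
    return {b: rsd(vs) for b, vs in groups.items()}

# Toy data: one variable measured in two batches with a level shift between them.
values = [100.0, 110.0, 90.0, 200.0, 220.0, 180.0]
batches = ["b1", "b1", "b1", "b2", "b2", "b2"]

per_batch = batch_rsd(values, batches)  # within-batch RSD
study = rsd(values)                     # study-wide RSD
```

In this toy case the within-batch RSD is the same in both batches, while the study-wide RSD is much larger because of the shift between batches. That gap is the kind of batch effect the normalizations aim to reduce.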


Table 2. Overview of data normalization methods.

Normalization          Description
Raw                    Original data.
Cubic Splines          Cubic splines fit to data quantiles.
Batch Ratio            QC samples used to adjust sample batch median to global study median.
Splines + Batch Ratio  Cubic splines followed by batch ratio normalization.
LOESS                  Normalization based on a locally weighted scatterplot smoothing model fit to QC samples and their acquisition date.
LOESS + Batch Ratio    LOESS followed by batch ratio normalization.
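To make two of the descriptions above concrete, here is a minimal sketch of Batch Ratio and LOESS normalization for a single variable. It is one plausible reading of Table 2, not the code used for this analysis. Assumptions: the batch ratio is the global QC median divided by the batch QC median; the LOESS model is a local linear fit with tricube weights; LOESS-corrected values are rescaled to the QC median; acquisition dates have been converted to numbers; QC values are non-zero. All function names and the span value are hypothetical, and the cubic-spline method is not covered.

```python
from statistics import median

def batch_ratio_normalize(values, batches, is_qc):
    """Scale each batch so that its QC median equals the global QC median."""
    global_qc = median([v for v, q in zip(values, is_qc) if q])
    factors = {}
    for b in set(batches):
        qc = [v for v, bb, q in zip(values, batches, is_qc) if bb == b and q]
        # Assumption: a batch without QC samples is left unscaled.
        factors[b] = global_qc / median(qc) if qc else 1.0
    return [v * factors[b] for v, b in zip(values, batches)]

def loess_fit(x, y, x_new, span=0.75):
    """Local linear regression with tricube weights, evaluated at x_new."""
    n = len(x)
    k = max(2, min(n, round(span * n)))  # number of neighbours in each window
    fitted = []
    for x0 in x_new:
        dists = sorted(abs(xi - x0) for xi in x)
        # Bandwidth: distance to the k-th nearest point, slightly inflated so
        # that the k-th point keeps a non-zero weight.
        h = max(dists[k - 1], 1e-12) * (1 + 1e-9)
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

def loess_normalize(values, times, is_qc, span=0.75):
    """Divide out the QC trend over acquisition time, rescaled to the QC median."""
    qc_t = [t for t, q in zip(times, is_qc) if q]
    qc_v = [v for v, q in zip(values, is_qc) if q]
    trend = loess_fit(qc_t, qc_v, times, span)
    target = median(qc_v)
    return [v * target / f for v, f in zip(values, trend)]

# Toy example 1: two batches at different levels (QCs flagged with True).
ratio_normalized = batch_ratio_normalize(
    [10.0, 12.0, 11.0, 20.0, 24.0, 22.0],
    [1, 1, 1, 2, 2, 2],
    [True, False, True, True, False, True],
)

# Toy example 2: a linear drift over acquisition time, QCs at every other run.
times = list(range(10))
drift_values = [100.0 + 2.0 * t for t in times]
qc_flags = [True, False] * 5
loess_normalized = loess_normalize(drift_values, times, qc_flags)
```

Under these assumptions, "LOESS + Batch Ratio" would correspond to passing the output of `loess_normalize` to `batch_ratio_normalize`.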

To ensure that each analytical batch contained enough samples of the respective type, batches were defined by Study_batch for Study samples, by acquisition date for QC samples and by acquisition month for NIST samples (Table 3).


Table 3. Overview of study batches.

Sample type  Number of batches  Median samples per batch
Study        347                38
QCs          243                7
NIST         21                 8
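The batch definitions above can be sketched as a small lookup that assigns a batch label by sample type and then summarizes the number of batches and the median batch size, as in Table 3. The record layout, the function names and the toy records are hypothetical; only the three grouping rules come from the text.

```python
from collections import Counter, defaultdict
from datetime import date
from statistics import median

def batch_key(sample_type, acquired, study_batch=None):
    """Return the batch label for one sample, depending on its type."""
    if sample_type == "Study":
        return study_batch                 # Study_batch label
    if sample_type == "QC":
        return acquired.isoformat()        # acquisition date
    if sample_type == "NIST":
        return acquired.strftime("%Y-%m")  # acquisition month
    raise ValueError(f"unknown sample type: {sample_type}")

def batch_overview(samples):
    """samples: iterable of (sample_type, acquired, study_batch) tuples.

    Returns {sample_type: (number of batches, median samples per batch)}.
    """
    sizes = defaultdict(Counter)
    for sample_type, acquired, study_batch in samples:
        sizes[sample_type][batch_key(sample_type, acquired, study_batch)] += 1
    return {t: (len(c), median(c.values())) for t, c in sizes.items()}

# Toy records (not taken from the study).
samples = [
    ("Study", date(2013, 3, 22), "S1"),
    ("Study", date(2013, 3, 22), "S1"),
    ("Study", date(2013, 3, 23), "S1"),
    ("Study", date(2013, 3, 23), "S2"),
    ("QC", date(2013, 3, 22), None),
    ("QC", date(2013, 3, 22), None),
    ("QC", date(2013, 3, 23), None),
    ("NIST", date(2013, 3, 22), None),
    ("NIST", date(2013, 3, 29), None),
    ("NIST", date(2013, 4, 2), None),
]
overview = batch_overview(samples)
```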

Normalization method performance was evaluated based on the batch-wise and study-wide RSD of each variable.

Figure 1. Analytical batches for NIST samples shown for a single variable (shifted logarithm base 10 transformed).

Figure 2. Trend line (LOESS smoothed) for batch RSD shown as a function of variable mean (shifted logarithm base 10 transformed) for NIST samples.

Figure 3. Histogram of RSD ranges for all variables across all batches.

Figure 4. Histogram of median variable RSD ranges across all batches.
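The two histogram summaries can be sketched as follows: every batch RSD is assigned to an RSD range, and separately the per-variable median of the batch RSDs is assigned to a range. The range boundaries used in the figures are not given in the text, so the edges below, like the function names and toy values, are hypothetical.

```python
from collections import Counter
from statistics import median

RSD_EDGES = [10, 20, 30, 50]  # hypothetical range boundaries, in percent

def rsd_range(value, edges=RSD_EDGES):
    """Label the range an RSD value falls in, e.g. '10-20' or '>=50'."""
    lower = 0
    for upper in edges:
        if value < upper:
            return f"{lower}-{upper}"
        lower = upper
    return f">={edges[-1]}"

def range_counts(rsds):
    """Count how many RSD values fall in each range."""
    return Counter(rsd_range(v) for v in rsds)

# Toy input: batch_rsds[variable] holds one RSD per batch.
batch_rsds = {
    "var_a": [5.0, 12.0, 8.0],
    "var_b": [25.0, 40.0, 60.0],
}

# All variables across all batches (one count per variable and batch).
all_counts = range_counts(v for vals in batch_rsds.values() for v in vals)

# Median RSD per variable across batches (one count per variable).
median_counts = range_counts(median(vals) for vals in batch_rsds.values())
```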


The optimal normalization method was selected based on minimal RSD both within batches and study-wide. The plots below show study-wide normalization performance for NIST samples.

Figure 5. Trend line (LOESS smoothed) for study RSD shown as a function of variable mean (shifted logarithm base 10 transformed) for NIST samples.

Figure 6. Histogram of RSD ranges for all variables across the complete study.

Figure 7. Histogram of median variable RSD ranges across the complete study.


Normalization performance was similarly evaluated for Study samples. The plots below show normalization performance for Study samples across all batches.

Figure 8. Analytical batches for Study samples shown for a single variable (shifted logarithm base 10 transformed).

Figure 9. Trend line (LOESS smoothed) for batch RSD shown as a function of variable mean (shifted logarithm base 10 transformed) for Study samples.

Figure 10. Histogram of RSD ranges for all variables across all batches.

Figure 11. Histogram of median variable RSD ranges across all batches.


The plots below show normalization performance for Study samples across the complete study.

Figure 12. Trend line (LOESS smoothed) for study RSD shown as a function of variable mean (shifted logarithm base 10 transformed) for Study samples.

Figure 13. Histogram of RSD ranges for all variables across the complete study.

Figure 14. Histogram of median variable RSD ranges across the complete study.


Normalization Summary

Normalization performance was evaluated based on which method resulted in the lowest RSD for each variable and on each method’s median rank when methods were ranked by RSD. For NIST samples, Cubic Splines displayed the lowest RSD for the most variables across batches, and Cubic Splines tied with LOESS + Batch Ratio for the most variables across the complete study (Table 4). Each normalization’s RSD-based rank was used to rank the methods’ median performance within batches and across the complete study. The combination (median) of batch and study rank was used to identify the best overall normalizations as Cubic Splines and LOESS + Batch Ratio.

Table 4. Number of variables displaying minimum batch and study RSD across all normalizations for NIST samples.

Normalization          Batch  Study  Batch rank  Sample rank  Overall rank
Raw                    0      0      5           6            6
Cubic Splines          303    192    1           3            1
Batch Ratio            1      2      6           4            5
Splines + Batch Ratio  4      57     4           3            4
LOESS                  21     20     2           4            3
LOESS + Batch Ratio    131    192    2           2            1
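The ranking logic can be sketched in two parts: counting, per method, the variables for which it gives the lowest RSD, and combining two rank columns into an overall rank. The sketch assumes that tied methods share the lowest rank (1, 1, 3, ...) and that a tie for the minimum RSD is counted for every tied method. Neither rule is stated in the report, but the first reproduces the Overall rank column of Table 4 when applied to its Batch rank and Sample rank columns. Function names and the toy RSD values are hypothetical.

```python
from statistics import median

def count_minimum(rsd_by_method):
    """rsd_by_method: {method: [RSD per variable]}.

    Returns {method: number of variables for which it has the lowest RSD}.
    """
    methods = list(rsd_by_method)
    n_vars = len(rsd_by_method[methods[0]])
    wins = {m: 0 for m in methods}
    for j in range(n_vars):
        best = min(rsd_by_method[m][j] for m in methods)
        for m in methods:
            if rsd_by_method[m][j] == best:
                wins[m] += 1  # assumption: ties count for every tied method
    return wins

def competition_rank(scores):
    """Rank keys by ascending score; ties share the lowest rank (1, 1, 3, ...)."""
    return {k: 1 + sum(other < s for other in scores.values())
            for k, s in scores.items()}

# Toy RSDs for three variables (not taken from the study).
wins = count_minimum({
    "Raw": [30.0, 25.0, 40.0],
    "LOESS": [12.0, 20.0, 15.0],
    "Batch Ratio": [18.0, 10.0, 15.0],
})

# Batch rank and Sample rank columns as reported in Table 4.
batch_rank = {"Raw": 5, "Cubic Splines": 1, "Batch Ratio": 6,
              "Splines + Batch Ratio": 4, "LOESS": 2, "LOESS + Batch Ratio": 2}
study_rank = {"Raw": 6, "Cubic Splines": 3, "Batch Ratio": 4,
              "Splines + Batch Ratio": 3, "LOESS": 4, "LOESS + Batch Ratio": 2}

# Overall rank: rank of the median of the two ranks.
combined = {m: median([batch_rank[m], study_rank[m]]) for m in batch_rank}
overall_rank = competition_rank(combined)
```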

A similar approach was used to identify the optimal normalization for Study samples based on batch and study-wide RSD (Table 5). LOESS displayed the lowest RSD for the most variables across batches, and LOESS + Batch Ratio for the most variables across the complete study. Based on overall batch and study-wide performance, LOESS + Batch Ratio was identified as the best normalization for Study samples.

Table 5. Number of variables displaying minimum batch and study RSD across all normalizations for Study samples.

Normalization          Batch  Study  Batch rank  Sample rank  Overall rank
Raw                    0      32     5           4            5
Cubic Splines          12     0      3           5            4
Batch Ratio            0      0      6           5            6
Splines + Batch Ratio  0      58     3           3            3
LOESS                  368    40     1           2            2
LOESS + Batch Ratio    0      638    1           1            1

The plots below show a comparison of PCA sample scores for the implemented normalizations for NIST and Study samples.
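For readers unfamiliar with PCA sample scores, the sketch below shows one way to compute them for a samples-by-variables matrix using only the standard library. It is an illustration under assumptions, not the procedure used for the plots: the data are mean-centered only (any scaling or transformation applied in the analysis is not described here), components are found by power iteration with deflation, and all function names and the toy matrix are hypothetical. The sign of each component is arbitrary.

```python
def center_columns(X):
    """Subtract the column mean from every column of X (a list of rows)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]

def first_component(X, n_iter=500):
    """Leading principal axis of centered X via power iteration on X^T X."""
    p = len(X[0])
    v = [1.0 / (i + 1) for i in range(p)]  # arbitrary deterministic start
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, v)) for row in X]            # X v
        w = [sum(s * row[j] for s, row in zip(scores, X)) for j in range(p)]  # X^T X v
        norm = sum(a * a for a in w) ** 0.5
        if norm == 0:
            break  # no variance left to explain
        v = [a / norm for a in w]
    return v

def pca_scores(X, n_components=2):
    """Return one row of component scores per sample."""
    Xc = center_columns(X)
    columns = []
    for _ in range(n_components):
        v = first_component(Xc)
        t = [sum(a * b for a, b in zip(row, v)) for row in Xc]
        columns.append(t)
        # Deflate: remove the variation explained by this component.
        Xc = [[a - ti * b for a, b in zip(row, v)] for row, ti in zip(Xc, t)]
    return [list(row) for row in zip(*columns)]

# Toy matrix: four samples, two perfectly correlated variables.
scores = pca_scores([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]])
```

Because the two toy variables are perfectly correlated, all variation lies on the first component and the second-component scores are numerically zero.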