The transcriptional architecture is a complex and dynamic aspect of a cell's function. Next-generation sequencing of steady-state RNA (RNA-seq) gives unprecedented detail about the RNA landscape within a cell. Not only can expression levels of genes be interrogated without specific prior knowledge, but expression levels can also be compared between genes within a sample. It has also been demonstrated that splicing variants [1, 2] and single nucleotide polymorphisms can be detected through sequencing the transcriptome, opening up the opportunity to interrogate allele-specific expression and RNA editing.
An important aspect of dealing with the vast amounts of data generated from short read sequencing is the processing methods used to extract and interpret the information. Experience with microarray data has repeatedly shown that normalization is a critical component of the processing pipeline, allowing accurate estimation and detection of differential expression (DE). The aim of normalization is to remove systematic technical effects that occur in the data to ensure that technical bias has minimal impact on the results. However, the procedure for generating RNA-seq data is fundamentally different from that for microarray data, so the normalization methods used are not directly applicable. It has been suggested that 'One particularly powerful advantage of RNA-seq is that it can capture transcriptome dynamics across different tissues or conditions without sophisticated normalization of data sets'. We demonstrate here that the reality of RNA-seq data analysis is not this simple; normalization is often still an important consideration.
Scaling to library size as a form of normalization makes intuitive sense, given it is expected that sequencing a sample to half the depth will give, on average, half the number of reads mapping to each gene. We believe this is appropriate for normalizing between replicate samples of an RNA population. However, library size scaling is too simple for many biological applications. The number of tags expected to map to a gene is not only dependent on the expression level and length of the gene, but also the composition of the RNA population that is being sampled. Thus, if a large number of genes are unique to, or highly expressed in, one experimental condition, the sequencing 'real estate' available for the remaining genes in that sample is decreased. If not adjusted for, this sampling artifact can force the DE analysis to be skewed towards one experimental condition. Current analysis methods [6, 11] have not accounted for this proportionality property of the data explicitly, potentially giving rise to higher false positive rates and lower power to detect true differences.
The fundamental issue here is the appropriate metric of expression to compare across samples. The standard procedure is to compute the proportion of each gene's reads relative to the total number of reads and compare that across all samples, either by transforming the original data or by introducing a constant into a statistical model. However, since different experimental conditions (for example, tissues) express diverse RNA repertoires, we cannot always expect the proportions to be directly comparable. Furthermore, we argue that in the discovery of biologically meaningful changes in expression, it should be considered undesirable to have under- or oversampling effects (discussed further below) guiding the DE calls. The normalization method presented below uses the raw data to estimate appropriate scaling factors that can be used in downstream statistical analysis procedures, thus accounting for the sampling properties of RNA-seq data.
Estimated normalization factors should ensure that a gene with the same expression level in two samples is not detected as DE. To further highlight the need for more sophisticated normalization procedures in RNA-seq data, consider a simple thought experiment. Imagine we have a sequencing experiment comparing two RNA populations, A and B. In this hypothetical scenario, suppose every gene that is expressed in B is expressed in A with the same number of transcripts. However, assume that sample A also contains a set of genes equal in number and expression that are not expressed in B. Thus, sample A has twice as many total expressed genes as sample B, that is, its RNA production is twice the size of sample B. Suppose that each sample is then sequenced to the same depth. Without any additional adjustment, a gene expressed in both samples will have, on average, half the number of reads from sample A, since the reads are spread over twice as many genes. Therefore, the correct normalization would adjust sample A by a factor of 2.
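The arithmetic of this thought experiment can be checked with a few lines of Python. The sketch below is illustrative only: the gene numbers, the unit abundances and the helper `expected_counts` are hypothetical, and it assumes that reads are shared among genes strictly in proportion to transcript abundance (equal gene lengths, no sampling noise).

```python
def expected_counts(abundances, depth):
    """Expected reads per gene when reads are shared in proportion to abundance."""
    total = sum(abundances)
    return [depth * x / total for x in abundances]

# Hypothetical toy numbers: 100 genes shared by A and B, plus 100 genes
# expressed only in A, all at the same transcript abundance.
n_shared = 100
mu_b = [1.0] * n_shared + [0.0] * n_shared   # B expresses only the shared genes
mu_a = [1.0] * (2 * n_shared)                # A expresses shared + A-only genes
depth = 1_000_000                            # both samples sequenced to the same depth

exp_a = expected_counts(mu_a, depth)
exp_b = expected_counts(mu_b, depth)

# A gene with identical true expression in both samples:
ratio_shared = exp_a[0] / exp_b[0]     # observed A/B ratio without adjustment
correction_for_a = 1.0 / ratio_shared  # factor needed to bring A back in line

print(ratio_shared, correction_for_a)
```

Under these assumptions the script prints `0.5 2.0`: the shared gene receives half as many reads in A, so A must be adjusted by a factor of 2 for the gene not to appear DE.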
The hypothetical example above highlights the notion that the proportion of reads attributed to a given gene in a library depends on the expression properties of the whole sample rather than just the expression level of that gene. Obviously, the above example is artificial. However, there are biological and even technical situations where such a normalization is required. For example, if an RNA sample is contaminated, the reads that represent the contamination will take away reads from the true sample, thus dropping the number of reads of interest and offsetting the proportion for every gene. However, as we demonstrate, true biological differences in RNA composition between samples will be the main reason for normalization.
A more formal explanation for the requirement of normalization uses the following framework. Define $Y_{gk}$ as the observed count for gene g in library k summarized from the raw reads, $\mu_{gk}$ as the true and unknown expression level (number of transcripts), $L_g$ as the length of gene g and $N_k$ as the total number of reads for library k. We can model the expected value of $Y_{gk}$ as:
$$E[Y_{gk}] = \frac{\mu_{gk} L_g}{S_k} N_k, \qquad \text{where } S_k = \sum_{g} \mu_{gk} L_g$$
Here the sum runs over all genes, so $S_k$ represents the total RNA output of a sample. The problem underlying the analysis of RNA-seq data is that while $N_k$ is known, $S_k$ is unknown and can vary drastically from sample to sample, depending on the RNA composition. As mentioned above, if a population has a larger total RNA output, then RNA-seq experiments will under-sample many genes, relative to another sample.
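To make the role of the total RNA output concrete, the following minimal sketch evaluates the expected counts for two hypothetical libraries. It assumes the expected count of a gene is its share of the total RNA output (expression level times length, divided by the sum of that product over all genes) multiplied by the library size. The three-gene abundances, lengths and the function name are invented for illustration.

```python
def expected_counts(mu, lengths, n_reads):
    """Return (expected counts per gene, total RNA output S) for one library.

    Assumes E[Y_g] = mu_g * L_g / S * N with S = sum over genes of mu_g * L_g.
    """
    total_output = sum(m * l for m, l in zip(mu, lengths))
    counts = [m * l / total_output * n_reads for m, l in zip(mu, lengths)]
    return counts, total_output

# Hypothetical three-gene example; lengths in bases, abundances in transcripts.
lengths = [1000.0, 2000.0, 1500.0]
mu_1 = [10.0, 10.0, 10.0]     # library 1
mu_2 = [10.0, 10.0, 110.0]    # library 2: only gene 3 is more highly expressed
n_reads = 90_000              # same library size for both

counts_1, s_1 = expected_counts(mu_1, lengths, n_reads)
counts_2, s_2 = expected_counts(mu_2, lengths, n_reads)

# Genes 1 and 2 have identical true expression in both libraries, yet their
# expected counts shrink by the ratio of total outputs, s_1 / s_2.
shrinkage = counts_2[0] / counts_1[0]
print(s_1, s_2, shrinkage)
```

Although the library sizes are equal, the unchanged genes receive fewer reads in the library with the larger total output, which is exactly the under-sampling described above.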
At this stage, we leave the variance in the above model for $Y_{gk}$ unspecified. Depending on the experimental situation, Poisson seems appropriate for technical replicates [6, 7] and Negative Binomial may be appropriate for the additional variation observed from biological replicates. It is also worth noting that, in practice, $L_g$ is generally absorbed into the $\mu_{gk}$ parameter and does not get used in the inference procedure. However, it has been well established that gene length biases are prominent in the analysis of gene expression.
The total RNA production, $S_k$, cannot be estimated directly, since we do not know the expression levels and true lengths of every gene. However, the relative RNA production of two samples, $f_k = S_k / S_{k'}$, essentially a global fold change, can more easily be determined. We propose an empirical strategy that equates the overall expression levels of genes between samples under the assumption that the majority of them are not DE. One simple yet robust way to estimate the ratio of RNA production uses a weighted trimmed mean of the log expression ratios (trimmed mean of M values (TMM)). For sequencing data, we define the gene-wise log-fold-changes as:
$$M_g = \log_2 \frac{Y_{gk}/N_k}{Y_{gk'}/N_{k'}}$$
and the absolute expression levels as:
$$A_g = \frac{1}{2} \log_2 \left( \frac{Y_{gk}}{N_k} \cdot \frac{Y_{gk'}}{N_{k'}} \right) \quad \text{for } Y_{gk}, Y_{gk'} \neq 0$$
To robustly summarize the observed M values, we trim both the M values and the A values before taking the weighted average. Precision (inverse of the variance) weights are used to account for the fact that log fold changes (effectively, a log relative risk) from genes with larger read counts have lower variance on the logarithm scale. See Materials and methods for further details.
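The trimming and weighting steps can be sketched in plain Python as follows. This is a simplified illustration, not the reference implementation: the function names are hypothetical, the trimming fractions (30% of M values and 5% of A values from each tail) and the delta-method variance used for the weights are assumptions that are not given in this section, and trimming is done by quantile cut-offs rather than by ranks. Genes with a zero count in either sample are skipped because their M and A values are undefined.

```python
import math
import random

def quantile(values, q):
    """Linear-interpolation quantile of a non-empty list, 0 <= q <= 1."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def tmm_factor(obs, ref, m_trim=0.30, a_trim=0.05):
    """Weighted trimmed mean of M values for `obs` relative to `ref`, as 2**mean."""
    n_obs, n_ref = sum(obs), sum(ref)
    rows = []
    for y_o, y_r in zip(obs, ref):
        if y_o <= 0 or y_r <= 0:
            continue  # M and A are undefined for zero counts
        p_o, p_r = y_o / n_obs, y_r / n_ref
        m = math.log2(p_o / p_r)        # log fold change
        a = 0.5 * math.log2(p_o * p_r)  # average log abundance
        # Assumed approximate variance of M; the weight is its inverse.
        var = (n_obs - y_o) / (n_obs * y_o) + (n_ref - y_r) / (n_ref * y_r)
        rows.append((m, a, 1.0 / var))

    m_vals = [r[0] for r in rows]
    a_vals = [r[1] for r in rows]
    m_lo, m_hi = quantile(m_vals, m_trim), quantile(m_vals, 1 - m_trim)
    a_lo, a_hi = quantile(a_vals, a_trim), quantile(a_vals, 1 - a_trim)

    # Keep only genes inside both trimming windows, then average with weights.
    kept = [(m, w) for m, a, w in rows if m_lo <= m <= m_hi and a_lo <= a <= a_hi]
    log2_factor = sum(w * m for m, w in kept) / sum(w for _, w in kept)
    return 2.0 ** log2_factor

# Hypothetical noise-free example: 1,000 genes, 50 of them 20-fold higher in `obs`.
random.seed(0)
mu_ref = [random.uniform(10, 1000) for _ in range(1000)]
mu_obs = [20 * x if i < 50 else x for i, x in enumerate(mu_ref)]
depth = 1e7
s_ref, s_obs = sum(mu_ref), sum(mu_obs)
ref = [depth * x / s_ref for x in mu_ref]
obs = [depth * x / s_obs for x in mu_obs]

factor = tmm_factor(obs, ref)
print(factor, s_ref / s_obs)
```

In this idealized setting the 50 changed genes fall in the trimmed upper tail of the M values, so the returned factor matches the ratio at which unchanged genes are observed in `obs` relative to `ref`. With real counts the estimate is only approximate and depends on most genes not being DE.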
Normalization factors across several samples can be calculated by selecting one sample as a reference and calculating the TMM factor for each non-reference sample. Similar to two-sample comparisons, the TMM normalization factors can be built into the statistical model used to test for DE. For example, a Poisson model would modify the observed library size to an effective library size, which adjusts the modeled mean (for example, using an additional offset in a generalized linear model; see Materials and methods for further details).
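As a rough illustration of how such factors might enter a model, the sketch below turns library sizes and normalization factors into effective library sizes and log offsets. It assumes, as a convention not spelled out in this section, that the effective library size is the product of the observed library size and the normalization factor, and that the reference sample has a factor of 1. The function names and numbers are hypothetical.

```python
import math

def effective_library_sizes(lib_sizes, norm_factors):
    """Effective library size per sample: observed size times its factor (assumed form)."""
    return [n * f for n, f in zip(lib_sizes, norm_factors)]

def glm_offsets(lib_sizes, norm_factors):
    """Log effective library sizes, usable as offsets in a log-link count model."""
    return [math.log(n) for n in effective_library_sizes(lib_sizes, norm_factors)]

# Hypothetical three-sample experiment; the first sample is the reference.
lib_sizes = [1_000_000, 2_000_000, 1_500_000]
norm_factors = [1.0, 0.5, 1.2]

eff = effective_library_sizes(lib_sizes, norm_factors)
offsets = glm_offsets(lib_sizes, norm_factors)
print(eff)
```

With a log link, adding these offsets to the linear predictor scales the modeled mean of each sample by its effective rather than its observed library size.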
We applied our method to a publicly available transcriptional profiling data set comparing several technical replicates of a liver and kidney RNA source. Figure 1a shows the distribution of M values between two technical replicates of the kidney sample after the standard normalization procedure of accounting for the total number of reads. The distribution of M values for these technical replicates is concentrated around zero. However, Figure 1b shows that log ratios between a liver and kidney sample are significantly offset towards higher expression in kidney, even after accounting for the total number of reads. Also highlighted (green line) is the distribution of observed M values for a set of housekeeping genes, showing a significant shift away from zero. If scaling to the total number of reads appropriately normalized RNA-seq data, then such a shift in the log-fold-changes is not expected. The explanation for this bias is straightforward. The M versus A plot in Figure 1c illustrates that there exists a prominent set of genes with higher expression in liver (black arrow). As a result, the distribution of M values (liver to kidney) is skewed in the negative direction. Since a large amount of sequencing is dedicated to these liver-specific genes, there is less sequencing available for the remaining genes, thus proportionally distorting the M values (and therefore, the DE calls) towards being kidney-specific.