Farrago

7/24/2023

This post covers the units used in RNA-Seq that are, unfortunately, often misused and misunderstood. I’ll try to clear up a bit of the confusion here.

The first thing one should remember is that without between-sample normalization (a topic for a later post), NONE of these units are comparable across experiments. This is a result of RNA-Seq being a relative measurement, not an absolute one.

Throughout this post “read” refers to both single-end and paired-end reads. The concept of counting is the same with either type of read, as each read represents a fragment that was sequenced.

When saying “feature”, I’m referring to an expression feature, by which I mean a genomic region containing a sequence that can normally appear in an RNA-Seq experiment (e.g. gene, isoform, exon).

Finally, I use the random variable $X_i$ to denote the counts you observe from a feature of interest $i$. Unfortunately, with alternative splicing you do not directly observe $X_i$, so often its expectation $\mathbb{E}[X_i]$ is used, which is estimated using the EM algorithm by a method like eXpress, RSEM, Sailfish, Cufflinks, or one of many other tools.

“Counts” usually refers to the number of reads that align to a particular feature. I’ll refer to counts by the random variable $X_i$. These numbers are heavily dependent on two things: (1) the number of fragments you sequenced (this is related to relative abundances) and (2) the length of the feature, or more appropriately, the effective length. Effective length refers to the number of possible start sites from which a feature could have generated a fragment of a particular length. In practice, the effective length is usually computed as:

$\tilde{l}_i = l_i - \mu_{FLD} + 1$

where $l_i$ is the length of feature $i$ and $\mu_{FLD}$ is the mean of the fragment length distribution, which was learned from the aligned reads. If the abundance estimation method you’re using incorporates sequence bias modeling (such as eXpress or Cufflinks), the bias is often incorporated into the effective length by making the feature shorter or longer depending on the effect of the bias.

Since counts are NOT scaled by the length of the feature, all units in this category are not comparable within a sample without adjusting for the feature length. This means you can’t sum the counts over a set of features to get the expression of that set (e.g. you can’t sum isoform counts to get gene counts).

Counts are often used by differential expression methods since they are naturally represented by a counting model, such as a negative binomial (NB2).

When eXpress came out, they began reporting “effective counts.” This is basically the same thing as standard counts, with the difference being that they are adjusted for the amount of bias in the experiment. The intuition here is that if the effective length is much shorter than the actual length, then in an experiment with no bias you would expect to see more counts. Thus, the effective counts are scaling the observed counts up.

Counts per million

Counts per million (CPM) mapped reads are counts scaled by the number of fragments you sequenced ($N$) times one million. This unit is related to the FPKM without length normalization and a factor of $10^3$:

$\text{CPM}_i = \frac{X_i}{N / 10^6} = \frac{X_i}{N} \cdot 10^6$

I’m not sure where this unit first appeared, but I’ve seen it used with edgeR and talked about briefly in the limma voom paper.

As noted in the counts section, the number of fragments you see from a feature depends on its length. Therefore, in order to compare features of different length you should normalize counts by the length of the feature. Doing so allows the summation of expression across features to get the expression of a group of features (think a set of transcripts which make up a gene). Again, the methods in this section allow for comparison of features with different length WITHIN a sample but not BETWEEN samples.

Transcripts per million (TPM) is a measurement of the proportion of transcripts in your pool of RNA. Since we are interested in taking the length into consideration, a natural measurement is the rate, counts per base ($X_i / \tilde{l}_i$). As you might immediately notice, this number is also dependent on the total number of fragments sequenced. To adjust for this, simply divide by the sum of all rates and this gives the proportion of transcripts in your sample. After you compute that, you simply scale by one million because the proportion is often very small and a pain to deal with. TPM has a very nice interpretation when you’re looking at transcript abundances.
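As a rough illustration of the units discussed in this post, here is a minimal Python sketch of effective length, CPM, and TPM for a single sample. The function names and the toy numbers are hypothetical and not taken from any of the tools mentioned. It also makes two assumptions: the total number of fragments is taken to be the sum of the counts passed in, and effective lengths are clamped to a minimum of 1 to avoid dividing by zero or a negative number for very short features.

```python
def effective_lengths(lengths, mean_fragment_length):
    """Effective length: feature length - mean fragment length + 1.

    Clamping to 1 is an assumption made here for very short features;
    real tools handle that case in their own ways.
    """
    return [max(l - mean_fragment_length + 1, 1) for l in lengths]


def cpm(counts):
    """Counts per million: counts divided by total fragments, times 1e6.

    Assumes the total number of fragments equals sum(counts).
    """
    total = sum(counts)
    return [c / total * 1e6 for c in counts]


def tpm(counts, eff_lengths):
    """Transcripts per million: each feature's rate (counts per base)
    divided by the sum of all rates, times 1e6."""
    rates = [c / l for c, l in zip(counts, eff_lengths)]
    total_rate = sum(rates)
    return [r / total_rate * 1e6 for r in rates]


# Toy data: three features in one sample (made-up numbers).
counts = [100, 200, 300]
lengths = [1000, 2000, 3000]
eff = effective_lengths(lengths, mean_fragment_length=200)

sample_cpm = cpm(counts)
sample_tpm = tpm(counts, eff)
print(eff)
print(sample_cpm)
print(sample_tpm)
```

Both CPM and TPM sum to one million within a sample, but only TPM accounts for feature length, which is why two features with the same counts per base get the same TPM even when their raw counts differ.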
