# Appendix B: Statistics

Statistical manipulation is often necessary to order, define and organize raw data. A full analysis of statistics is beyond the scope of this work, but there are some standard analyses that anyone working in a cell biology laboratory should be aware of and know how to perform. After data is collected, it must be ordered, or grouped, according to the information being sought. Data is collected in one of three forms:

| Type of Data | Type of Entry |
| --- | --- |
| Nominal | yes or no |
| Ordinal | +, ++, +++ |
| Numerical | 0, 1, 1.3, etc. |

When collected, the data may appear to be a mere collection of numbers, with few apparent trends. It is first necessary to order those numbers. One method is to count the times a number falls within a range increment. For example, in tossing a coin, one would count the number of Heads and Tails (eliminating the possibility of it landing on its edge). Coin flipping is nominal data, and thus has only two alternatives. Should we flip the coin 100 times, we could count the number of times it lands Heads and the number of times it lands Tails. We would thus accumulate data relative to the categories available. A simple table of the grouping is known as a frequency distribution, for example:

| Coin Face | Frequency |
| --- | --- |
| Heads | 45 |
| Tails | 55 |
| Total | 100 |

Similarly, if we examine the following numbers: 3, 5, 4, 2, 5, 6, 2, 4, 4, several things are apparent. First, the data needs to be grouped, and the first task is to establish an increment for the categories. Let us group the data according to integers, with no rounding of decimals. We can construct a table which groups the data.

| Integer | Frequency | Total (Integer x Frequency) |
| --- | --- | --- |
| 1 | 0 | 0 |
| 2 | 2 | 4 |
| 3 | 1 | 3 |
| 4 | 3 | 12 |
| 5 | 2 | 10 |
| 6 | 1 | 6 |
| Totals | 9 | 35 |
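The grouping step can also be done programmatically. The following is a minimal sketch in Python, assuming the nine values listed above; the function name `frequency_table` is an illustrative choice, not part of any standard procedure.

```python
from collections import Counter

# The nine observations listed in the text.
data = [3, 5, 4, 2, 5, 6, 2, 4, 4]


def frequency_table(values, categories):
    """Return (category, frequency, category * frequency) rows."""
    counts = Counter(values)  # a missing category counts as 0
    return [(c, counts[c], c * counts[c]) for c in categories]


# Group by integer categories 1 through 6.
rows = frequency_table(data, range(1, 7))
for integer, freq, total in rows:
    print(f"{integer:>7} {freq:>9} {total:>5}")

n = sum(freq for _, freq, _ in rows)              # number of observations
grand_total = sum(total for _, _, total in rows)  # sum of all values
print(f"{'Totals':>7} {n:>9} {grand_total:>5}")
```

The two totals (the number of values and the sum of the values) are exactly the quantities needed later to compute the mean.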

This data can be plotted as follows:

Figure B.1. Plot of frequency distribution

### MEAN, MEDIAN AND MODE

From the data, we can now define and compute three important parameters of statistics.

Definition
The mean: The average of all the values obtained. It is computed as the sum of all of the values ($\sum x$), divided by the number of values ($n$). The sum of all the values is 35 and there are 9 values; thus, the mean is 3.89.

$$M = \frac{\sum x}{n}$$

Definition
The median: The middle value when all of the values are arranged by magnitude. Arranged in order, our data is 2, 2, 3, 4, 4, 4, 5, 5, 6. The middle (fifth) of these nine values is 4, so the median is 4. Half of the remaining values lie at or below it and half lie at or above it.
Definition
The mode: The category that occurs with the highest frequency. For our data, the mode is equal to 4, since it occurs more often than any other value.
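As a quick check of the three definitions, the following sketch uses Python's standard `statistics` module on the nine values from the table above. It is offered only as an illustration; the variable names are arbitrary.

```python
import statistics

# The nine observations listed in the text.
data = [3, 5, 4, 2, 5, 6, 2, 4, 4]

mean = statistics.mean(data)      # sum of the values divided by their count
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

print(f"sorted data: {sorted(data)}")
print(f"mean   = {mean:.2f}")
print(f"median = {median}")
print(f"mode   = {mode}")
```

Note that `statistics.median` sorts the values and returns the middle one, which for an odd number of observations is a value actually present in the data.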

These values can now be used to characterize distribution patterns of data.

For our coin flipping, the likelihood of a Head or a Tail is equal. Another way of saying this is that there is an equal probability of obtaining a Head or not obtaining a Head with each flip of the coin. When an event and its opposite are equally probable, the number of times the event occurs over repeated trials follows a symmetric binomial distribution, and as the number of trials grows its plot approaches a Normal curve. If the coin is flipped ten times, the probability of one head and nine tails equals the probability of nine heads and one tail. The probability of two heads and eight tails equals the probability of eight heads and two tails, and so on. However, the probability of the latter (two heads) is greater than the probability of the former (one head). The most likely arrangement is five heads and five tails.
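The ten-flip probabilities described above can be computed directly from the binomial formula. The sketch below assumes a fair coin (probability 0.5 per flip) and independent flips; `binomial_probability` is a helper name introduced here for illustration.

```python
from math import comb


def binomial_probability(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_flips = 10
probabilities = [binomial_probability(k, n_flips) for k in range(n_flips + 1)]

for heads, prob in enumerate(probabilities):
    print(f"{heads:>2} heads, {n_flips - heads:>2} tails: {prob:.4f}")

# The arrangement with the highest probability.
most_likely = max(range(n_flips + 1), key=lambda k: probabilities[k])
print(f"most likely number of heads: {most_likely}")
```

The printed list is symmetric about five heads, which is the pattern the paragraph describes: one head is as likely as nine, two heads as likely as eight, and five heads is the single most likely outcome.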

When random data is arranged and displays a binomial distribution of this kind, a plot of frequency vs. occurrence will approximate a normal distribution curve. For an ideal set of data (i.e. no tricks, such as a two-headed coin, or gum on the edge of the coin), the data will be distributed in a bell-shaped curve, where the median, mode and mean are equal.

The mean, median and mode do not give an indication of the deviation of the data, and in particular do not inform us of the degree of dispersion of the data about the mean. The measure of the dispersion of data is known as the Standard Deviation. It is given mathematically by the formula:

$$S = \sqrt{\frac{\sum (M - x)^2}{n - 1}}$$

This value gives a measure of the variability of the data, and in particular, how widely the individual values are scattered about the mean. The more variable the data, the higher the value of the standard deviation.
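To make the calculation concrete, the following sketch works through the standard deviation for the nine values used earlier in this appendix, one step at a time, and then compares the result with `statistics.stdev` from Python's standard library (which also divides by n - 1). The intermediate variable names are illustrative only.

```python
import math
import statistics

# The nine observations listed earlier in this appendix.
data = [3, 5, 4, 2, 5, 6, 2, 4, 4]

n = len(data)
m = sum(data) / n  # the mean, M

# Squared deviation of each value from the mean.
squared_deviations = [(m - x) ** 2 for x in data]

# Sum the squared deviations, divide by n - 1, take the square root.
s = math.sqrt(sum(squared_deviations) / (n - 1))

print(f"mean               = {m:.3f}")
print(f"standard deviation = {s:.3f}")
print(f"library value      = {statistics.stdev(data):.3f}")
```

For this data set the standard deviation comes to about 1.36, meaning that a typical value lies a little more than one unit away from the mean.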

Other measures of variability are the range (difference between minimum and maximum values), the Coefficient of Variation (Standard Deviation divided by the Mean and expressed as a Percent) and the Variance.

The variance is the average squared deviation of the values from the mean, calculated relative to the total number of values. It is the square of the standard deviation. Variance can be calculated from the formula:

$$V = \frac{\sum (M - x)^2}{n - 1}$$
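The other measures of variability mentioned above (range, coefficient of variation and variance) can be computed in the same way. This is a minimal sketch using the nine values from earlier in this appendix; the variable names are assumptions made for the example.

```python
import math

# The nine observations listed earlier in this appendix.
data = [3, 5, 4, 2, 5, 6, 2, 4, 4]

n = len(data)
mean = sum(data) / n

# Variance: sum of squared deviations from the mean, divided by n - 1.
variance = sum((mean - x) ** 2 for x in data) / (n - 1)

# Standard deviation: the square root of the variance.
std_dev = math.sqrt(variance)

# Range: difference between the maximum and minimum values.
value_range = max(data) - min(data)

# Coefficient of variation: standard deviation over mean, as a percent.
cv_percent = std_dev / mean * 100

print(f"variance                 = {variance:.3f}")
print(f"standard deviation       = {std_dev:.3f}")
print(f"range                    = {value_range}")
print(f"coefficient of variation = {cv_percent:.1f}%")
```

Because the coefficient of variation is expressed relative to the mean, it can be used to compare the variability of data sets measured on different scales.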

All of these calculated parameters are for a single set of data that conforms to a normal distribution. Unfortunately, biological data does not always conform in this way, and often sets of data must be compared. If the data does not fit a binomial distribution, it often fits a skewed plot known as a Poisson distribution. This distribution occurs when the probability of an event is so low that the probability of its not occurring approaches 1. While this is a significant statistical event in biology, details of the Poisson distribution are left to texts on biological statistics.

Likewise, the proper handling of comparisons of multiple sets of data is left to such texts. Suffice it to say that all statistics comparing multiple sets begin with calculation of the parameters detailed here for each set of data. For example, the standard error of the mean (also known simply as the standard error) is often used to measure distinctions among populations. It is defined as the standard deviation of a distribution of means. Thus, the mean for each population is computed, and the collection of means is then used to calculate a standard deviation of those means.
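The definition of the standard error given above can be illustrated with a short sketch. The four samples below are made-up numbers chosen purely for illustration, not experimental data; the code computes the mean of each sample and then the standard deviation of those means.

```python
import statistics

# Hypothetical measurements from four repeated samples (made-up values).
samples = [
    [3, 5, 4, 2, 5],
    [4, 4, 6, 3, 5],
    [2, 5, 5, 4, 3],
    [6, 4, 3, 5, 4],
]

# Step 1: compute the mean of each sample.
sample_means = [statistics.mean(sample) for sample in samples]

# Step 2: the standard deviation of the collection of means.
standard_error = statistics.stdev(sample_means)

print(f"sample means   = {sample_means}")
print(f"standard error = {standard_error:.3f}")
```

In practice, the standard error is often estimated from a single sample as the standard deviation divided by the square root of the number of values, rather than by collecting many separate samples.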

Once all of these parameters are calculated, the general aim of statistical analysis is to estimate the significance of the data, and in particular the probability that the data represents effects of experimental treatment, or conversely, pure random distribution. Tests of significance (Student's t Test, Analysis of Variance and Confidence Limits) will also be left to more extensive treatment in other volumes.