MCS 142
Section 1.2: Describing Distributions with Numbers (Numerical Summaries)
Measuring “the” center
The median of a data set: Sort the n values in order;
the median M is the middle value if n is odd;
the median M is the sum of the middle two values divided by 2 if n is even.
E.g., the median of 1, 3, 4, 17 is M = (3+4)/2 = 3.5.
The median of 1, 3, 4, 17, 20 is M = 4.
The mean of a data set: x-bar is the sum of the values divided by the number of them.
E.g., the mean of 1, 3, 4, 17 is (1 + 3 + 4 + 17)/4 = 25/4 = 6.25.
The mean of 1, 3, 4, 17, 20 is (1 + 3 + 4 + 17 + 20)/5 = 9.
Properties: The mean is sensitive to outliers; the median is not.
The mean is the “center of mass” of the data. The median is the middle value.
A mean noticeably larger than the median suggests skewness to the right.
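The two worked examples can be checked with a minimal sketch using Python's standard `statistics` module. The variable names and the data set `with_outlier` are made up for illustration and are not part of the notes.

```python
from statistics import mean, median

small = [1, 3, 4, 17]
large = [1, 3, 4, 17, 20]

median_small = median(small)   # n even: average of the two middle values
median_large = median(large)   # n odd: the single middle value
mean_small = mean(small)
mean_large = mean(large)

print(median_small, mean_small)
print(median_large, mean_large)

# Hypothetical data: replacing the largest value by an extreme one
# moves the mean a lot but leaves the median where it was.
with_outlier = [1, 3, 4, 17, 2000]
print(median(with_outlier), mean(with_outlier))
```

In both original data sets the mean exceeds the median, consistent with the long tail toward 17 and 20.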
Quartiles, percentiles
Idea/example: The 70th percentile is “the” value below which 70% of the values lie.
The median is the 50th percentile. The 25th and 75th percentiles are also called the first quartile Q1 and third quartile Q3, respectively. Exact definitions vary slightly. For us,
Q1 = median of values preceding the middle of the list of data in ascending order.
Q3 = median of values following the middle of the list of data in ascending order.
E.g., for 1, 3, 4, 17, Q1 = (1 + 3)/2 = 2; Q3 = (4 + 17)/2 = 10.5.
For 1, 3, 4, 17, 20, Q1 = (1 + 3)/2 = 2; Q3 = (17 + 20)/2 = 18.5.
Gustavus and other colleges report middle-50% ACT scores as the range from Q1 to Q3, e.g., 22 to 28.
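A sketch of the quartile convention used in these notes, assuming that when n is odd the middle value belongs to neither half. The function name `quartiles` is hypothetical. Library routines often use a different convention and may return slightly different values.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as medians of the lower and upper halves.

    Assumes at least two values. For odd n the middle value is
    excluded from both halves.
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]               # values preceding the middle
    upper = data[len(data) - half:]   # values following the middle
    return median(lower), median(upper)

q_small = quartiles([1, 3, 4, 17])
q_large = quartiles([1, 3, 4, 17, 20])
print(q_small, q_large)
```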
Measuring the spread (variability, dispersion)
The range of a data set is its maximum minus its minimum.
E.g., the range of 1, 3, 4, 17 is 17 – 1 = 16.
The range of 1, 3, 4, 17, 20 is 20 – 1 = 19.
The interquartile range is IQR = Q3 - Q1.
E.g., for 1, 3, 4, 17, IQR = 10.5 – 2 = 8.5.
For 1, 3, 4, 17, 20, IQR = 18.5 – 2 = 16.5.
Rule of thumb for an outlier: a value x is a suspected outlier if x < Q1 - 1.5 IQR or x > Q3 + 1.5 IQR.
E.g., for 1, 3, 4, 17, Q3 + 1.5 IQR = 10.5 + 1.5 * 8.5 = 23.25, so 17 is not(!) an
outlier by this rule. (But here the values of Q3 and IQR both involved 17.)
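The rule of thumb can be sketched as follows, assuming the quartile convention of these notes. The function name `outlier_fences` and the second data set are made up for illustration. The second call shows how, in a tiny data set, an extreme value inflates Q3 and the IQR enough to escape the rule.

```python
from statistics import median

def outlier_fences(values):
    """Return (Q1 - 1.5 IQR, Q3 + 1.5 IQR) for at least two values."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [1, 3, 4, 17]
low, high = outlier_fences(data)
outliers = [x for x in data if x < low or x > high]
print(low, high, outliers)

# Hypothetical data: 200 itself enters Q3, so the upper fence moves out with it.
extreme = [1, 3, 4, 17, 200]
low2, high2 = outlier_fences(extreme)
outliers2 = [x for x in extreme if x < low2 or x > high2]
print(low2, high2, outliers2)
```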
The (sample) variance s^2 of n data values is the sum of the squared deviations from the mean divided by n - 1. The (sample) standard deviation s is its square root.
The (population) variance σ^2 of n data values is the sum of the squared deviations from the mean divided by n. The (population) standard deviation σ is its square root.
E.g., for 1, 3, 4, 17, s^2 = [(1 - 6.25)^2 + (3 - 6.25)^2 + (4 - 6.25)^2 + (17 - 6.25)^2]/(4 - 1) = 52.91666… and s = 7.27 (approx.).
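For 1, 3, 4, 17, 20, the mean is 9, so s^2 = [(1 - 9)^2 + (3 - 9)^2 + (4 - 9)^2 + (17 - 9)^2 + (20 - 9)^2]/(5 - 1) = 310/4 = 77.5 and s = 8.80 (approx.).
For comparison, the population variance of 1, 3, 4, 17 is σ^2 = 158.75/4 = 39.6875, with σ = 6.30 (approx.).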
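A minimal sketch for checking variance and standard deviation computations with Python's standard `statistics` module, where `variance` and `stdev` divide by n - 1 and `pvariance` and `pstdev` divide by n. The variable names are made up for illustration.

```python
from statistics import pstdev, pvariance, stdev, variance

small = [1, 3, 4, 17]
large = [1, 3, 4, 17, 20]

# Sample versions: sum of squared deviations divided by n - 1.
s2_small, s_small = variance(small), stdev(small)
s2_large, s_large = variance(large), stdev(large)

# Population versions: the same sum divided by n.
sigma2_small, sigma_small = pvariance(small), pstdev(small)

print(s2_small, s_small)
print(s2_large, s_large)
print(sigma2_small, sigma_small)
```

For the same data the population variance is always the smaller of the two, since the same sum is divided by n rather than n - 1.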
Properties: The variance is measured in squared units (unit^2), while the standard deviation is measured in the same units as the data. The variance and standard deviation are sensitive to outliers. The interquartile range is not.
Extra info: The mean absolute deviation does not introduce the distortion of squaring the deviations. MAD = sum of absolute values of the deviations from the mean divided by n.
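A sketch of the mean absolute deviation as defined here (deviations taken from the mean, divided by n). The function name `mean_absolute_deviation` is hypothetical. Note that elsewhere "MAD" sometimes refers to the median absolute deviation, which is a different quantity.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    center = mean(values)
    return sum(abs(x - center) for x in values) / len(values)

mad_small = mean_absolute_deviation([1, 3, 4, 17])
mad_large = mean_absolute_deviation([1, 3, 4, 17, 20])
print(mad_small, mad_large)
```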
The five-number summary: min, Q1, M, Q3, max.
E.g., for 1, 3, 4, 17 it is 1, 2, 3.5, 10.5, 17.
For 1, 3, 4, 17, 20 it is 1, 2, 4, 18.5, 20.
(It is weird to use the 5-number summary for 4 or 5 data.)
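The five-number summary can be sketched by combining the pieces defined earlier, again assuming the quartile convention of these notes. The function name `five_number_summary` is hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) for at least two values."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])               # median of the lower half
    q3 = median(data[len(data) - half:])   # median of the upper half
    return data[0], q1, median(data), q3, data[-1]

summary_small = five_number_summary([1, 3, 4, 17])
summary_large = five_number_summary([1, 3, 4, 17, 20])
print(summary_small)
print(summary_large)
```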
A boxplot (box-and-whiskers plot) pictures the 5-number summary. ((Draw one here.))
A modified boxplot puts the “whiskers” at the smallest and largest values within 1.5 IQR of the first and third quartiles and plots the outliers as points or bubbles.
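A sketch of where a modified boxplot would place its whiskers, assuming the quartile convention of these notes. The function name `whiskers_and_outliers` and the data set are made up for illustration. Plotting libraries may compute quartiles differently.

```python
from statistics import median

def whiskers_and_outliers(values):
    """Return ((low whisker, high whisker), outliers) for at least two values."""
    data = sorted(values)
    half = len(data) // 2
    q1 = median(data[:half])
    q3 = median(data[len(data) - half:])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme values still within 1.5 IQR of the quartiles.
    inside = [x for x in data if low <= x <= high]
    outliers = [x for x in data if x < low or x > high]
    return (inside[0], inside[-1]), outliers

# Hypothetical data with one clearly extreme value.
ends, points = whiskers_and_outliers([1, 2, 3, 4, 5, 6, 7, 50])
print(ends, points)
```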
Side-by-side boxplots facilitate the comparison of two or more distributions.
IPS 5/e p. 61 compares income distributions for different levels of education.
As a result of a linear transformation y = a + b x of the data, the numerical summaries change as follows:
y-bar = a + b x-bar
s_y^2 = b^2 s_x^2
s_y = |b| s_x.
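The transformation rules can be checked numerically. This sketch uses the data 1, 3, 4, 17 with arbitrarily chosen constants a = 10 and b = -2 (assumptions for illustration); the negative b shows why the absolute value appears in the rule for the standard deviation.

```python
from math import isclose
from statistics import mean, stdev, variance

a, b = 10, -2              # arbitrary constants chosen for illustration
x = [1, 3, 4, 17]
y = [a + b * xi for xi in x]

mean_ok = isclose(mean(y), a + b * mean(x))
var_ok = isclose(variance(y), b**2 * variance(x))
sd_ok = isclose(stdev(y), abs(b) * stdev(x))

print(mean_ok, var_ok, sd_ok)
```

The shift a changes only the mean; the spread measures depend on b alone.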