Introduction to Data Analysis and R: Measures of Centrality

Measures of Centrality

Measures of centrality, also called measures of central tendency, "describe the extent to which the values of a variable are alike" (Wallace & Van Fleet, 2012, p. 293). You are likely already familiar with them, as the most common of these measures are the mean (a.k.a. arithmetic mean), median, and mode. "Average" is a word commonly applied to measures of centrality, but because it can refer to any of them, I recommend avoiding it in favor of the more specific term for whichever measure you're using. Measures of centrality are used to analyze measurement variables.

Mean (Arithmetic Mean)

The mean is the sum of observations divided by the total number of observations. It can be expressed as:

$\bar{x}=\frac{\sum x}{n}$

Don't be intimidated; it breaks down easily. The x with the bar across the top ( $\bar{x}$ ) is a widely used symbol for the mean. (This is sometimes referred to as the "sample mean"; statisticians often use mu ( $\mu$ ), as you saw in the video you just watched, which is the notation for the "population mean". Since we're almost never working with an entire population, merely a sample of it, you can assume we're talking about the sample mean in this module unless otherwise stated.) The sigma ( $\Sigma$ ) is the symbol for summation and means to add together whatever elements follow it, which in this case are all the values of the variable we're looking at (x). The n is the total number of values for this variable. So the equation can be rewritten as:

$\bar{x}=\frac{\sum x}{n}=\frac{x_{1}+x_{2}+x_{3}+\dots+x_{n}}{n}$
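As a minimal sketch of this formula in R, the snippet below computes the mean two ways: directly as the sum divided by n, and with R's built-in mean() function. The vector x holds made-up values chosen only for illustration; it is not part of the course data.

```r
# Hypothetical example values (not from the course data)
x <- c(4, 8, 15, 16, 23, 42)

# Apply the formula directly: sum of the values divided by how many there are
n <- length(x)
manual_mean <- sum(x) / n

# R's built-in mean() performs the same calculation
builtin_mean <- mean(x)

print(manual_mean)
print(builtin_mean)
```

Both lines should print the same number, since mean() is just a shortcut for the formula.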

Median

If you rearrange the values of your observations and sort them from lowest to highest, the median is the value in the middle (if the total number of observations is odd) or the mean of the two middle values (if the total number of observations is even).
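The definition can be sketched step by step in R. The function name median_by_hand and the two small vectors are hypothetical, introduced here only to show the odd and even cases; in practice you would simply call R's built-in median().

```r
# Hypothetical helper that follows the definition step by step
median_by_hand <- function(x) {
  sorted <- sort(x)                      # lowest to highest
  n <- length(sorted)
  if (n %% 2 == 1) {
    sorted[(n + 1) / 2]                  # odd n: the single middle value
  } else {
    mean(sorted[c(n / 2, n / 2 + 1)])    # even n: mean of the two middle values
  }
}

odd_values  <- c(7, 1, 5)       # hypothetical data, n = 3
even_values <- c(7, 1, 5, 3)    # hypothetical data, n = 4

print(median_by_hand(odd_values))
print(median_by_hand(even_values))
print(median(even_values))      # built-in, for comparison
```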

Mode

The mode is the value that is most common in your dataset. The mode is generally less useful than either the mean or the median, but it can be helpful when working with multimodal distributions (those with more than one peak).
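Base R does not ship a function for the statistical mode (the built-in mode() reports how an object is stored, such as "numeric"), so one common approach is to tabulate the values and pick the most frequent. The sketch below assumes numeric data; the helper name stat_mode and the example vectors are hypothetical. It returns every value tied for most frequent, which is what you would want for a multimodal sample.

```r
# Hypothetical helper: tabulate the values and keep the most frequent one(s)
stat_mode <- function(x) {
  counts <- table(x)                            # frequency of each distinct value
  top <- names(counts)[counts == max(counts)]   # every value tied for most frequent
  as.numeric(top)                               # assumes x is numeric
}

one_mode  <- c(2, 3, 3, 5, 7)       # hypothetical data with a single mode
two_modes <- c(1, 1, 2, 4, 4, 6)    # hypothetical bimodal data

print(stat_mode(one_mode))
print(stat_mode(two_modes))
```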

Examples and Limitations

[Adapted from Vickers 2010, pp. 4-5]

Let's imagine that we want to examine the salaries of a group of people in an office building.

Person	Salary
1	$30,000
2	$40,000
3	$35,000
4	$50,000
5	$45,000
6	$85,000
7	$30,000
8	$1,000,000

We can calculate the mean by adding together all the salaries and dividing that sum by the total number of people. When we do that, we get a mean salary of $164,375. However, this shows one of the main limitations of the mean: it represents a sample poorly when the data are highly skewed toward one extreme. Here a single extreme value (an outlier), the $1,000,000 salary, pulls the mean above what seven of the eight people earn.
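Written out with the formula for the mean, the calculation for the eight salaries in the table is:

$\bar{x}=\frac{30{,}000+40{,}000+35{,}000+50{,}000+45{,}000+85{,}000+30{,}000+1{,}000{,}000}{8}=\frac{1{,}315{,}000}{8}=164{,}375$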

When we have skewed datasets like this, it can often be more helpful to know what the median is. To find this, we would sort our list of salaries in order from smallest to largest (or vice versa), and find the midpoint. In our sample of 8 people, this would be halfway between the fourth and fifth salaries in the sorted list ($40,000 and $45,000), or $42,500. This is a much better representation of our data than our highly skewed mean.

As mentioned earlier, the mode is typically not as helpful as either of our other two centrality measures, but it's easily calculated by finding the salary amount that pops up most often. In our sample, this is $30,000.
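The three results can be checked in R with a short sketch. The salary values come from the table above; the variable names, and the idea of recomputing without the $1,000,000 salary, are illustrative additions rather than part of the original example.

```r
# Salaries from the table, in person order
salary <- c(30000, 40000, 35000, 50000, 45000, 85000, 30000, 1000000)

salary_mean   <- mean(salary)
salary_median <- median(salary)

# Mode: tabulate the salaries and keep the most frequent value
counts <- table(salary)
salary_mode <- as.numeric(names(counts)[counts == max(counts)])

# Leave out the single extreme salary to see how each measure reacts
without_outlier <- salary[salary < 1000000]
mean_without    <- mean(without_outlier)
median_without  <- median(without_outlier)

print(c(mean = salary_mean, median = salary_median, mode = salary_mode))
print(c(mean = mean_without, median = median_without))
```

Dropping the one outlier moves the mean from $164,375 to $45,000, while the median only shifts from $42,500 to $40,000, which is why the median is the more stable summary for skewed data.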

R - Centrality Measures

(Don't hesitate to use the player controls to pause, rewind, or slow down the video as needed! A thorough understanding of the concepts is vastly preferable to just speeding through.)

Supplemental Readings

Learning Statistics with R, chapter 5.1 - Measures of central tendency
(Especially read ch. 5.1.6, on trimmed means)
Learning Statistics with R, chapter 5.8.1 - Handling missing values
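As a quick preview of the two topics in these readings, the sketch below shows the trim and na.rm arguments of R's mean() and median(). The vector x is made up for illustration, with one missing value (NA) and one extreme value; the readings remain the authoritative explanation.

```r
# Hypothetical values with one missing observation and one extreme value
x <- c(12, 15, 11, NA, 14, 13, 90)

plain_mean   <- mean(x)                            # NA: missing values propagate by default
clean_mean   <- mean(x, na.rm = TRUE)              # remove the NA before averaging
clean_median <- median(x, na.rm = TRUE)
trimmed_mean <- mean(x, trim = 0.2, na.rm = TRUE)  # also drop 20% of values from each end

print(plain_mean)
print(clean_mean)
print(clean_median)
print(trimmed_mean)
```

With six non-missing values, trim = 0.2 discards the single lowest and single highest value before averaging, so the extreme value no longer influences the result.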

Centrality Measures Practice

You will be using the "weatherData" dataset (linked below) to answer the questions in this quiz.

weatherData