An R Introduction to Statistics

Frequency Distribution of Quantitative Data

The frequency distribution of a data variable summarizes how often the data values occur within each of a collection of non-overlapping categories.

Example

In the data set faithful, the frequency distribution of the eruptions variable is the summary of eruptions according to some classification of the eruption durations.

Problem

Find the frequency distribution of the eruption durations in faithful.

Solution

The solution consists of the following steps:

  1. We first find the range of eruption durations with the range function. It shows that the observed eruptions are between 1.6 and 5.1 minutes in duration.
    > duration = faithful$eruptions 
    > range(duration) 
    [1] 1.6 5.1
  2. Break the range into non-overlapping sub-intervals by defining a sequence of equally spaced break points. If we round the lower endpoint of the interval [1.6, 5.1] down and the upper endpoint up to the nearest half-integers, we come up with the interval [1.5, 5.5]. Hence we set the break points to be the half-integer sequence { 1.5, 2.0, 2.5, ..., 5.5 }.
    > breaks = seq(1.5, 5.5, by=0.5)    # half-integer sequence 
    > breaks 
    [1] 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
  3. Classify the eruption durations according to the half-unit-length sub-intervals with cut. By default cut produces intervals that are open on the left and closed on the right. As we want the intervals to be closed on the left and open on the right, we set the right argument to FALSE.
    > duration.cut = cut(duration, breaks, right=FALSE)
  4. Compute the frequency of eruptions in each sub-interval with the table function.
    > duration.freq = table(duration.cut)
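
The break points in step 2 can also be derived from the data rather than typed in by hand. The following is a minimal sketch, assuming a class width of 0.5 as above; the names width, lower and upper are introduced here for illustration only and are not part of the original solution.

```r
# Sketch: round the observed range outward to multiples of the class width,
# then repeat steps 2 to 4 with the computed break points.
duration = faithful$eruptions

width = 0.5                                       # assumed class width (half a minute)
lower = floor(min(duration) / width) * width      # 1.6 is rounded down to 1.5
upper = ceiling(max(duration) / width) * width    # 5.1 is rounded up to 5.5

breaks = seq(lower, upper, by = width)            # equally spaced break points
duration.cut = cut(duration, breaks, right = FALSE)   # classes of the form [a, b)
duration.freq = table(duration.cut)               # count per class
```

Rounding outward with floor and ceiling guarantees that the smallest and largest observations both fall inside the break range, which plain rounding to the nearest half-integer would not.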

Answer

The frequency distribution of the eruption durations is:

> duration.freq 
duration.cut 
[1.5,2) [2,2.5) [2.5,3) [3,3.5) [3.5,4) [4,4.5) [4.5,5) 
     51      41       5       7      30      73      61 
[5,5.5) 
      4
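
As a quick sanity check on the result, every observation should land in exactly one sub-interval. The sketch below recomputes the frequencies from the steps above and tests two conditions: that cut produced no missing class labels, and that the counts add up to the number of observations.

```r
# Sketch: verify that the classification covers every observation.
duration = faithful$eruptions
breaks = seq(1.5, 5.5, by = 0.5)
duration.cut = cut(duration, breaks, right = FALSE)
duration.freq = table(duration.cut)

# A value outside the break range would be classified as NA by cut.
has.unclassified = any(is.na(duration.cut))

# The class counts should total the sample size.
counts.match = sum(duration.freq) == length(duration)

has.unclassified
counts.match
```

With the break points chosen above, the first check should report FALSE and the second TRUE.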

Enhanced Solution

We apply the cbind function to print the result in column format.

> cbind(duration.freq) 
        duration.freq 
[1.5,2)            51 
[2,2.5)            41 
[2.5,3)             5 
[3,3.5)             7 
[3.5,4)            30 
[4,4.5)            73 
[4.5,5)            61 
[5,5.5)             4
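
It may help to see what cbind returns here: a one-column matrix whose row names are the class labels. The sketch below inspects that matrix and, as an alternative not used in the original solution, converts the table to a data frame with as.data.frame. The names freq.col and freq.df are illustrative.

```r
# Sketch: two ways to hold the frequency distribution in column form.
duration = faithful$eruptions
breaks = seq(1.5, 5.5, by = 0.5)
duration.cut = cut(duration, breaks, right = FALSE)
duration.freq = table(duration.cut)

freq.col = cbind(duration.freq)   # one-column matrix, row names are the class labels
dim(freq.col)                     # number of rows and columns

# Alternative: a data frame with one column of class labels
# (named after the table's dimension, duration.cut) and one of counts (Freq).
freq.df = as.data.frame(duration.freq)
names(freq.df)
```

The matrix form is convenient for printing, while the data frame keeps the class labels as an ordinary column that can be sorted or filtered.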

Note

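The R documentation for cut notes that hist(x, breaks, plot = FALSE) is more efficient and less memory hungry than table(cut(x, breaks)). As hist also uses right-closed intervals by default, the right argument must be set to FALSE to reproduce the sub-intervals used here.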
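
A minimal sketch of the hist approach follows, using the same break points as in the solution. Setting plot = FALSE suppresses the plot and returns the histogram object, whose counts component holds the frequencies. The name h is illustrative.

```r
# Sketch: frequency distribution via hist instead of table(cut(...)).
duration = faithful$eruptions
breaks = seq(1.5, 5.5, by = 0.5)

# right = FALSE gives classes closed on the left and open on the right.
h = hist(duration, breaks = breaks, right = FALSE, plot = FALSE)
h$counts                          # one count per sub-interval

# Compare with the table-based result from the solution.
duration.freq = table(cut(duration, breaks, right = FALSE))
all(h$counts == as.vector(duration.freq))
```

Note that h$counts is a plain numeric vector without class labels, so the break points in h$breaks are needed to interpret it.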

Exercise

  1. Find the frequency distribution of the eruption waiting periods in faithful.
  2. Find programmatically the duration sub-interval that has the most eruptions.