Math 430: Lecture 3a

Summary Statistics and Data Visualizations

Professor Catalina Medina

Data

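library(tidyverse)  # glimpse(), summarize(), filter(), ggplot()
library(openintro)  # assumed source of the babies data
glimpse(babies)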
Rows: 1,236
Columns: 8
$ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
$ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
$ parity    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
$ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
$ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
$ smoke     <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…

?babies

case id number

bwt birthweight, in ounces

gestation length of gestation, in days

parity binary indicator for a first pregnancy (0 = first pregnancy)

age mother’s age in years

height mother’s height in inches

weight mother’s weight in pounds

smoke binary indicator for whether the mother smokes

Summary stats for categorical variables

Counts and proportions

library(janitor)
tabyl(babies, smoke)
 smoke   n     percent valid_percent
 FALSE 742 0.600323625     0.6052202
  TRUE 484 0.391585761     0.3947798
    NA  10 0.008090615            NA

NA means missing data. The smoke variable is missing for 10 of the 1,236 cases.
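
The percent and valid_percent columns differ only in their denominators. As a minimal sketch, the arithmetic below reproduces the FALSE row by hand from the counts printed in the table (the variable names are illustrative and not part of janitor):

```r
# Counts copied from the tabyl() output
n_false <- 742
n_true <- 484
n_missing <- 10

# percent: the denominator includes the rows where smoke is NA
percent_false <- n_false / (n_false + n_true + n_missing)

# valid_percent: the denominator excludes the rows where smoke is NA
valid_percent_false <- n_false / (n_false + n_true)

round(percent_false, 4)        # 0.6003
round(valid_percent_false, 4)  # 0.6052
```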

Summary stats for numeric variables

Mean

summarize(babies, mean(bwt))
# A tibble: 1 × 1
  `mean(bwt)`
        <dbl>
1        120.
mean(babies$bwt)
[1] 119.5769

Mean with missing values

summarize(babies, mean(bwt, na.rm = TRUE))
# A tibble: 1 × 1
  `mean(bwt, na.rm = TRUE)`
                      <dbl>
1                      120.
mean(babies$bwt, na.rm = TRUE)
[1] 119.5769
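
The two bwt results match because bwt has no missing values. To see what na.rm actually changes, here is a small sketch using the first five gestation values from the glimpse() output, one of which is NA (the object names are illustrative):

```r
# First five gestation values shown by glimpse(babies); the fourth is missing
gestation_sample <- c(284, 282, 279, NA, 282)

# Without na.rm, a single NA makes the whole result NA
mean_with_na <- mean(gestation_sample)

# With na.rm = TRUE, the NA is dropped before averaging
mean_without_na <- mean(gestation_sample, na.rm = TRUE)

mean_with_na     # NA
mean_without_na  # 281.75
```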

Median

summarize(babies, median(bwt))
# A tibble: 1 × 1
  `median(bwt)`
          <dbl>
1           120
median(babies$bwt)
[1] 120

Minimum

summarize(babies, min(bwt))
# A tibble: 1 × 1
  `min(bwt)`
       <int>
1         55
min(babies$bwt)
[1] 55

In a similar fashion, the maximum can be found by using the max() function.

Standard deviation

summarize(babies, sd(bwt))
# A tibble: 1 × 1
  `sd(bwt)`
      <dbl>
1      18.2
sd(babies$bwt)
[1] 18.23645

Variance

summarize(babies, var(bwt))
# A tibble: 1 × 1
  `var(bwt)`
       <dbl>
1       333.
var(babies$bwt)
[1] 332.5682
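
The standard deviation is the square root of the variance, which is why 18.23645 squared is about 332.57. As a rough sketch of what var() and sd() compute, the code below works through the sample variance by hand on the first five birth weights from the glimpse() output (a toy sample, not the full data; the object names are illustrative):

```r
# First five birth weights shown by glimpse(babies)
bwt_sample <- c(120, 113, 128, 123, 108)

n <- length(bwt_sample)
deviations <- bwt_sample - mean(bwt_sample)

# Sample variance: sum of squared deviations divided by n - 1
manual_var <- sum(deviations^2) / (n - 1)

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)

manual_var       # 63.3
var(bwt_sample)  # same value as manual_var
sd(bwt_sample)   # same value as manual_sd
```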

We can compute multiple summary statistics at once

summarize(babies, mean(bwt), sd(bwt), mean(age, na.rm = TRUE))
# A tibble: 1 × 3
  `mean(bwt)` `sd(bwt)` `mean(age, na.rm = TRUE)`
        <dbl>     <dbl>                     <dbl>
1        120.      18.2                      27.3

We can even look at summary statistics for different groups

group_by(babies, smoke) |> 
  summarize(mean(bwt))
# A tibble: 3 × 2
  smoke `mean(bwt)`
  <lgl>       <dbl>
1 FALSE        123.
2 TRUE         114.
3 NA           127.

The pipe symbol “|>” passes the result of the expression on its left as the first argument of the function on its right.

Notice I did not have to provide babies as the first argument to summarize().
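
To make the pipe concrete, here is a small sketch with base R functions showing that a piped call and the equivalent nested call give the same result. It assumes R 4.1 or later, where the native pipe is available; the object names are illustrative:

```r
values <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- sum(sqrt(values))

# Piped call: read from left to right
piped <- values |> sqrt() |> sum()

nested                    # 6
identical(nested, piped)  # TRUE
```

In the same way, group_by(babies, smoke) |> summarize(mean(bwt)) is equivalent to summarize(group_by(babies, smoke), mean(bwt)).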

Data Visualizations

  • are graphical representations of data

  • use different colors, shapes, and the coordinate system to summarize data

  • can tell a story or can be useful for exploring data

Bar plot

ggplot(babies)

Bar plot

ggplot(babies, aes(x = smoke)) 

Bar plot

ggplot(babies, aes(x = smoke)) +
  geom_bar()

Bar plot

filter(babies, !is.na(smoke)) |>
  ggplot(aes(x = smoke)) +
  geom_bar()

Histogram

ggplot(babies)

Histogram

ggplot(babies, aes(x = bwt))

Histogram

ggplot(babies, aes(x = bwt)) +
  geom_histogram()

Binwidth

ggplot(babies, aes(x = bwt)) +
  geom_histogram(
    binwidth = 5, 
    color = "darkgreen", 
    fill = "lightgray"
  )

When data display a skewed distribution, we rely on the median rather than the mean to describe the center, because the mean is pulled toward the long tail.
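
A quick sketch of why skew matters for the choice of center, using a made-up right-skewed sample (the numbers are hypothetical and not taken from babies):

```r
# Hypothetical right-skewed sample: one large value in the right tail
skewed <- c(1, 2, 2, 3, 3, 3, 4, 50)

skewed_mean <- mean(skewed)
skewed_median <- median(skewed)

skewed_mean    # 8.5, pulled toward the tail
skewed_median  # 3, stays with the bulk of the data
```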

Looking at Relationships

So far we have seen bar plots and histograms, which are useful for visualizing a single categorical variable and a single numeric variable, respectively.

We are often interested in looking at relationships between two variables. We have statistical tests to examine such relationships. However, visualizations can often help us explore if such relationships are worth looking into.

Standardized Bar Plots

ggplot(
  data = babies,
  aes(x = smoke, fill = parity)
) + 
  geom_bar(position = "fill")

Notice that a standardized bar plot shows proportions within each bar rather than counts.
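
The proportions drawn by position = "fill" can also be computed directly. The sketch below uses a small made-up data frame with the same two logical variables as babies (the data and object names are hypothetical) and base R's table() and prop.table():

```r
# Hypothetical toy data: 4 non-smokers and 4 smokers
toy <- data.frame(
  smoke  = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  parity = c(FALSE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)

# Two-way table of counts: rows are smoke, columns are parity
counts <- table(smoke = toy$smoke, parity = toy$parity)

# margin = 1 divides each row by its row total,
# so each smoke group sums to 1, like each bar in the plot
props <- prop.table(counts, margin = 1)
props
```

Each row of props corresponds to one bar in the standardized bar plot, and each entry to the height of one colored segment.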

Standardized Bar Plots

ggplot(
  data = babies,
  aes(x = smoke, fill = parity)
) + 
  geom_bar(position = "fill") +
  labs(
    x = "Smoke",
    y = "Proportion",
    fill = "First pregnancy"
  )

We can use the labs() function to specify labels

Dodged Bar Plot

ggplot(
  data = babies,
  aes(x = smoke, fill = parity)
) + 
  geom_bar(position = "dodge")

Side-by-Side Boxplots

ggplot(
  babies,
  aes(x = smoke, y = bwt)
) +
  geom_boxplot()

Understanding Each Box

  • The horizontal line in the box represents the median.
  • The box represents the middle 50% of the data with Q3 on the upper end and Q1 on the lower end.

Understanding Each Box

  • Whiskers extend from the box to the most extreme data points that lie within 1.5 × IQR of the box (i.e. of Q1 and Q3), where IQR = Q3 − Q1.
  • The points are potential outliers that represent babies with really low or high birth weight.
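
The pieces of a box plot can be computed by hand. This is a minimal sketch on made-up birth weights (hypothetical values; the object names are illustrative). It uses R's default quantile() method, so the quartiles may differ slightly from other software on other data:

```r
# Hypothetical birth weights in ounces, already sorted
x <- c(55, 100, 110, 115, 120, 125, 130, 135, 176)

q1 <- quantile(x, 0.25, names = FALSE)  # 110
q3 <- quantile(x, 0.75, names = FALSE)  # 130
iqr <- q3 - q1                          # 20

# Whiskers cannot reach beyond these limits
lower_limit <- q1 - 1.5 * iqr  # 80
upper_limit <- q3 + 1.5 * iqr  # 160

# Points beyond the limits are drawn individually as potential outliers
outliers <- x[x < lower_limit | x > upper_limit]

# Whiskers end at the most extreme points inside the limits
whisker_ends <- range(x[x >= lower_limit & x <= upper_limit])

outliers      # 55 176
whisker_ends  # 100 135
```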

Scatter plots

ggplot(
  babies,
  aes(x = gestation, y = bwt)
)  +
  geom_point()

Scatter plots

ggplot(
  babies,
  aes(x = gestation, y = bwt)
)  +
  geom_point() +
  labs(
    x = "Gestation (days)",
    y = "Birth weight (ounces)"
  )

Length of gestation can possibly explain a baby’s birth weight.

Gestation is the explanatory variable and is shown on the x-axis.

Birth weight is the response variable and is shown on the y-axis.

Linear Relationship

ggplot(
  babies,
  aes(x = gestation, y = bwt)
)  +
  geom_point() +
  labs(
    x = "Gestation (days)",
    y = "Birth weight (ounces)"
  ) +
  geom_smooth(
    method = "lm", 
    se = FALSE
  )

Next we will start statistical modeling during which we will numerically define the relationship between gestation and birth weight. For now we can say that this relationship looks positive and moderate.
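
As a small preview, and only as a sketch, the line drawn by geom_smooth(method = "lm") is a least-squares fit of the kind lm() produces, and the direction of the relationship shows up in the sign of the slope and of the correlation. The five data points below are made up for illustration and are not taken from babies:

```r
# Hypothetical gestation lengths (days) and birth weights (ounces)
gestation <- c(270, 275, 280, 285, 290)
bwt <- c(110, 114, 121, 124, 131)

# Correlation: the sign gives the direction of the linear relationship
r <- cor(gestation, bwt)

# Least-squares line: bwt = intercept + slope * gestation
fit <- lm(bwt ~ gestation)
slope <- coef(fit)[["gestation"]]

r > 0  # TRUE, a positive relationship
slope  # 1.04 ounces per additional day in this toy sample
```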

Meet Palmer Penguins

Data

library(palmerpenguins)  # assumed source of the penguins data
glimpse(penguins)
Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Visualizing Three Variables

ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, color = species)) +
  geom_point()

We can use color to consider a third variable

Visualizing Three Variables

ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, shape = species)) +
  geom_point(size = 3)

We can even use shapes

Visualizing Three Variables

ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, shape = species, color = species)) +
  geom_point(size = 3)

We can even use both!

Summary: Plots by data scenario

Single numeric

  • histogram (shape)
  • box plot (summary stats)

Single categorical

  • bar plots

Two numeric

  • scatter plot

Two categorical

  • standardized bar plot (proportions)
  • dodged bar plot (counts)

Two numeric one categorical

  • scatter plot with color/shape

One numeric one categorical

  • side-by-side box plot / histogram

Themes

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)", 
    y = "Bill Length (mm)", 
    title = "Palmer Penguins"
  ) +
  theme_gray()

theme_gray() is the default theme in ggplot2.

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)", 
    y = "Bill Length (mm)", 
    title = "Palmer Penguins"
  ) +
  theme_bw()

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)", 
    y = "Bill Length (mm)", 
    title = "Palmer Penguins"
  ) +
  theme_dark()

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)", 
    y = "Bill Length (mm)", 
    title = "Palmer Penguins"
  ) +
  theme_classic()

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)", 
    y = "Bill Length (mm)", 
    title = "Palmer Penguins"
  ) +
  theme_minimal()
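
The theme examples differ only in their last line. One way to avoid repeating the rest, sketched here, is to store the plot in an object and add a theme to it afterwards. The object names are illustrative, and the code assumes penguins comes from the palmerpenguins package:

```r
library(ggplot2)
library(palmerpenguins)  # assumed source of the penguins data

# Build the shared part of the plot once
base_plot <- ggplot(
  penguins,
  aes(x = bill_depth_mm, y = bill_length_mm, color = species)
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)",
    y = "Bill Length (mm)",
    title = "Palmer Penguins"
  )

# Add a different theme to the same stored plot
plot_bw <- base_plot + theme_bw()
plot_minimal <- base_plot + theme_minimal()

# Printing a plot object draws it, e.g. print(plot_bw)
```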