Math 430: Lecture 3a

Summary Statistics and Data Visualizations

Professor Catalina Medina

Data

glimpse(babies)

Rows: 1,236
Columns: 8
$ case      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1…
$ bwt       <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144, …
$ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351, 282, 2…
$ parity    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALS…
$ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, 36, 30, …
$ height    <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, 61, 63, …
$ weight    <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120, 124, 1…
$ smoke     <lgl> FALSE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE,…

?babies

case id number

bwt birthweight, in ounces

gestation length of gestation, in days

parity binary indicator for a first pregnancy (0 = first pregnancy)

age mother’s age in years

height mother’s height in inches

weight mother’s weight in pounds

smoke binary indicator for whether the mother smokes

Summary stats for categorical variables

Counts and proportions

library(janitor)
tabyl(babies, smoke)

 smoke   n     percent valid_percent
 FALSE 742 0.600323625     0.6052202
  TRUE 484 0.391585761     0.3947798
    NA  10 0.008090615            NA

NA means missing data. We have 10 missing values for the smoke variable.
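The percent and valid_percent columns differ only in their denominator. As a minimal base R sketch (the vector `smoke` below is rebuilt from the counts in the table above and only stands in for babies$smoke), the two columns can be reproduced like this:

```r
# Rebuild a logical vector with the same counts as the tabyl() output above
# (742 FALSE, 484 TRUE, 10 NA); this is a stand-in, not the real data.
smoke <- rep(c(FALSE, TRUE, NA), times = c(742, 484, 10))

# Counts including the missing values
counts <- table(smoke, useNA = "ifany")

# percent: divide by all rows, missing values included
percent <- counts / length(smoke)

# valid_percent: divide by the non-missing rows only
valid_counts <- table(smoke)
valid_percent <- valid_counts / sum(!is.na(smoke))

print(round(percent, 4))
print(round(valid_percent, 4))
```

Because the missing rows are left out of the denominator, the valid_percent values add up to 1 across FALSE and TRUE.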

Summary stats for numeric variables

Mean

summarize(babies, mean(bwt))

# A tibble: 1 × 1
  `mean(bwt)`
        <dbl>
1        120.

mean(babies$bwt)

[1] 119.5769

Mean with missing values

summarize(babies, mean(bwt, na.rm = TRUE))

# A tibble: 1 × 1
  `mean(bwt, na.rm = TRUE)`
                      <dbl>
1                      120.

mean(babies$bwt, na.rm = TRUE)

[1] 119.5769
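The bwt column appears to have no missing values, which is why both calls return the same mean. The effect of na.rm is easier to see on a variable that does contain an NA. The sketch below uses the first six gestation values shown in the glimpse() output, typed in by hand:

```r
# First six gestation values from the glimpse() output above; the fourth is missing
gestation <- c(284, 282, 279, NA, 282, 286)

# By default, a single NA makes the mean NA
mean(gestation)

# na.rm = TRUE drops the missing value before averaging the other five
mean(gestation, na.rm = TRUE)

# Count how many values are missing
sum(is.na(gestation))
```

The same argument works for median(), min(), max(), sd(), and var().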

Median

summarize(babies, median(bwt))

# A tibble: 1 × 1
  `median(bwt)`
          <dbl>
1           120

median(babies$bwt)

[1] 120

Minimum

summarize(babies, min(bwt))

# A tibble: 1 × 1
  `min(bwt)`
       <int>
1         55

min(babies$bwt)

[1] 55

In a similar fashion, the maximum can be found by using the max() function.
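As a small illustration, here are max(), min(), and range() applied to the first twelve birth weights shown in the glimpse() output (typed in by hand, so the results describe only these twelve values, not the full data):

```r
# First twelve birth weights (ounces) from the glimpse() output above
bwt <- c(120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 140, 144)

max(bwt)    # largest value
min(bwt)    # smallest value
range(bwt)  # both at once, returned as c(min, max)
```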

Standard deviation

summarize(babies, sd(bwt))

# A tibble: 1 × 1
  `sd(bwt)`
      <dbl>
1      18.2

sd(babies$bwt)

[1] 18.23645

Variance

summarize(babies, var(bwt))

# A tibble: 1 × 1
  `var(bwt)`
       <dbl>
1       333.

var(babies$bwt)

[1] 332.5682
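The variance is the square of the standard deviation: 18.23645 squared is about 332.568. The sketch below computes both by hand on the first five birth weights from the glimpse() output and checks them against var() and sd(). The variable names are chosen for illustration.

```r
# First five birth weights (ounces) from the glimpse() output above
x <- c(120, 113, 128, 123, 108)

n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1
var_by_hand <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
sd_by_hand <- sqrt(var_by_hand)

all.equal(var_by_hand, var(x))
all.equal(sd_by_hand, sd(x))
```

The standard deviation is in the same units as the data (ounces), while the variance is in squared units.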

We can compute multiple summary statistics at once

summarize(babies, mean(bwt), sd(bwt), mean(age, na.rm = TRUE))

# A tibble: 1 × 3
  `mean(bwt)` `sd(bwt)` `mean(age, na.rm = TRUE)`
        <dbl>     <dbl>                     <dbl>
1        120.      18.2                      27.3
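The column names in the output above are the full expressions, which can be awkward to reuse. One option is to name each summary inside summarize(). This is a sketch on a toy tibble built from the first six rows of the glimpse() output. The names `toy`, `mean_bwt`, `sd_bwt`, and `mean_age` are made up for illustration.

```r
library(dplyr)

# Toy data: first six values of bwt and age from the glimpse() output above
toy <- tibble(
  bwt = c(120, 113, 128, 123, 108, 136),
  age = c(27, 33, 28, 36, 23, 25)
)

# name = expression gives each summary column a readable name
result <- summarize(
  toy,
  mean_bwt = mean(bwt),
  sd_bwt   = sd(bwt),
  mean_age = mean(age, na.rm = TRUE)
)
result
```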

We can even look at summary statistics for different groups

group_by(babies, smoke) |> 
  summarize(mean(bwt))

# A tibble: 3 × 2
  smoke `mean(bwt)`
  <lgl>       <dbl>
1 FALSE        123.
2 TRUE         114.
3 NA           127.

The pipe operator “|>” passes the result on its left to the function on its right as that function’s first argument.

Notice I did not have to provide babies as the first argument to summarize().
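To see that the pipe only changes how the code is written and not what it computes, here is a small base R sketch (it assumes R 4.1 or later, where |> is available). The vector holds the first six gestation values from the glimpse() output.

```r
# First six gestation values from the glimpse() output above
gestation <- c(284, 282, 279, NA, 282, 286)

# Without the pipe: the data is written as the first argument
a <- mean(gestation, na.rm = TRUE)

# With the pipe: the left-hand side is passed in as the first argument
b <- gestation |> mean(na.rm = TRUE)

# Pipes can be chained, so the steps read left to right
d <- gestation |> mean(na.rm = TRUE) |> round(1)

identical(a, b)
```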

Data Visualizations

are graphical representations of data
use different colors, shapes, and the coordinate system to summarize data
can tell a story or can be useful for exploring data

Bar plot

ggplot(babies)

Bar plot

ggplot(babies, aes(x = smoke))

Bar plot

ggplot(babies, aes(x = smoke)) +
  geom_bar()

Bar plot

filter(babies, !is.na(smoke)) |> 
  ggplot(aes(x = smoke)) +
  geom_bar()
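The last version removes the NA bar by filtering out rows with a missing smoke value before plotting. A minimal sketch of what that filter does, on a made-up four-row tibble (`toy` and `kept` are hypothetical names):

```r
library(dplyr)

# Toy data: smoke has one missing value
toy <- tibble(
  bwt   = c(120, 113, 128, 123),
  smoke = c(FALSE, TRUE, NA, FALSE)
)

# is.na() flags missing values; ! flips TRUE and FALSE
is.na(toy$smoke)
!is.na(toy$smoke)

# filter() keeps only the rows where the condition is TRUE
kept <- filter(toy, !is.na(smoke))
nrow(kept)
```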

Histogram

ggplot(babies)

Histogram

ggplot(babies, aes(x = bwt))

Histogram

ggplot(babies, aes(x = bwt)) +
  geom_histogram()

Binwidth

ggplot(babies, aes(x = bwt)) +
  geom_histogram(
    binwidth = 5, 
    color = "darkgreen", 
    fill = "lightgray"
  )

When data display a skewed distribution, we rely on the median rather than the mean to describe the center of the distribution.
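The reason is that the mean is pulled toward extreme values while the median is not. A small sketch with made-up numbers (not from the babies data):

```r
# Hypothetical right-skewed values: most are small, one is very large
x <- c(1, 2, 3, 4, 100)

mean(x)    # 22: pulled toward the extreme value
median(x)  # 3: the middle value

# Making the largest value even larger changes the mean but not the median
y <- c(1, 2, 3, 4, 1000)
mean(y)    # 202
median(y)  # 3
```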

Looking at Relationships

So far we have seen bar plots and histograms, which are useful for visualizing a single categorical variable and a single numerical variable, respectively.

We are often interested in looking at relationships between two variables. We have statistical tests to examine such relationships. However, visualizations can often help us explore if such relationships are worth looking into.

Standardized Bar Plots

ggplot(
  data = babies,
  aes(x = smoke, fill = parity)
) + 
  geom_bar(position = "fill")

Notice standardized bar plots are actually plotting proportions
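The proportions behind such a plot can also be computed directly. The sketch below uses base R on two made-up logical vectors that stand in for babies$smoke and babies$parity. It is one way to get the numbers, not necessarily how ggplot2 computes them internally.

```r
# Hypothetical toy data standing in for babies$smoke and babies$parity
smoke  <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
parity <- c(FALSE, FALSE, FALSE, TRUE,  FALSE, TRUE, TRUE, TRUE)

# Counts for each smoke/parity combination
counts <- table(smoke, parity)

# Proportions of parity within each smoke group: each row sums to 1,
# matching the full-height bars drawn with position = "fill"
props <- prop.table(counts, margin = 1)
props
```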

Standardized Bar Plots

ggplot(
  data = babies,
  aes(x = smoke, fill = parity)
) + 
  geom_bar(position = "fill") +
  labs(
    x = "Smoke",
    y = "Proportion",
    fill = "First pregnancy"
  )

We can use the labs() function to specify labels

Dodged Bar Plot

ggplot(
  data = babies,
  aes(x = smoke, fill = parity)
) + 
  geom_bar(position = "dodge")

Side-by-Side Boxplots

ggplot(
  babies,
  aes(x = smoke, y = bwt)
) +
  geom_boxplot()

Understanding Each Box

The horizontal line in the box represents the median.
The box represents the middle 50% of the data with Q3 on the upper end and Q1 on the lower end.

Understanding Each Box

Whiskers extend from the box to the most extreme points that lie within 1.5 × IQR of the box (i.e., of Q1 and Q3).
The points are potential outliers that represent babies with really low or high birth weight.
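The pieces of a box can be computed by hand. This is a sketch on made-up birth weights using quantile() with its default settings. The names `lower_fence` and `upper_fence` are chosen for illustration.

```r
# Hypothetical birth weights in ounces, sorted for readability
bwt <- c(55, 108, 113, 120, 120, 123, 128, 132, 136, 138, 176)

q1  <- unname(quantile(bwt, 0.25))  # lower edge of the box
med <- median(bwt)                  # line inside the box
q3  <- unname(quantile(bwt, 0.75))  # upper edge of the box
iqr <- q3 - q1                      # height of the box

# Whiskers may reach at most 1.5 * IQR beyond the box
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Values outside the fences are the ones drawn as individual points
bwt[bwt < lower_fence | bwt > upper_fence]
```

Here Q1 is 116.5 and Q3 is 134, so the fences are 90.25 and 160.25, and the two values outside them (55 and 176) would be flagged as potential outliers.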

Scatter plots

ggplot(
  babies,
  aes(x = gestation, y = bwt)
)  +
  geom_point()

Scatter plots

ggplot(
  babies,
  aes(x = gestation, y = bwt)
)  +
  geom_point() +
  labs(
    x = "Gestation (days)",
    y = "Birth weight (ounces)"
  )

Length of gestation can possibly explain a baby’s birth weight.

Gestation is the explanatory variable and is shown on the x-axis.

Birth weight is the response variable and is shown on the y-axis.

Linear Relationship

ggplot(
  babies,
  aes(x = gestation, y = bwt)
)  +
  geom_point() +
  labs(
    x = "Gestation (days)",
    y = "Birth weight (ounces)"
  ) +
  geom_smooth(
    method = "lm", 
    se = FALSE
  )

Next, we will start statistical modeling, during which we will numerically define the relationship between gestation and birth weight. For now, we can say that this relationship looks positive and moderate.
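As a preview, the line drawn by geom_smooth(method = "lm") is a least-squares line, which can also be fit directly with lm(). The sketch below uses made-up values rather than the babies data, so the coefficients illustrate the idea only.

```r
# Hypothetical values, made up for illustration (not taken from babies)
toy <- data.frame(
  gestation = c(250, 260, 270, 280, 290),
  bwt       = c(100, 108, 118, 122, 132)
)

# Least-squares line: bwt = intercept + slope * gestation
fit <- lm(bwt ~ gestation, data = toy)
coef(fit)

# A positive slope means birth weight tends to rise with gestation
slope <- unname(coef(fit)["gestation"])
slope > 0
```

With these toy values the slope is 0.78, meaning the fitted birth weight increases by 0.78 ounces for each additional day of gestation.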

Meet Palmer Penguins¹

Data

glimpse(penguins)

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Visualizing Three Variables

ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, color = species)) +
  geom_point()

We can use color to consider a third variable

Visualizing Three Variables

ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, shape = species)) +
  geom_point(size = 3)

We can even use shapes

Visualizing Three Variables

ggplot(penguins, aes(x = body_mass_g, y = bill_length_mm, shape = species, color = species)) +
  geom_point(size = 3)

We can even use both!

Summary: Plots by data scenario

Single numeric

histogram (shape)
box plot (summary stats)

Single categorical

bar plots

Two numeric

scatter plot

Two categorical

standardized bar plot (proportions)
dodged bar plot (counts)

Two numeric one categorical

scatter plot with color/shape

One numeric one categorical

side-by-side box plot / histogram

Themes

theme_gray()
theme_bw()
theme_dark()
theme_classic()
theme_minimal()

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)",
    y = "Bill Length (mm)",
    title = "Palmer Penguins"
  ) +
  theme_gray()

theme_gray() is the default theme in ggplot2.

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)",
    y = "Bill Length (mm)",
    title = "Palmer Penguins"
  ) +
  theme_bw()

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)",
    y = "Bill Length (mm)",
    title = "Palmer Penguins"
  ) +
  theme_dark()

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)",
    y = "Bill Length (mm)",
    title = "Palmer Penguins"
  ) +
  theme_classic()

ggplot(
  penguins,
  aes(
    x = bill_depth_mm,
    y = bill_length_mm,
    color = species
  )
) +
  geom_point() +
  labs(
    x = "Bill Depth (mm)",
    y = "Bill Length (mm)",
    title = "Palmer Penguins"
  ) +
  theme_minimal()
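Since the five plots above differ only in the last line, one way to avoid repeating the code is to store the plot in a variable and add a theme to it afterwards. This sketch uses a small made-up data frame with the same column names as penguins. The names `toy`, `p`, `p_bw`, and `p_minimal` are chosen for illustration.

```r
library(ggplot2)

# Hypothetical toy data with the same column names as penguins
toy <- data.frame(
  bill_depth_mm  = c(18.7, 17.4, 14.5, 15.0, 19.0, 18.2),
  bill_length_mm = c(39.1, 39.5, 46.1, 47.3, 50.0, 49.5),
  species        = c("Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap")
)

# Build the plot once and store it
p <- ggplot(toy, aes(x = bill_depth_mm, y = bill_length_mm, color = species)) +
  geom_point() +
  labs(x = "Bill Depth (mm)", y = "Bill Length (mm)", title = "Palmer Penguins")

# Add a different theme to the same stored plot
p_bw      <- p + theme_bw()
p_minimal <- p + theme_minimal()

# Printing a stored plot draws it
p_bw
```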
