Creating Magic with pixiedust

When David Robinson produced the broom package [1], he described it as an attempt to “[bridge] the gap from untidy outputs of predictions and estimations to create tidy data that is easy to manipulate with standard tools.” While broom’s vision was to use model outputs as data, his work had a happy side-effect of producing tabular output that was very near what many researchers wish to present as results. While the broom package assumes you want the model output for further analysis, the pixiedust package diverts from this assumption and provides you with the tools to customize that output into a fine looking table suitable for reports.

To illustrate the functionality of pixiedust, we will make use of a linear regression model based on the mtcars dataset. The model is defined:

fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear), 
          data = mtcars)

In base R, the model summary can be presented using the summary command, and produces output that is quasi tabular. While this summary contains many details of interest to the statistician, many of them are foreign to non-statistical audiences, and may intimidate some readers rather than inviting further reflection.

summary(fit)
## 
## Call:
## lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)
## 
## Residuals:
## Gas Mileage
##     Min      1Q  Median      3Q     Max 
## -3.5064 -1.5220 -0.7517  1.3841  4.6345 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.3650     8.3730   1.118  0.27359    
## qsec               1.2449     0.3828   3.252  0.00317 ** 
## factor(am)Manual   3.1505     1.9405   1.624  0.11654    
## wt                -3.9263     0.7428  -5.286 1.58e-05 ***
## factor(gear)4     -0.2682     1.6555  -0.162  0.87257    
## factor(gear)5     -0.2697     2.0632  -0.131  0.89698    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.55 on 26 degrees of freedom
## Multiple R-squared:  0.8498, Adjusted R-squared:  0.8209 
## F-statistic: 29.43 on 5 and 26 DF,  p-value: 6.379e-10

When broom was released, many undoubtedly recognized the potential to use the tidy output as executive summaries of the analyses. Surely, the output below is much more consumable for the lay audience than the output above.

broom::tidy(fit)
## # A tibble: 6 × 5
##   term             estimate std.error statistic   p.value
##   <chr>               <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)         9.37      8.37      1.12  0.274    
## 2 qsec                1.24      0.383     3.25  0.00317  
## 3 factor(am)Manual    3.15      1.94      1.62  0.117    
## 4 wt                 -3.93      0.743    -5.29  0.0000158
## 5 factor(gear)4      -0.268     1.66     -0.162 0.873    
## 6 factor(gear)5      -0.270     2.06     -0.131 0.897

Thanks to broom, the hardest part of generating the tabular output is already accomplished. However, there are still a few details to be dealt with, even with the tidy output. For instance, the numeric values have too many decimal places; the column names could be spruced up a little; and we may want to direct readers’ attention to certain parts of the table that are of particular interest. Adding pixiedust makes these customizations easier and uses the familiar strategy of ggplot2 where each new customization is added on top of the others.

The process of building these tables involves an initial dusting with the dust function, and then the addition of “sprinkles” to fine tune rows, columns, or even individual cells. The initial dusting creates a presentation very similar to the broom output.

library(pixiedust)
dust(fit)
term estimate std.error statistic p.value
(Intercept) 9.3650443 8.3730161 1.1184792 0.2735903
qsec 1.2449212 0.3828479 3.2517387 0.0031681
factor(am)Manual 3.1505178 1.9405171 1.6235455 0.1165367
wt -3.9263022 0.7427562 -5.2861251 1.58e-05
factor(gear)4 -0.268163 1.6554617 -0.1619868 0.8725685
factor(gear)5 -0.2697468 2.0631829 -0.130743 0.896985


Realistically, the dust output is very similar to the broom output. Some differences are that the broom output retains the class of the variables. term is a character vector, the other vectors are numeric. When this output is dusted, however, these are all turned into character values (but with a reference to its original class). Don’t panic, though. This isn’t a disadvantage, it’s the key feature of pixiedust. dust converts the broom output into a table where each cell in the table is represented by a row (Take a look at dust(fit)$body to see what I mean). This is the process by which we get control over every last detail of the table. By the time we’re done, we’ll easily produce tables that look like this:

Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


Okay, maybe not those exact colors. But you have to admit, they are very pixie like colors, are they not?

Formatting Cell Values

As we noted earlier, the default output of dust has far too many decimal places. In most cases, the decimal places returned probably exceed the accuracy of the values in the data. We can sprinkle the values with round or any other function to suit our needs. First, let’s take a look at the round sprinkle.

dust(fit) %>% 
  sprinkle(cols = c("estimate", "std.error", "statistic"), round = 2)
term estimate std.error statistic p.value
(Intercept) 9.37 8.37 1.12 0.2735903
qsec 1.24 0.38 3.25 0.0031681
factor(am)Manual 3.15 1.94 1.62 0.1165367
wt -3.93 0.74 -5.29 1.58e-05
factor(gear)4 -0.27 1.66 -0.16 0.8725685
factor(gear)5 -0.27 2.06 -0.13 0.896985


That already makes a big difference. We could have rounded the p-values as well, but we’ll do something different with those. We’ll use another function to format the p-values into strings. In the following code, we’ll pass a function call to the fn argument of sprinkle. There are two important aspects of this call to be aware of

  1. The function is wrapped in quote. sprinkles uses standard evaluation, and passing a function wrapped in quote allows us to delay its execution.
  2. The function pvalString is acting on value. The elements of the dust object are stored in a manner where each cell in the table is a row in a data frame, with the contents of the cell being stored as value. (Try running dust(fit)$body to explore the anatomy of the dust object). Any function you pass in the fn argument needs to act on value.
dust(fit) %>% 
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) 
term estimate std.error statistic p.value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


Columns Names

After formatting the cell values, the next thing we will likely want to change about our table is the column names. The names returned by broom are deliberately generic. In a conference call in July of 2015, a listener asked Robinson if using the column name statisic made sense for so many model types, since some were F statistics, some were t and still others were z. Robinson answered that broom's focus was not on the convenience of the reader, but on the convenience of the analyst being able to quickly and easily combine the output of several models. Having a generic name made it easier for the analyst.

For the reader, the table’s column names can be modified using the sprinkle_colnames function in pixiedust. The function only has a ... argument, and may accept either named or unnamed arguments. If the arguments are named, the name matches one of the column names in the broom output, and the argument value represents the name we wish to appear in print.

dust(fit) %>% 
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames(term = "Term", p.value = "P-value")
Term estimate std.error statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


Naming the arguments has advantages for reproducibility, as pixiedust will correctly assign the column names regardless of order.

dust(fit) %>% 
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames(term = "Term", p.value = "P-value", 
                    std.error = "SE", statistic = "T-statistic",
                    estimate = "Coefficient")
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


If all of the columns are to be renamed, we may forego naming the arguments so long as we are careful to provide the new names in the same order they appear in the table (from left to right). If the new names are provided in the wrong order, they will be applied to the table incorrectly. Thus, it is recommended to name the arguments.

dust(fit) %>% 
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value")
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


In the case that you provide a different number of arguments than there are columns in the table, an error is returned stating such.

dust(fit) %>% 
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value", "Extra Column Name")
## Error in `$<-.data.frame`(`*tmp*`, "value", value = c("Term", "Coefficient", : replacement has 6 rows, data has 5

Replacing Values in the Table

There may be times you wish to use different values in the table than what are provided by the broom output. Some examples may be using different standard errors from a ridge regression, or perhaps you prefer to display the variance inflation factors instead of the p-value. Values can be replaced using the replace sprinkle. In this example, we’ll replace the term column with names that are a bit more friendly to the reader.

dust(fit) %>% 
  sprinkle(cols = "term", 
           replace = c("Intercept", "Quarter Mile Time", "Automatic vs. Manual",
                       "Weight", "Gears: 4 vs. 3", "Gears: 5 vs 3")) %>%
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value")
Term Coefficient SE T-statistic P-value
Intercept 9.365 8.373 1.118 0.27
Quarter Mile Time 1.245 0.383 3.252 0.003
Automatic vs. Manual 3.151 1.941 1.624 0.12
Weight -3.926 0.743 -5.286 < 0.001
Gears: 4 vs. 3 -0.268 1.655 -0.162 0.87
Gears: 5 vs 3 -0.27 2.063 -0.131 0.9


Values are always replaced down the column before across the row. To illustrate, let’s replace the cells in rows 2 - 3 and columns 3 - 4 with the values 100, 200, 300, and 400. If we want the values to read in sequential order from left to right before going to the next line, we make the replacement call (we will also italicize these cells to make them easier to find)

dust(fit) %>% 
  sprinkle(rows = 2:3, cols = 3:4, 
           replace = c(100, 300, 200, 400),
           italic = TRUE) %>%
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value")
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 100 200 0.003
factor(am)Manual 3.151 300 400 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


Borders and Padding

For the duration of the vignette, we will use basetable as the basis of additional customizations where basetable is defined below. We are also moving out of the capabilities of the console, so we will switch over to HTML printing.

basetable <- dust(fit) %>% 
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>% 
  sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>% 
  sprinkle_colnames(term = "Term", estimate = "Coefficient", 
                    std.error = "SE", statistic = "T-statistic", 
                    p.value = "P-value") %>% 
  sprinkle_print_method("html")

For no good reason, let’s also focus on drawing attention to the statistically significant results. Using borders, we could accomplish this by drawing a border around each of those rows. There are five sprinkles related to borders.

  1. border controls on which sides of the cells the borders are drawn.
  2. border_thickness controls how thick the borders are.
  3. border_units controls the units of measure on the thickness.
  4. border_style controls the border style (solid or dashed, etc).
  5. border_color controls the color of the border.

All of these sprinkles have default values they can take, so unless we need to customize more than one sprinkle, we need only specify one of the five in order to get all of them to take effect.

basetable %>% 
  sprinkle(rows = c(2, 4), border_color = "orchid")
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


If we want to eliminate the borders between cells, we have to do a little more work.

basetable %>% 
  sprinkle(rows = c(2, 4), cols = 1, 
           border = c("left", "top", "bottom"),
           border_color = "orchid") %>% 
  sprinkle(rows = c(2, 4), cols = 5,
           border = c("right", "top", "bottom"),
           border_color = "orchid") %>%
  sprinkle(rows = c(2, 4), cols = 2:4,
           border = c("top", "bottom"),
           border_color = "orchid")
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


We can further separate these rows by adding more padding to the cells. In this example, for simplicity, we’ll allow the lines between cells.

basetable %>% 
  sprinkle(rows = c(2, 4), border_color = "orchid",
           pad = 15)
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


Bold and Italic Text

A more conventional way to draw attention to these rows would be to print them in bold text.

basetable %>% 
  sprinkle(rows = c(2, 4), bold = TRUE)
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


The text could also be italicized either separately or concurrently. He we show the italics printed concurrently.

basetable %>% 
  sprinkle(rows = c(2, 4), bold = TRUE, italic=TRUE)
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


Backgrounds

Backgrounds are added using the bg sprinkle, which accepts X11 colors, hexidecimal colors, rgb colors, and for HTML rgba colors (the a specifies the transparency). To put in a background in the rows showing statistical significance, we need only specify the color.

basetable %>% 
  sprinkle(rows = c(2, 4), bg = "orchid")
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


If we decide that color is a little bit strong, we can lighten it up a little with the transparency. We have to look up the rgb specification for the orchid color (there are lots of web resources for this; X11 Color Names on Wikipedia is a good place to start).

basetable %>% 
  sprinkle(rows = c(2, 4), bg = "rgba(218,112,214,.5)")
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


If we aren’t interested in coloring just those two rows, we can apply color to the entire table with the bg_pattern sprinkle. This sprinkle accepts as many colors as you want to cycle through.

basetable %>% 
  sprinkle(bg_pattern = c("orchid", "plum"))
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


Font Sizes and Colors

Font sizes and colors are modified with the font_size and font_color sprinkles. We’ll employ these simultaneously to highlight our significant rows.

basetable %>% 
  sprinkle(rows = c(2, 4), 
           font_color = "orchid",
           font_size = 24,
           font_size_units = "pt")
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


(Woah! That was a bit too much.)

Dimensions and Alignment

In addition to the sprinkles already discussed, we can also use sprinkles to change the height, width, and alignment of cells. For illustration, we’re going to use the first three rows of columns 2-4 to show a grid of all the combinations of alignments. This requires that each cell be modified individually, so bear with me…the code is a bit long.

basetable %>% 
  sprinkle(rows = 1, cols = 2, halign = "left", valign = "top", height = 50, width = 50) %>% 
  sprinkle(rows = 1, cols = 3, halign = "center", valign = "top", height = 50, width = 50) %>% 
  sprinkle(rows = 1, cols = 4, halign = "right", valign = "top", height = 50, width = 50) %>% 
  sprinkle(rows = 2, cols = 2, halign = "left", valign = "middle", height = 50, width = 50) %>% 
  sprinkle(rows = 2, cols = 3, halign = "center", valign = "middle", height = 50, width = 50) %>% 
  sprinkle(rows = 2, cols = 4, halign = "right", valign = "middle", height = 50, width = 50) %>% 
  sprinkle(rows = 3, cols = 2, halign = "left", valign = "bottom", height = 50, width = 50) %>% 
  sprinkle(rows = 3, cols = 3, halign = "center", valign = "bottom", height = 50, width = 50) %>% 
  sprinkle(rows = 3, cols = 4, halign = "right", valign = "bottom", height = 50, width = 50)
Term Coefficient SE T-statistic P-value
(Intercept) 9.365 8.373 1.118 0.27
qsec 1.245 0.383 3.252 0.003
factor(am)Manual 3.151 1.941 1.624 0.12
wt -3.926 0.743 -5.286 < 0.001
factor(gear)4 -0.268 1.655 -0.162 0.87
factor(gear)5 -0.27 2.063 -0.131 0.9


Rotating Text

There is a sprinkle available to rotate the text in a cell. I don’t recommend using it. Rotated text is harder to read, and communicating concepts is the whole point of the table. However, sometimes it might be necessary. For our example, we’ll use the first few rows of the mtcars data set.

Notice here that when I apply the rotation, I added an argument to sprinkle in which I denoted that the rotation should apply to the head of the table. The head and body of the table are stored separately in the dust object and all of the sprinkles may be applied to either part of the table.

dust(Formaldehyde) %>%
  sprinkle(cols = c("mpg", "disp", "drat", "qsec"),
           round = 2) %>%
  sprinkle(rows = 1,
           rotate_degree = -90,
           height = 60,
           part = "head") %>%
  sprinkle_print_method("html")
carb optden
0.1 0.086
0.3 0.269
0.5 0.446
0.6 0.538
0.7 0.626
0.9 0.782


References

  1. Robinson, David. “broom: An R Package for Converting Statistical Analysis Objects Into Tidy Data Frames,” Cornell University Library, https://arxiv.org/pdf/1412.3565v2.pdf.