pixiedust
When David Robinson produced the broom package [1], he described it as
an attempt to “[bridge] the gap from untidy outputs of predictions and
estimations to create tidy data that is easy to manipulate with standard
tools.” While broom’s vision was to use model outputs as data, his work
had the happy side effect of producing tabular output that was very near
what many researchers wish to present as results. While the broom
package assumes you want the model output for further analysis, the
pixiedust package departs from this assumption and provides you with the
tools to customize that output into a fine-looking table suitable for
reports.
To illustrate the functionality of pixiedust, we will make use of a
linear regression model based on the mtcars dataset. The model is
defined:
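The model-fitting code is not shown in this copy of the text. A minimal sketch that is consistent with the `Call:` line of the summary output might look like the following. The relabelling of `am` is an assumption inferred from the coefficient name `factor(am)Manual`, and `mtcars_labelled` is a name introduced here purely for illustration.

```r
# Assumption: am was relabelled so that the coefficient prints as
# factor(am)Manual rather than factor(am)1.
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)

# Formula taken from the Call shown in the summary output
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)
```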
In base R, the model summary can be presented using the summary
command, which produces output that is quasi-tabular. While this summary
contains many details of interest to the statistician, many of them are
foreign to non-statistical audiences and may intimidate some readers
rather than inviting further reflection.
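As a sketch, the base R summary can be produced as follows. The model is refit here so the snippet runs on its own; the `am` relabelling and the name `mtcars_labelled` are assumptions made for illustration.

```r
# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Print the quasi-tabular base R summary
summary(fit)
```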
##
## Call:
## lm(formula = mpg ~ qsec + factor(am) + wt + factor(gear), data = mtcars)
##
## Residuals:
## Gas Mileage
## Min 1Q Median 3Q Max
## -3.5064 -1.5220 -0.7517 1.3841 4.6345
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.3650 8.3730 1.118 0.27359
## qsec 1.2449 0.3828 3.252 0.00317 **
## factor(am)Manual 3.1505 1.9405 1.624 0.11654
## wt -3.9263 0.7428 -5.286 1.58e-05 ***
## factor(gear)4 -0.2682 1.6555 -0.162 0.87257
## factor(gear)5 -0.2697 2.0632 -0.131 0.89698
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.55 on 26 degrees of freedom
## Multiple R-squared: 0.8498, Adjusted R-squared: 0.8209
## F-statistic: 29.43 on 5 and 26 DF, p-value: 6.379e-10
When broom was released, many undoubtedly recognized the potential to
use the tidy output as executive summaries of the analyses. Surely, the
output below is much more consumable for the lay audience than the
output above.
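A tidy summary of this kind can be obtained with `broom::tidy`. The sketch below refits the model so it is self-contained; the `am` relabelling and `mtcars_labelled` are assumptions made for illustration.

```r
# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# One row per model term, one column per statistic
tidy_fit <- broom::tidy(fit)
tidy_fit
```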
## # A tibble: 6 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 9.37 8.37 1.12 0.274
## 2 qsec 1.24 0.383 3.25 0.00317
## 3 factor(am)Manual 3.15 1.94 1.62 0.117
## 4 wt -3.93 0.743 -5.29 0.0000158
## 5 factor(gear)4 -0.268 1.66 -0.162 0.873
## 6 factor(gear)5 -0.270 2.06 -0.131 0.897
Thanks to broom, the hardest part of generating the tabular output is
already accomplished. However, there are still a few details to be dealt
with, even with the tidy output. For instance, the numeric values have
too many decimal places; the column names could be spruced up a little;
and we may want to direct readers’ attention to certain parts of the
table that are of particular interest. Adding pixiedust makes these
customizations easier and uses the familiar strategy of ggplot2, where
each new customization is added on top of the others.
The process of building these tables involves an initial dusting with
the dust function, and then the addition of “sprinkles” to fine-tune
rows, columns, or even individual cells. The initial dusting creates a
presentation very similar to the broom output.
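The initial dusting is a single call. In this sketch the model is refit so the snippet stands alone; the `am` relabelling and the names `mtcars_labelled` and `dusted` are assumptions introduced for illustration.

```r
library(pixiedust)

# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Initial dusting; no sprinkles have been applied yet.
# Printing `dusted` renders the table.
dusted <- dust(fit)
```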
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.3650443 | 8.3730161 | 1.1184792 | 0.2735903 |
qsec | 1.2449212 | 0.3828479 | 3.2517387 | 0.0031681 |
factor(am)Manual | 3.1505178 | 1.9405171 | 1.6235455 | 0.1165367 |
wt | -3.9263022 | 0.7427562 | -5.2861251 | 1.58e-05 |
factor(gear)4 | -0.268163 | 1.6554617 | -0.1619868 | 0.8725685 |
factor(gear)5 | -0.2697468 | 2.0631829 | -0.130743 | 0.896985 |
Realistically, the dust output is very similar to the broom output. One
difference is that the broom output retains the class of the variables:
term is a character vector, and the other columns are numeric. When this
output is dusted, however, these are all turned into character values
(but with a reference to their original class). Don’t panic, though.
This isn’t a disadvantage; it’s the key feature of pixiedust. dust
converts the broom output into a table where each cell in the table is
represented by a row (take a look at dust(fit)$body to see what I mean).
This is the process by which we get control over every last detail of
the table. By the time we’re done, we’ll easily produce tables that look
like this:
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
Okay, maybe not those exact colors. But you have to admit, they are very pixie-like colors, are they not?
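To see the one-row-per-cell layout mentioned earlier, the body of the dust object can be inspected directly. This is only a sketch: the model is refit with an assumed `am` relabelling, and `cells` is a name introduced here. Only the `row`, `col`, and `value` columns are shown; the full data frame carries more columns than these.

```r
library(pixiedust)

# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Each cell of the table body is stored as one row of this data frame
cells <- dust(fit)$body
nrow(cells)
head(cells[, c("row", "col", "value")])
```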
As we noted earlier, the default output of dust has far too many decimal
places. In most cases, the decimal places returned probably exceed the
accuracy of the values in the data. We can sprinkle the values with
round or any other function to suit our needs. First, let’s take a look
at the round sprinkle.
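A sketch of the round sprinkle follows, rounding the three numeric columns to two decimal places and leaving the p-values alone. The model is refit with an assumed `am` relabelling, and `rounded` is a name introduced for illustration.

```r
library(pixiedust)

# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Round estimate, std.error and statistic; p.value is left untouched.
# Printing `rounded` renders the table.
rounded <- dust(fit) %>%
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 2)
```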
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.37 | 8.37 | 1.12 | 0.2735903 |
qsec | 1.24 | 0.38 | 3.25 | 0.0031681 |
factor(am)Manual | 3.15 | 1.94 | 1.62 | 0.1165367 |
wt | -3.93 | 0.74 | -5.29 | 1.58e-05 |
factor(gear)4 | -0.27 | 1.66 | -0.16 | 0.8725685 |
factor(gear)5 | -0.27 | 2.06 | -0.13 | 0.896985 |
That already makes a big difference. We could have rounded the
p-values as well, but we’ll do something different with those. We’ll use
another function to format the p-values into strings. In the following
code, we’ll pass a function call to the fn argument of sprinkle. There
are two important aspects of this call to be aware of:
1. The use of quote. sprinkle uses standard evaluation, and passing a
   function wrapped in quote allows us to delay its execution.
2. pvalString is acting on value. The elements of the dust object are
   stored in a manner where each cell in the table is a row in a data
   frame, with the contents of the cell being stored as value. (Try
   running dust(fit)$body to explore the anatomy of the dust object.)
   Any function you pass in the fn argument needs to act on value.
dust(fit) %>%
  sprinkle(cols = c("estimate", "std.error", "statistic"),
           round = 3) %>%
  sprinkle(cols = "p.value", fn = quote(pvalString(value)))
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
After formatting the cell values, the next thing we will likely want
to change about our table is the column names. The names returned by
broom are deliberately generic. In a conference call in July of 2015, a
listener asked Robinson if using the column name statistic made sense
for so many model types, since some were F statistics, some were t, and
still others were z. Robinson answered that broom's focus was not on the
convenience of the reader, but on the convenience of the analyst being
able to quickly and easily combine the output of several models. Having
a generic name made it easier for the analyst.
For the reader, the table’s column names can be modified using the
sprinkle_colnames function in pixiedust. The function only has a ...
argument, and may accept either named or unnamed arguments. If the
arguments are named, each name matches one of the column names in the
broom output, and the argument value represents the name we wish to
appear in print.
dust(fit) %>%
sprinkle(cols = c("estimate", "std.error", "statistic"),
round = 3) %>%
sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>%
sprinkle_colnames(term = "Term", p.value = "P-value")
Term | estimate | std.error | statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
Naming the arguments has advantages for reproducibility, as pixiedust
will correctly assign the column names regardless of order.
dust(fit) %>%
sprinkle(cols = c("estimate", "std.error", "statistic"),
round = 3) %>%
sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>%
sprinkle_colnames(term = "Term", p.value = "P-value",
std.error = "SE", statistic = "T-statistic",
estimate = "Coefficient")
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
If all of the columns are to be renamed, we may forego naming the arguments so long as we are careful to provide the new names in the same order they appear in the table (from left to right). If the new names are provided in the wrong order, they will be applied to the table incorrectly. Thus, it is recommended to name the arguments.
dust(fit) %>%
sprinkle(cols = c("estimate", "std.error", "statistic"),
round = 3) %>%
sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>%
sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value")
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
In the case that you provide a different number of arguments than there are columns in the table, an error is returned stating such.
dust(fit) %>%
sprinkle(cols = c("estimate", "std.error", "statistic"),
round = 3) %>%
sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>%
sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value", "Extra Column Name")
## Error in `$<-.data.frame`(`*tmp*`, "value", value = c("Term", "Coefficient", : replacement has 6 rows, data has 5
There may be times you wish to use different values in the table than
what are provided by the broom output. Some examples may be using
different standard errors from a ridge regression, or perhaps you prefer
to display the variance inflation factors instead of the p-value. Values
can be replaced using the replace sprinkle. In this example, we’ll
replace the term column with names that are a bit more friendly to the
reader.
dust(fit) %>%
sprinkle(cols = "term",
replace = c("Intercept", "Quarter Mile Time", "Automatic vs. Manual",
"Weight", "Gears: 4 vs. 3", "Gears: 5 vs 3")) %>%
sprinkle(cols = c("estimate", "std.error", "statistic"),
round = 3) %>%
sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>%
sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value")
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
Intercept | 9.365 | 8.373 | 1.118 | 0.27 |
Quarter Mile Time | 1.245 | 0.383 | 3.252 | 0.003 |
Automatic vs. Manual | 3.151 | 1.941 | 1.624 | 0.12 |
Weight | -3.926 | 0.743 | -5.286 | < 0.001 |
Gears: 4 vs. 3 | -0.268 | 1.655 | -0.162 | 0.87 |
Gears: 5 vs 3 | -0.27 | 2.063 | -0.131 | 0.9 |
Values are always replaced down the column before across the row. To illustrate, let’s replace the cells in rows 2 - 3 and columns 3 - 4 with the values 100, 200, 300, and 400. If we want the values to read in sequential order from left to right before going to the next line, we must supply them column by column, as c(100, 300, 200, 400), in the replacement call (we will also italicize these cells to make them easier to find):
dust(fit) %>%
sprinkle(rows = 2:3, cols = 3:4,
replace = c(100, 300, 200, 400),
italic = TRUE) %>%
sprinkle(cols = c("estimate", "std.error", "statistic"),
round = 3) %>%
sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>%
sprinkle_colnames("Term", "Coefficient", "SE", "T-statistic", "P-value")
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 100 | 200 | 0.003 |
factor(am)Manual | 3.151 | 300 | 400 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
For the duration of the vignette, we will use basetable as the basis of
additional customizations, where basetable is defined below. We are also
moving out of the capabilities of the console, so we will switch over to
HTML printing.
basetable <- dust(fit) %>%
sprinkle(cols = c("estimate", "std.error", "statistic"),
round = 3) %>%
sprinkle(cols = "p.value", fn = quote(pvalString(value))) %>%
sprinkle_colnames(term = "Term", estimate = "Coefficient",
std.error = "SE", statistic = "T-statistic",
p.value = "P-value") %>%
sprinkle_print_method("html")
For no good reason, let’s also focus on drawing attention to the statistically significant results. Using borders, we could accomplish this by drawing a border around each of those rows. There are five sprinkles related to borders:
- border controls on which sides of the cells the borders are drawn.
- border_thickness controls how thick the borders are.
- border_units controls the units of measure on the thickness.
- border_style controls the border style (solid or dashed, etc).
- border_color controls the color of the border.
All of these sprinkles have default values they can take, so unless we need to customize more than one sprinkle, we need only specify one of the five in order to get all of them to take effect.
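As a sketch of specifying only one of the five border sprinkles, the snippet below sets just border_color on rows 2 and 4 (qsec and wt) and leaves the other four at their defaults. To keep it self-contained, the model is refit with an assumed `am` relabelling, and `basetable` here is a simplified stand-in that omits the formatting sprinkles; `bordered` is a name introduced for illustration.

```r
library(pixiedust)

# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Simplified stand-in for basetable (formatting sprinkles omitted)
basetable <- dust(fit) %>%
  sprinkle_print_method("html")

# Only border_color is given; the remaining border sprinkles use defaults
bordered <- basetable %>%
  sprinkle(rows = c(2, 4), border_color = "orchid")
```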
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
If we want to eliminate the borders between cells, we have to do a little more work.
basetable %>%
sprinkle(rows = c(2, 4), cols = 1,
border = c("left", "top", "bottom"),
border_color = "orchid") %>%
sprinkle(rows = c(2, 4), cols = 5,
border = c("right", "top", "bottom"),
border_color = "orchid") %>%
sprinkle(rows = c(2, 4), cols = 2:4,
border = c("top", "bottom"),
border_color = "orchid")
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
We can further separate these rows by adding more padding to the cells. In this example, for simplicity, we’ll allow the lines between cells.
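One way to sketch this uses the pad sprinkle together with a border color. The padding value of 15 is an arbitrary choice for illustration, `basetable` is a simplified stand-in without the formatting sprinkles, and the `am` relabelling is an assumption.

```r
library(pixiedust)

# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Simplified stand-in for basetable (formatting sprinkles omitted)
basetable <- dust(fit) %>%
  sprinkle_print_method("html")

# Border on all sides of each cell in rows 2 and 4, plus extra padding
padded <- basetable %>%
  sprinkle(rows = c(2, 4),
           border_color = "orchid",
           pad = 15)
```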
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
A more conventional way to draw attention to these rows would be to print them in bold text.
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
The text could also be italicized either separately or concurrently. Here we show the italics printed concurrently.
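A sketch of the bold and italic sprinkles applied together follows; dropping `italic = TRUE` gives the bold-only version. `basetable` here is a simplified stand-in without the formatting sprinkles, and the `am` relabelling is an assumption.

```r
library(pixiedust)

# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Simplified stand-in for basetable (formatting sprinkles omitted)
basetable <- dust(fit) %>%
  sprinkle_print_method("html")

# Bold and italic text in the two significant rows
emphasized <- basetable %>%
  sprinkle(rows = c(2, 4),
           bold = TRUE,
           italic = TRUE)
```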
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
Backgrounds are added using the bg sprinkle, which accepts X11 colors,
hexadecimal colors, rgb colors, and, for HTML, rgba colors (the a
specifies the transparency). To put a background in the rows showing
statistical significance, we need only specify the color.
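A sketch of the bg sprinkle with a named color follows. `basetable` here is a simplified stand-in without the formatting sprinkles, the `am` relabelling is an assumption, and `shaded` is a name introduced for illustration.

```r
library(pixiedust)

# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Simplified stand-in for basetable (formatting sprinkles omitted)
basetable <- dust(fit) %>%
  sprinkle_print_method("html")

# Orchid background behind the two significant rows
shaded <- basetable %>%
  sprinkle(rows = c(2, 4), bg = "orchid")
```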
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
If we decide that color is a little bit strong, we can lighten it up a little with the transparency. We have to look up the rgb specification for the orchid color (there are lots of web resources for this; X11 Color Names on Wikipedia is a good place to start).
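Orchid corresponds to rgb(218, 112, 214), so a half-transparent version might be sketched as below. The alpha of .5 is an arbitrary choice, the exact rgba string formats accepted may depend on the pixiedust version, `basetable` is a simplified stand-in without the formatting sprinkles, and the `am` relabelling is an assumption.

```r
library(pixiedust)

# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Simplified stand-in for basetable (formatting sprinkles omitted)
basetable <- dust(fit) %>%
  sprinkle_print_method("html")

# Orchid (218, 112, 214) at 50% opacity; alpha value chosen for illustration
lightly_shaded <- basetable %>%
  sprinkle(rows = c(2, 4), bg = "rgba(218,112,214,.5)")
```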
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
If we aren’t interested in coloring just those two rows, we can apply
color to the entire table with the bg_pattern sprinkle. This sprinkle
accepts as many colors as you want to cycle through.
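A sketch of bg_pattern cycling through two colors follows. The second color, plum, is chosen here only for illustration; `basetable` is a simplified stand-in without the formatting sprinkles, and the `am` relabelling is an assumption.

```r
library(pixiedust)

# Assumption: am relabelled to match the factor(am)Manual coefficient name
mtcars_labelled <- transform(
  mtcars,
  am = factor(am, labels = c("Automatic", "Manual"))
)
fit <- lm(mpg ~ qsec + factor(am) + wt + factor(gear),
          data = mtcars_labelled)

# Simplified stand-in for basetable (formatting sprinkles omitted)
basetable <- dust(fit) %>%
  sprinkle_print_method("html")

# Alternate the two background colors across the table
striped <- basetable %>%
  sprinkle(bg_pattern = c("orchid", "plum"))
```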
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
Font sizes and colors are modified with the font_size and font_color
sprinkles. We’ll employ these simultaneously to highlight our
significant rows.
basetable %>%
sprinkle(rows = c(2, 4),
font_color = "orchid",
font_size = 24,
font_size_units = "pt")
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
(Woah! That was a bit too much.)
In addition to the sprinkles already discussed, we can also use sprinkles to change the height, width, and alignment of cells. For illustration, we’re going to use the first three rows of columns 2-4 to show a grid of all the combinations of alignments. This requires that each cell be modified individually, so bear with me…the code is a bit long.
basetable %>%
sprinkle(rows = 1, cols = 2, halign = "left", valign = "top", height = 50, width = 50) %>%
sprinkle(rows = 1, cols = 3, halign = "center", valign = "top", height = 50, width = 50) %>%
sprinkle(rows = 1, cols = 4, halign = "right", valign = "top", height = 50, width = 50) %>%
sprinkle(rows = 2, cols = 2, halign = "left", valign = "middle", height = 50, width = 50) %>%
sprinkle(rows = 2, cols = 3, halign = "center", valign = "middle", height = 50, width = 50) %>%
sprinkle(rows = 2, cols = 4, halign = "right", valign = "middle", height = 50, width = 50) %>%
sprinkle(rows = 3, cols = 2, halign = "left", valign = "bottom", height = 50, width = 50) %>%
sprinkle(rows = 3, cols = 3, halign = "center", valign = "bottom", height = 50, width = 50) %>%
sprinkle(rows = 3, cols = 4, halign = "right", valign = "bottom", height = 50, width = 50)
Term | Coefficient | SE | T-statistic | P-value |
---|---|---|---|---|
(Intercept) | 9.365 | 8.373 | 1.118 | 0.27 |
qsec | 1.245 | 0.383 | 3.252 | 0.003 |
factor(am)Manual | 3.151 | 1.941 | 1.624 | 0.12 |
wt | -3.926 | 0.743 | -5.286 | < 0.001 |
factor(gear)4 | -0.268 | 1.655 | -0.162 | 0.87 |
factor(gear)5 | -0.27 | 2.063 | -0.131 | 0.9 |
There is a sprinkle available to rotate the text in a cell. I don’t
recommend using it. Rotated text is harder to read, and communicating
concepts is the whole point of the table. However, sometimes it might be
necessary. For our example, we’ll use the Formaldehyde data set.
Notice here that when I apply the rotation, I add an argument to
sprinkle denoting that the rotation should apply to the head of the
table. The head and body of the table are stored separately in the dust
object, and all of the sprinkles may be applied to either part of the
table.
dust(Formaldehyde) %>%
  sprinkle(cols = c("carb", "optden"),
           round = 3) %>%
sprinkle(rows = 1,
rotate_degree = -90,
height = 60,
part = "head") %>%
sprinkle_print_method("html")
carb | optden |
---|---|
0.1 | 0.086 |
0.3 | 0.269 |
0.5 | 0.446 |
0.6 | 0.538 |
0.7 | 0.626 |
0.9 | 0.782 |
[1] David Robinson, “broom: An R Package for Converting Statistical
Analysis Objects Into Tidy Data Frames,” Cornell University Library,
https://arxiv.org/pdf/1412.3565v2.pdf.