Why ggvis?
ggvis is a data visualization package that builds data graphics with a syntax similar to ggplot2 and creates rich interactive plots in the style of shiny. Because the syntax is highly structured, it is easy to learn and to use.
Basic ggvis Syntax
Basic Components
ggvis recreates the grammar of graphics. The key syntax chains a dataset, a ggvis() call and one or more layer_<marks>() calls with the %>% operator.
A ggvis graphic is built from 4 components:
$$ \text{Graphic} = \text{Data} + \text{CoordinateSystem} + \text{Properties} + \text{Marks} $$
For example, using the built-in dataset mtcars:
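As a minimal sketch of the four components in practice (the choice of wt, mpg and the fill color is an illustrative assumption, not necessarily the original example):

```r
library(ggvis)

# Data: mtcars
# Coordinates: ~wt on x, ~mpg on y
# Properties: a fixed fill color, set with :=
# Marks: points
p_basic <- mtcars %>%
  ggvis(~wt, ~mpg, fill := "steelblue") %>%
  layer_points()

# Type p_basic at the console to render the plot.
```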
Note that the coordinates and properties can be moved into layer_<marks>(), and that ggvis() can generate a plot without layer_<marks>(). Both points are covered in detail in the following parts.
Global vs. Local Declaration & Multiple Layers
ggvis allows multiple layers to be overlaid. When you put the coordinates and properties in ggvis(), you declare them globally: they are shared by all the layer_<marks>() calls that follow.
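A global declaration along these lines could look as follows. This is a sketch that uses the properties named in the text; the exact original chunk is assumed.

```r
library(ggvis)

# ~hp, ~mpg and stroke are declared once in ggvis(),
# so both layers inherit them.
p_global <- mtcars %>%
  ggvis(~hp, ~mpg, stroke := "blue") %>%
  layer_points() %>%
  layer_smooths()
```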
We specify ~hp, ~mpg and stroke := "blue" in ggvis(), so they apply to all the layers: on layer_points() stroke sets the border color of the points, and on layer_smooths() it sets the color of the smooth line. By default the fill color of points is black.
To declare properties locally, we put them in each layer:
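A local declaration might be sketched like this; the specific colors are assumptions chosen only to make the two layers distinguishable.

```r
library(ggvis)

# Coordinates stay global; stroke and fill are set per layer.
p_local <- mtcars %>%
  ggvis(~hp, ~mpg) %>%
  layer_points(stroke := "blue", fill := "orange") %>%
  layer_smooths(stroke := "red")
```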
The properties are declared locally: both layers use stroke, but it is set separately for each of the 2 layers, and fill has no impact on layer_smooths(). Why do we keep ~hp, ~mpg in ggvis()? Because both layers share them, declaring them once in ggvis() avoids repeating them in each layer.
Assignment Symbols = & :=
The most important symbols are = and :=. You can think of them as “mapping” and “setting”.
There are 2 spaces when plotting something: a data space and a visualization space. Colors, for example, are expressed in the visualization space as HTML color codes, RGB color codes, etc. If you provide a variable for fill using = (normally followed by a tilde ~, which makes ggvis treat it as a variable), ggvis first maps the variable's values onto a color scale before plotting.
If you pass a quoted string directly with :=, ggvis reads it as a raw value.
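The two cases can be put side by side in a short sketch (the variables and the color are illustrative assumptions):

```r
library(ggvis)

# Mapping: = with ~ sends the values of cyl through a fill scale.
p_mapped <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~cyl) %>%
  layer_points()

# Setting: := passes the raw value "red" straight to the visual space.
p_set <- mtcars %>%
  ggvis(~wt, ~mpg, fill := "red") %>%
  layer_points()
```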
The distinction between setting and mapping only applies to properties, not to parameters. For a parameter, you simply use = followed by the value.
Besides, %>% comes from the magrittr package and is widely used in dplyr. It is the chaining operator and makes the code more readable.
Layers & Properties
If you don’t specify a layer type, ggvis will use layer_guess() to pick one based on the data (see ?ggvis::layer_guess). Besides the magic layer_guess(), I’d strongly recommend learning the more specific layers. So far we have only shown 2 kinds of properties. The basic layers and properties are listed below (column names are layer_<marks> functions, row names are properties):
layer_ | bars | boxplots | densities | freqpolys | histograms | lines | paths | points |
---|---|---|---|---|---|---|---|---|
x / x2 | O | O | O | O | O | O | O | O |
y / y2 | O | O | O | O | O | O | O | O |
width | O | O | X | O | O | X | X | O |
opacity | O | O | O | O | O | O | O | O |
fill | O | O | O | O | O | O | O | O |
fillOpacity | O | O | O | O | O | O | O | O |
stroke | O | O | O | O | O | O | O | O |
strokeWidth | O | O | O | O | O | O | O | O |
strokeOpacity | O | O | O | O | O | O | O | O |
size | X | O | X | X | X | X | X | O |
shape | X | X | X | X | X | X | X | O |
Where O means supported and X means not supported.
Barchart, Histogram & Frequency Polygon
A bar chart shows counts at each unique x value, in contrast to a histogram, which bins values along x ranges. Bar charts and histograms both have a width argument. However, in the former it is the column width in graphical space, while in the latter it is the bin width used to group the continuous data on the x-axis.
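Three plots of this kind could be sketched as follows. The variables and width values are assumptions; only the use of fill and fillOpacity in the third plot is taken from the text.

```r
library(ggvis)

# Bar chart: one bar per unique value of cyl.
p_bars <- mtcars %>%
  ggvis(~cyl) %>%
  layer_bars(width = 0.5)

# Histogram: mpg is grouped into bins of width 5.
p_hist <- mtcars %>%
  ggvis(~mpg) %>%
  layer_histograms(width = 5)

# Frequency polygon: same binning, drawn as a line with a translucent fill.
p_freq <- mtcars %>%
  ggvis(~mpg) %>%
  layer_freqpolys(width = 5, fill := "red", fillOpacity := 0.3)
```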
A frequency polygon treats continuous data with the same logic as a histogram but uses a line to describe how the frequency changes across ranges. Notice that I use fillOpacity instead of opacity in the third plot, so the transparency is not applied to the stroke. By default nothing is filled under the curve of a frequency polygon; since we specify a fill, the region is filled with transparent red.
Boxplot
width is also a parameter of layer_boxplots(); the default value is 0.9. This parameter specifies the width of the boxes and therefore the spacing between groups. Besides the usual properties fill, stroke, etc., you can set the size of the outliers.
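A minimal boxplot sketch, assuming cyl as the grouping variable and an arbitrary fill color:

```r
library(ggvis)

# One box per cylinder count; width narrows the boxes from the 0.9 default.
p_box <- mtcars %>%
  ggvis(~factor(cyl), ~mpg) %>%
  layer_boxplots(width = 0.5, fill := "lightblue")
```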
Currently layer_boxplots() seems to have a small problem in ggvis (version 0.4.2) when we modify the value of size: the whiskers move but the boxes don’t. Waiting for a package update.
Density Plot
Density plots provide another way to display the distribution of a single variable. A density plot uses a line to display the density of a variable at each point in its range. You can think of a density plot as a continuous version of a histogram with a different y scale (although this is not exactly accurate).
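A density plot with the shaded area switched off might look like this (mpg is an assumed variable; fill and area follow the description in the text):

```r
library(ggvis)

# area = FALSE suppresses the shaded region, so fill has nothing to color.
p_density <- mtcars %>%
  ggvis(~mpg) %>%
  layer_densities(fill := "red", area = FALSE)
```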
You can use the area parameter to decide whether a shaded region is drawn under the curve. In the chunk above, even though “red” is assigned to fill, nothing is drawn under the density curve (the default setting is to draw a grey shaded area).
Lines & Paths
First, we compare these 2 layers:
The output of layer_paths() looks chaotic, but it is not: it connects the points in the order of the records, from the very first to the last. Let’s reorder the dataset.
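The comparison and the reordering can be sketched as follows, assuming wt and mpg as coordinates and base R order() for the sorting:

```r
library(ggvis)

# layer_lines() sorts by x before connecting; layer_paths() keeps row order.
p_unsorted <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_lines(stroke := "blue") %>%
  layer_paths(stroke := "red")

# After sorting the rows by wt, the path follows the same order as the line.
mtcars_sorted <- mtcars[order(mtcars$wt), ]
p_sorted <- mtcars_sorted %>%
  ggvis(~wt, ~mpg) %>%
  layer_paths(stroke := "red")
```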
Now layer_paths() follows the same trend as layer_lines(). layer_paths() is most useful for geographical plots.
Line and path plots can also use the fill property.
Scatterplot
We’ve used layer_points() many times already. Note that there are several values for the shape of points: "circle" (default), "square", "diamond", "cross", "triangle-up" and "triangle-down".
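A scatterplot that varies the point shape by group could be sketched as below; mapping both fill and shape to factor(cyl) is an assumption.

```r
library(ggvis)

# factor(cyl) makes cyl categorical, so each group gets its own shape and color.
p_scatter <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl), shape = ~factor(cyl)) %>%
  layer_points()
```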
factor() converts cyl from numeric to categorical data, which makes the plot clearer.
Special Layers: Model Prediction & Smooths
layer_model_predictions() fits a model to the data and draws it with layer_paths() and, optionally, layer_ribbons(). layer_smooths() is a special case of layering model predictions where the model is a smooth loess curve whose smoothness is controlled by the span parameter. Both use the same properties as layer_paths() and layer_lines().
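Three variants of these layers, as a sketch under assumed settings (the linear model, the span value and the quadratic formula are illustrative choices):

```r
library(ggvis)

base <- mtcars %>% ggvis(~wt, ~mpg) %>% layer_points()

# Linear model with a standard-error ribbon.
p_lm <- base %>% layer_model_predictions(model = "lm", se = TRUE)

# Loess smooth; a smaller span gives a wigglier curve.
p_loess <- base %>% layer_smooths(span = 0.5, se = TRUE)

# Explicit formula instead of the one guessed from x and y.
p_poly <- base %>%
  layer_model_predictions(model = "lm", formula = mpg ~ poly(wt, 2))
```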
If you don’t specify the formula argument in layer_model_predictions(), ggvis will guess it from the x and y properties declared globally in ggvis().
There are more properties and layer_<marks>() functions; you can explore them by typing ?ggvis::marks and ??ggvis::layer_.
Layers Equivalence
Model Prediction, Smooths & Densities
Some layers can be reproduced in another way. Consider how layer_model_predictions() is processed: 1. Estimate the model based on the dataset; 2. Compute the predicted data; 3. Plot the predicted line.
In ggvis, compute_model_prediction() performs the first 2 steps.
The function returns 2 columns, pred_ and resp_.
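The two-step version can be sketched like this, assuming a linear model of mpg on wt:

```r
library(ggvis)

# Steps 1 and 2: fit the model and compute the predictions.
pred <- mtcars %>% compute_model_prediction(mpg ~ wt, model = "lm")
head(pred)

# Step 3: draw the predicted values as a path.
p_pred <- pred %>%
  ggvis(~pred_, ~resp_) %>%
  layer_paths()
```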
The chunk returns the same line as layer_model_predictions() did before. layer_smooths() and layer_densities() can be split into 2 steps in the same way.
The correspondences are listed below:
High-level layer_ | compute_ | Low-level layer_ |
---|---|---|
model_predictions | model_prediction | paths |
smooths | smooth | paths |
histograms | bin | rects |
densities | density | lines |
bars | count/tabulate/stack/align | rects |
boxplots | boxplot/stack | rects |
Equivalence of Histograms
compute_bin() returns 5 columns, where count_ is the count in each interval, and xmin_ and xmax_ are the left and right borders of the interval. We can use these 3 columns to plot the rectangles for the intervals.
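A sketch of the histogram rebuilt from rectangles, assuming mpg with a bin width of 5:

```r
library(ggvis)

# Step 1: bin the data.
binned <- mtcars %>% compute_bin(~mpg, width = 5)
binned

# Step 2: one rectangle per bin, from 0 up to the count.
p_rects <- binned %>%
  ggvis(x = ~xmin_, x2 = ~xmax_, y = 0, y2 = ~count_) %>%
  layer_rects()
```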
Equivalence of Density Plot
compute_density() returns pred_ and resp_; connecting them with layer_lines() gives exactly the same density line.
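As a sketch, again assuming mpg as the variable:

```r
library(ggvis)

# Step 1: estimate the density on a grid of x values.
dens <- mtcars %>% compute_density(~mpg)

# Step 2: connect the grid points with a line.
p_dens_line <- dens %>%
  ggvis(~pred_, ~resp_) %>%
  layer_lines()
```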
Equivalence of Barchart
compute_count() and compute_tabulate() return only 2 columns: x_, the level of the data, and count_, the corresponding count. compute_align() then sets up the lower and upper bounds of each interval.
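A bar chart assembled from these pieces might look as follows; cyl and length = 0.9 are assumed values.

```r
library(ggvis)

# Count the rows per value of cyl, then add xmin_ and xmax_ around each x_.
counted <- mtcars %>%
  compute_count(~cyl) %>%
  compute_align(~x_, length = 0.9)
counted

# Draw one rectangle per value.
p_bar_rects <- counted %>%
  ggvis(x = ~xmin_, x2 = ~xmax_, y = 0, y2 = ~count_) %>%
  layer_rects()
```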
In compute_align(), length sets the length of an interval. It is equivalent to width in layer_bars().
How does compute_stack() work?
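To inspect what compute_stack() produces, a small sketch (the helper column count and the grouping by cyl are assumptions):

```r
library(ggvis)

# Give every row a height of 1, then stack those heights within each cyl value.
stacked <- mtcars %>%
  cbind(count = 1) %>%
  compute_stack(~count, ~cyl)

head(stacked[, c("cyl", "count", "stack_lwr_", "stack_upr_")])
```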
compute_stack() generates 3 new variables from the data source: group__, stack_upr_ and stack_lwr_. The latter 2 variables give the upper and lower y coordinates of each stack.
Try another classic stacked bar chart:
Building a layer step by step helps you understand how the high-level layer is generated, and gives you a way to grab the key data for plotting.
Working with dplyr Package
ggvis works well together with the dplyr package. For example, to recreate the plot above:
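One way a stacked bar chart could be built with dplyr verbs is sketched below. The grouping by am and the x variable cyl are assumptions, not necessarily the variables of the original plot.

```r
library(ggvis)
library(dplyr)

# Convert to factors, group by am, and let layer_bars() stack the groups.
p_stacked <- mtcars %>%
  mutate(cyl = factor(cyl), am = factor(am)) %>%
  group_by(am) %>%
  ggvis(x = ~cyl, fill = ~am) %>%
  layer_bars()
```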
group_by() also works with multiple variables.
Interactive Output
In this article, all interactivity is disabled; only screenshots are provided.
ggvis enables interactive HTML output through a series of input_<form_controls>() functions. There are 7 input widgets: input_checkbox() creates an interactive checkbox; input_select(), input_checkboxgroup() and input_radiobuttons() create interactive controls to select one or more options from a list; input_slider() creates an interactive slider; input_numeric() and input_text() create an interactive numeric or text input box.
Single Checkbox
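A single checkbox could be wired to a layer parameter as in this sketch (toggling the se band of a smooth is an assumed example):

```r
library(ggvis)

# The checkbox controls whether the confidence band is drawn.
p_checkbox <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_smooths(se = input_checkbox(label = "Confidence interval", value = TRUE))

# The control only responds in a live R session, not in static HTML.
```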
You can also split the plotting steps if the input clause is too long.
Selection from A List
input_select() and input_checkboxgroup() allow multiple choices (for input_select(), with multiple = TRUE). input_radiobuttons() allows only one option to be picked from the list.
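Two list-based controls, as a sketch with assumed labels and choices: the first sets a raw color with :=, the second maps a chosen variable name with = and map.

```r
library(ggvis)

# Radio buttons: pick exactly one raw color value.
p_radio <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points(fill := input_radiobuttons(
    choices = c("Red" = "red", "Green" = "green", "Blue" = "blue"),
    label = "Colors", selected = "red"))

# Select box: pick a variable; as.name turns the chosen string into a variable name.
p_select <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points(fill = input_select(
    choices = names(mtcars), label = "Fill variable", map = as.name))
```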
The map argument should be a function of one single argument that returns a modified value. When you map a variable name to a property, remember to use = instead of :=.
Slider
A slider is useful for controlling a continuous argument, for example the binwidth of a histogram:
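A slider-controlled histogram might be sketched like this; the range 1 to 10 and the starting value are assumptions.

```r
library(ggvis)

# The slider value is passed to the width parameter, so plain = is used.
p_slider <- mtcars %>%
  ggvis(~mpg) %>%
  layer_histograms(width = input_slider(1, 10, value = 2, step = 1,
                                        label = "Bin width"))
```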
Numeric & Text Input Box
To control the binwidth, you could also enter a value directly with input_numeric().
To control the fill color, you could use input_text().
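Both input boxes in a short sketch (starting values and labels are assumptions):

```r
library(ggvis)

# Numeric box: type the bin width directly.
p_numeric <- mtcars %>%
  ggvis(~mpg) %>%
  layer_histograms(width = input_numeric(value = 2, label = "Bin width"))

# Text box: type a color name; := because the text is a raw visual value.
p_text <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points(fill := input_text(value = "red", label = "Fill color"))
```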
Axes, Legends & Scales
Axes
We use add_axis() to adjust the axes.
The first argument specifies the horizontal or vertical axis. Use title to name the axes. You can add more details:
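A basic and a more detailed axis setup could be sketched as follows; the titles and the extra arguments (orient, ticks, title_offset, subdivide) are illustrative choices.

```r
library(ggvis)

base <- mtcars %>% ggvis(~wt, ~mpg) %>% layer_points()

# Titles only.
p_axis_basic <- base %>%
  add_axis("x", title = "Weight") %>%
  add_axis("y", title = "Miles per gallon")

# More detail: axis on top, approximate tick count, title offset, minor ticks.
p_axis_detail <- base %>%
  add_axis("x", title = "Weight", orient = "top", ticks = 10) %>%
  add_axis("y", title = "Miles per gallon", title_offset = 50, subdivide = 1)
```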
Compare the 2 plots carefully and you will see what those arguments do.
Legends
add_legend() works similarly to add_axis(), except that it alters the legend of a plot. Instead of specifying which axis to change, you specify the property you want to add to the legend. For example:
ggvis will create a separate legend for each property that you use. To combine them into a single legend, feed add_legend() a vector of property names as its first argument. The code below creates a combined legend for 3 properties: fill, shape and size.
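As a sketch of both uses of add_legend(). For brevity this assumed example combines only fill and shape rather than all 3 properties.

```r
library(ggvis)

# A titled legend for a single property.
p_legend <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl)) %>%
  layer_points() %>%
  add_legend("fill", title = "Cylinders")

# One combined legend for two properties mapped to the same variable.
p_legend_combined <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl), shape = ~factor(cyl)) %>%
  layer_points() %>%
  add_legend(c("fill", "shape"))
```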
Scales
ggvis provides several different functions for creating scales: scale_datetime(), scale_logical(), scale_nominal(), scale_numeric(), scale_singular(). Each maps a different type of data input to the visual properties that ggvis uses.
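A numeric color scale of this kind could be sketched as below; the coordinates and strokeWidth are assumptions, while the color ranges follow the text.

```r
library(ggvis)

# disp is mapped to both fill and stroke, each with its own color range.
p_scale_num <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~disp, stroke = ~disp, strokeWidth := 2) %>%
  layer_points() %>%
  scale_numeric("fill", range = c("red", "yellow")) %>%
  scale_numeric("stroke", range = c("darkred", "orange"))
```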
The chunk above maps the value of disp onto a scale range between red and yellow for the fill color, and between darkred and orange for the stroke color.
The chunk below maps a categorical variable to fill. cyl has 3 unique values, so we can provide a range of 3 color names.
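A nominal scale with three colors might look like this (the color names are assumptions):

```r
library(ggvis)

# One color per level of factor(cyl).
p_scale_nom <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl)) %>%
  layer_points() %>%
  scale_nominal("fill", range = c("pink", "lightgreen", "lightblue"))
```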
You can adjust any visual property in your graph with a scale (not just color). For example you can specify the opacity and the domain of axes.
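Scaling opacity and fixing an axis domain could be sketched as follows; the variables and the opacity range are assumed values.

```r
library(ggvis)

# Opacity driven by hp, restricted to the range 0.2 to 1.
p_opacity <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl), opacity = ~hp) %>%
  layer_points() %>%
  scale_numeric("opacity", range = c(0.2, 1))

# y-axis forced to start at 0; NA leaves the upper bound unset.
p_domain <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points() %>%
  scale_numeric("y", domain = c(0, NA))
```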
scale_numeric("y", domain = c(0, NA)) means the y-axis starts at 0 and its upper limit is not fixed but determined from the data.
Be aware: ggvis interactivity cannot be displayed in an HTML file converted from .Rmd by knitr.