ggvis is an awesome data visualization package which builds data graphics with a syntax similar to
ggplot2 and creates rich interactive plots like
shiny. Because the syntax is highly structured, it is easy to learn and use.
ggvis recreates the grammar of graphics. The key syntax is like this:
You can identify 4 components in the chunk above:
$$ Graphic = Data + CoordinateSystem + Properties + Marks $$
For example, using built-in dataset
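A minimal sketch of such a call, assuming the built-in dataset meant here is mtcars (its columns hp and mpg appear later in the text). The four components are marked in the comments; the object name p is only illustrative.

```r
library(ggvis)

p <- mtcars %>%                 # Data
  ggvis(~hp, ~mpg,              # Coordinate system: x and y
        stroke := "blue") %>%   # Properties
  layer_points()                # Marks

# Print the object to render the plot in the viewer or browser:
# p
```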
Notice that the coordinates and properties can also be moved into the layer function, and that
ggvis() can generate a plot without
layer_<marks>(). Both points are covered in detail in the following sections.
ggvis allows multiple layers to be overlaid. When you put the coordinates and properties in
ggvis(), you declare them globally. That means the coordinates and properties are shared by all the layers that follow. For example, with
~hp, ~mpg, stroke := "blue" in
ggvis(), they are applied to every layer: in layer_points() the stroke sets the border color of the points, and in
layer_smooths() it sets the color of the smooth line. By default the
fill color of points is black.
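A small sketch of a global declaration, assuming the mtcars dataset and the two layers discussed in the text. The object name p_global is illustrative.

```r
library(ggvis)

# x, y and stroke are declared once in ggvis(), so both layers inherit them.
p_global <- mtcars %>%
  ggvis(~hp, ~mpg, stroke := "blue") %>%
  layer_points() %>%   # blue border, default black fill
  layer_smooths()      # blue smooth line

# p_global   # print to render
```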
When we do that locally, we put the properties in each layer:
The properties are now declared locally. Both layers use
stroke, but the property works separately in the 2 layers, and
fill has no effect on
layer_smooths(). Why do we keep
~hp, ~mpg in
ggvis()? Because they then do not have to be declared again in each layer. Keeping them in
ggvis() avoids the repetition.
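The local variant could look like the sketch below. The colors are arbitrary choices for illustration, not the ones from the original chunk.

```r
library(ggvis)

# Coordinates stay global; stroke and fill are declared per layer.
p_local <- mtcars %>%
  ggvis(~hp, ~mpg) %>%
  layer_points(stroke := "blue", fill := "white") %>%  # applies to the points only
  layer_smooths(stroke := "red")                       # applies to the smooth line only

# p_local   # print to render
```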
The most important symbols are = and
:=. You can think of them as “mapping” and “setting”.
There are 2 spaces when plotting something: a data space and a visual space. Colors, for example, live in the visual space as HTML color codes, RGB color codes, etc. If you provide a variable to specify a property with
= (normally followed by a tilde ~, which tells
ggvis to treat it as a variable),
ggvis first maps the variable's values onto a color scale before plotting.
If you instead pass a quoted string with :=,
ggvis reads it as a raw value.
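The difference between the two symbols can be sketched as follows, assuming mtcars and using fill as the example property.

```r
library(ggvis)

# Mapping (=): the values of cyl go through a color scale first.
p_mapped <- mtcars %>%
  ggvis(~hp, ~mpg, fill = ~factor(cyl)) %>%
  layer_points()

# Setting (:=): "red" is used as a raw value in the visual space.
p_set <- mtcars %>%
  ggvis(~hp, ~mpg, fill := "red") %>%
  layer_points()
```

In the first plot ggvis picks the colors and draws a legend; in the second every point is simply red.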
The distinction between setting and mapping applies only to properties, not to parameters.
For a parameter, you can directly use
= followed by the value.
%>% comes from the package
magrittr and is widely used in
dplyr. It is the chaining operator and makes the code more readable.
If you don’t specify a layer type,
ggvis will use
layer_guess() to pick a suitable one (see
?ggvis::layer_guess). Besides the magic
layer_guess(), I’d strongly recommend learning the more specific layers. So far we have only shown you 2 kinds of properties. The basic layers and properties are listed below (column names are
layer_<marks> functions, row names are properties):
| x / x2 | O | O | O | O | O | O | O | O |
| y / y2 | O | O | O | O | O | O | O | O |
O means supported,
X means not supported.
Bar charts show the counts at each unique x value, in contrast to a histogram, which bins along ranges of x. Bar charts and histograms both have a
width argument. However, in the former it sets the column width in the visual space, while in the latter it sets the bin width used to group the continuous data on the x-axis.
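To make the two meanings of width concrete, here is a sketch that assumes mtcars; the width values are arbitrary.

```r
library(ggvis)

# Bar chart: one bar per unique value of cyl; width is the bar width.
p_bars <- mtcars %>%
  ggvis(~cyl) %>%
  layer_bars(width = 0.5)

# Histogram: width is the bin width used to group the continuous variable wt.
p_hist <- mtcars %>%
  ggvis(~wt) %>%
  layer_histograms(width = 0.5)
```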
A frequency polygon treats continuous data with the same logic as a histogram but uses a line to describe how the frequency changes across ranges. Notice that I use
fillOpacity instead of
opacity in the third plot. That means the transparency is not applied to the stroke. By default nothing is filled under the curve of a frequency polygon; because we specify a fill, the region is filled with transparent red.
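A sketch of a filled frequency polygon along these lines, assuming mtcars; the bin width and opacity are arbitrary and this is not the original third plot.

```r
library(ggvis)

# fillOpacity makes only the filled region transparent; the stroke stays opaque.
p_freq <- mtcars %>%
  ggvis(~mpg) %>%
  layer_freqpolys(width = 2, fill := "red", fillOpacity := 0.3)

# p_freq   # print to render
```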
width is also a parameter of
layer_boxplots(); its default value is 0.9. This parameter specifies the width of the boxes, and therefore the spacing between groups. Besides the usual properties such as
stroke, etc., you can set the
size of outliers.
layer_boxplots() seems to have a small problem in
ggvis (version 0.4.2) when we modify the value of
size: the whiskers move but the boxes don’t. We are waiting for a package update.
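A minimal boxplot sketch, assuming mtcars with cyl converted to a factor for grouping; the width value is only an example.

```r
library(ggvis)

# One box per cyl group; width = 0.5 makes each box half as wide as its slot.
p_box <- mtcars %>%
  ggvis(~factor(cyl), ~mpg) %>%
  layer_boxplots(width = 0.5)

# p_box   # print to render
```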
Density plots provide another way to display the distribution of a single variable. A density plot uses a line to display the density of a variable at each point in its range. You can think of a density plot as a continuous version of a histogram with a different y scale (although this is not exactly accurate).
You can set the
area parameter to decide whether a shaded region is drawn under the curve. In the chunk above, even though you assign “red” to
fill, there is nothing under the density curve (the default setting is to draw a grey shaded area).
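The effect of area can be sketched as follows, assuming mtcars; this is a stand-in for the chunk referred to above, not a copy of it.

```r
library(ggvis)

# area = FALSE: only the density line is drawn, so fill has nothing to color.
p_line_only <- mtcars %>%
  ggvis(~mpg) %>%
  layer_densities(area = FALSE, fill := "red")

# area = TRUE (the default): the region under the curve is shaded.
p_shaded <- mtcars %>%
  ggvis(~mpg) %>%
  layer_densities(fill := "red")
```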
First, we compare these 2 layers, layer_lines() and layer_paths():
The output of
layer_paths() looks chaotic, but it is not. It connects the records in the order they appear in the data, from the very first to the last. Let’s reorder the dataset.
After reordering, layer_paths() shows the same trend as layer_lines().
layer_paths() is more powerful for geographical plots.
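The comparison can be sketched like this, assuming mtcars with wt on the x-axis; the reordering uses base R.

```r
library(ggvis)

# layer_lines() orders the points along x before connecting them.
p_lines <- mtcars %>% ggvis(~wt, ~mpg) %>% layer_lines()

# layer_paths() connects the rows in the order they appear in the data.
p_paths <- mtcars %>% ggvis(~wt, ~mpg) %>% layer_paths()

# After sorting the rows by x, the path follows the same trend as the line.
mtcars_sorted <- mtcars[order(mtcars$wt), ]
p_paths_sorted <- mtcars_sorted %>% ggvis(~wt, ~mpg) %>% layer_paths()
```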
lines and paths plot can also use
We have already used
layer_points() many times. Note that the shape of the points can take many values:
cyl from numeric to categorical data, which makes the plot clearer.
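A sketch of both ways to use shape, assuming mtcars. The conversion with factor() is an assumption about how cyl was made categorical, and "diamond" is one of the symbol names I believe the underlying Vega library accepts.

```r
library(ggvis)

# Mapping: each level of the categorical cyl gets its own shape.
p_shape_mapped <- mtcars %>%
  ggvis(~wt, ~mpg, shape = ~factor(cyl)) %>%
  layer_points()

# Setting: every point uses the same raw shape value.
p_shape_set <- mtcars %>%
  ggvis(~wt, ~mpg, shape := "diamond") %>%
  layer_points()
```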
layer_model_predictions() fits a model to the data and draws it with
layer_paths() and, optionally, layer_ribbons().
layer_smooths() is a special case of layering model predictions where the model is a smooth
loess curve whose smoothness is controlled by the span parameter. Both use the same properties as
If you don’t specify the
formula argument,
ggvis will guess it from the x and y variables declared globally in ggvis().
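A sketch combining both layers, assuming mtcars. No formula is given, so ggvis is left to guess it from the global x and y; the span value is arbitrary.

```r
library(ggvis)

p_models <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE) %>%  # straight line plus error band
  layer_smooths(span = 0.5, stroke := "red")            # loess curve; smaller span is wigglier

# p_models   # print to render
```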
There are more properties and
layer_<marks>(); you can look them up by typing
Some layers can be built in another way. Consider how
layer_model_predictions() works: 1. estimate the model from the dataset; 2. compute the predicted data; 3. plot the predicted line.
compute_model_prediction() carries out the first 2 steps.
The function returns 2 columns
The chunk draws the same line as
layer_model_predictions() did before.
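The three steps can be sketched as follows, assuming mtcars and a linear model of mpg on wt; the model choice is an assumption for illustration.

```r
library(ggvis)

# Steps 1 and 2: fit the model and compute the predicted values.
pred <- mtcars %>% compute_model_prediction(mpg ~ wt, model = "lm")
names(pred)

# Step 3: draw the predictions as a path.
p_pred <- pred %>%
  ggvis(~pred_, ~resp_) %>%
  layer_paths()
```

Here pred_ holds the x values at which the model is evaluated and resp_ the predicted response.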
layer_densities() can be split into 2 steps in the same way.
The layers and their corresponding compute functions are related as follows:
compute_bin() returns 5 columns, where
count_ is the count in each interval, and xmin_ and
xmax_ are the left and right borders of the interval. We can use these 3 columns to draw the rectangles for the intervals.
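A hand-built histogram along these lines might look like the sketch below. It assumes mtcars, an arbitrary bin width of 2, and a baseline of 0 for the rectangles.

```r
library(ggvis)

# Bin the data first, then inspect the computed columns.
binned <- mtcars %>% compute_bin(~mpg, width = 2)
names(binned)

# Draw one rectangle per bin from xmin_ to xmax_, with height count_.
p_rects <- binned %>%
  ggvis(x = ~xmin_, x2 = ~xmax_, y = 0, y2 = ~count_) %>%
  layer_rects()
```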
resp_ - connecting them with
layer_lines() gives the exact density line.
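The density version of the same idea, sketched under the assumption that mtcars and the default kernel settings are used:

```r
library(ggvis)

# Step 1: estimate the density of mpg.
dens <- mtcars %>% compute_density(~mpg)
names(dens)

# Step 2: connect the estimated points with a line.
p_dens <- dens %>%
  ggvis(~pred_, ~resp_) %>%
  layer_lines()
```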
compute_tabulate() returns only 2 columns: count_ and
x_, that is, the count observed at each level of the data.
compute_align() sets the lower and upper bounds of each interval.
length limits the length of an interval. It’s equivalent to the
compute_stack() generates 3 new variables based on data source:
stack_lwr_. The latter 2 variables indicate the upper and lower y coordinates of each stack.
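These compute functions can be chained to rebuild a bar chart by hand. The sketch below assumes mtcars, counts by cyl, and an arbitrary bar length of 0.5; it approximates what a bar layer does rather than reproducing the original chunk.

```r
library(ggvis)

stacked <- mtcars %>%
  compute_count(~cyl) %>%                                 # count_ and x_
  compute_align(~x_, length = 0.5) %>%                    # adds xmin_ and xmax_
  compute_stack(stack_var = ~count_, group_var = ~x_)     # adds stack_upr_ and stack_lwr_

p_stack <- stacked %>%
  ggvis(x = ~xmin_, x2 = ~xmax_, y = ~stack_upr_, y2 = ~stack_lwr_) %>%
  layer_rects()
```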
Try another classic stacked barchart:
Building a layer from its compute steps helps you understand how the high-level layer is generated, and gives you a way to grab the key data used for plotting.
ggvis works well together with the
dplyr package. For example, to recreate the plot above:
group_by() also works for multiple variables.
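A small sketch of grouping with dplyr, assuming mtcars. This is an illustrative grouped plot, not the one the original chunk recreated.

```r
library(ggvis)
library(dplyr)

# One fitted line per cyl group; the grouping is done by dplyr before plotting.
p_grouped <- mtcars %>%
  group_by(cyl) %>%
  ggvis(~wt, ~mpg, stroke = ~factor(cyl)) %>%
  layer_model_predictions(model = "lm")

# Several grouping variables can be passed in the same way, e.g. group_by(cyl, am).
```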
In this article, all interactivity is disabled; only screenshots are provided.
ggvis enables interactive HTML output by a series of
input_<form_controls>() functions. There are 7 input widgets:
input_checkbox() creates an interactive checkbox;
input_checkboxgroup(), input_radiobuttons() and input_select() create an interactive control to select one or more options from a list;
input_slider() creates an interactive slider;
input_numeric() and input_text() create an interactive numeric or text input box.
You can also split the plotting steps if the input clause is too long.
input_checkboxgroup() allows multiple choice.
input_radiobuttons() allows only one option to be picked from the list.
map should be a function that takes a single argument and returns a modified value. When you map a variable name to a property, remember to use
= instead of :=.
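A sketch of map with a dropdown, assuming mtcars; the choice of columns and the label are illustrative. as.name turns the selected string into a variable name, which is why = is used.

```r
library(ggvis)

# The dropdown picks which column is mapped to x.
p_select <- mtcars %>%
  ggvis(x = input_select(c("wt", "hp", "disp"), map = as.name, label = "x variable")) %>%
  layer_points(y = ~mpg)

# Printing p_select in an interactive R session shows the control next to the plot.
```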
A slider is quite useful for controlling the argument of continuous data, for example, control the binwidth of a histogram:
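A minimal sketch of such a slider, assuming mtcars; the range, starting value and label are arbitrary.

```r
library(ggvis)

# width is a parameter, so the slider is passed with a plain =.
p_slider <- mtcars %>%
  ggvis(~wt) %>%
  layer_histograms(
    width = input_slider(0.1, 2, value = 0.5, step = 0.1, label = "Bin width")
  )

# Printing p_slider in an interactive R session shows the slider below the plot.
```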
To control the binwidth, you could also directly assign a value by
To control the fill color, you could use
add_axis() to adjust the axis.
The first argument specifies the horizontal or vertical axis. Use
title to name the axes. You can add more details:
Compare the 2 plots carefully and you will see what those arguments are for.
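A sketch of axis adjustments, assuming mtcars; the titles, tick values and subdivisions are illustrative choices.

```r
library(ggvis)

p_axes <- mtcars %>%
  ggvis(~wt, ~mpg) %>%
  layer_points() %>%
  add_axis("x", title = "Weight", orient = "top") %>%   # move the x-axis to the top
  add_axis("y", title = "Miles per gallon",
           values = seq(10, 35, by = 5),                # where the labelled ticks go
           subdivide = 4)                               # minor ticks between them

# p_axes   # print to render
```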
add_legend() works similarly to
add_axis(), except that it alters the legend of a plot. Instead of specifying which axis to change, you have to specify the property you want to add to the legend. For example:
By default ggvis creates a separate legend for each property that you use. To combine them into a single legend, you just need to feed
add_legend() a vector of property names as its first argument. The code below creates one legend for 3 properties:
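A sketch of both uses of add_legend(), assuming mtcars with cyl as a factor. The three properties chosen here (fill, stroke, shape) and the titles are assumptions for illustration.

```r
library(ggvis)

# Adjust the legend of a single property.
p_legend <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl)) %>%
  layer_points() %>%
  add_legend("fill", title = "Cylinders")

# Feed a vector of property names to handle several properties in one call.
p_legend3 <- mtcars %>%
  ggvis(~wt, ~mpg,
        fill = ~factor(cyl), stroke = ~factor(cyl), shape = ~factor(cyl)) %>%
  layer_points() %>%
  add_legend(c("fill", "stroke", "shape"), title = "Cylinders")
```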
ggvis provides several different functions for creating scales:
scale_singular(). Each maps a different type of data input to the visual properties that
The chunk above maps the value of
disp on the scale range between
fill color, between
The chunk below maps a categorical variable to fill.
cyl has 3 unique values so we can provide a range of length 3 with the color names.
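A sketch of a categorical fill scale, assuming mtcars; the three color names are arbitrary.

```r
library(ggvis)

# cyl has 3 levels, so the range lists 3 colors, one per level.
fill_colors <- c("pink", "lightblue", "lightgreen")

p_nominal <- mtcars %>%
  ggvis(~wt, ~mpg, fill = ~factor(cyl)) %>%
  layer_points() %>%
  scale_nominal("fill", range = fill_colors)
```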
You can adjust any visual property in your graph with a scale (not just color). For example you can specify the opacity and the domain of axes.
scale_numeric("y", domain = c(0, NA)) fixes the lower limit of the y-axis at 0 and leaves the upper limit to be determined from the data.
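Both adjustments can be sketched together, assuming mtcars; mapping hp to opacity and the range of 0.2 to 1 are illustrative choices.

```r
library(ggvis)

p_scaled <- mtcars %>%
  ggvis(~wt, ~mpg, opacity = ~hp) %>%
  layer_points() %>%
  scale_numeric("opacity", range = c(0.2, 1)) %>%  # low hp is faint, high hp is solid
  scale_numeric("y", domain = c(0, NA))            # y starts at 0, upper end from the data

# p_scaled   # print to render
```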
Be aware: ggvis interactivity cannot be displayed in HTML file converted from