Welcome to the GGally documentation page.

The following topic sections are alphabetically sorted.

The purpose of this function is to quickly plot the coefficients of a model.

To work automatically, this function requires the `broom`

package. Simply call `ggcoef`

with a model object. It could be the result of `lm`

, `glm`

or any other model covered by `broom`

and its `tidy`

method^{1}.

```
reg <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, data = iris)
ggcoef(reg)
```

`Loading required package: broom`

In the case of a logistic regression (or any other model for which coefficients are usually exponentiated), simply indicated `exponentiate = TRUE`

. Note that a logarithmic scale will be used for the x-axis.

```
d <- as.data.frame(Titanic)
log.reg <- glm(Survived ~ Sex + Age + Class, family = binomial, data = d, weights = d$Freq)
ggcoef(log.reg, exponentiate = TRUE)
```

You can use `conf.int`

, `vline`

and `exclude_intercept`

to display or not confidence intervals as error bars, a vertical line for `x = 0`

(or `x = 1`

if coeffcients are exponentiated) and the intercept.

`ggcoef(reg, vline = FALSE, conf.int = FALSE, exclude_intercept = TRUE)`

See the help page of `ggcoef`

for the full list of arguments that could be used to personalize how error bars and the vertical line are plotted.

```
ggcoef(
log.reg,
exponentiate = TRUE,
vline_color = "red",
vline_linetype = "solid",
errorbar_color = "blue",
errorbar_height = .25
)
```

Additional parameters will be passed to `geom_point`

.

`ggcoef(log.reg, exponentiate = TRUE, color = "purple", size = 5, shape = 18)`

Finally, you can also customize the aesthetic mapping of the points.

```
library(ggplot2)
ggcoef(log.reg, exponentiate = TRUE, mapping = aes(x = estimate, y = term, size = p.value)) +
scale_size_continuous(trans = "reverse")
```

You can also pass a custom data frame to `ggcoef`

. The following variables are expected:

`term`

(except if you customize the mapping)`estimate`

(except if you customize the mapping)`conf.low`

and`conf.high`

(only if you want to display error bars)

```
cust <- data.frame(
term = c("male vs. female", "30-49 vs. 18-29", "50+ vs. 18-29", "urban vs. rural"),
estimate = c(.456, 1.234, 1.897, 1.003),
conf.low = c(.411, 1.042, 1.765, 0.678),
conf.high = c(.498, 1.564, 2.034, 1.476),
variable = c("sex", "age", "age", "residence")
)
cust$term <- factor(cust$term, cust$term)
ggcoef(cust, exponentiate = TRUE)
```

```
ggcoef(
cust,
exponentiate = TRUE,
mapping = aes(x = estimate, y = term, colour = variable),
size = 5
)
```

The purpose of this function is to display two grouped data in a plot matrix. This is useful for canonical correlation analysis, multiple time series analysis, and regression analysis.

This example is derived from

`R Data Analysis Examples | Canonical Correlation Analysis. UCLA: Institute for Digital Research and Education. from http://www.stats.idre.ucla.edu/r/dae/canonical-correlation-analysis (accessed May 22, 2017).`

`Example 1. A researcher has collected data on three psychological variables, four academic variables (standardized test scores) and gender for 600 college freshman. She is interested in how the set of psychological variables relates to the academic variables and gender. In particular, the researcher is interested in how many dimensions (canonical variables) are necessary to understand the association between the two sets of variables."`

```
data(psychademic)
str(psychademic)
```

```
'data.frame': 600 obs. of 8 variables:
$ locus_of_control: num -0.84 -0.38 0.89 0.71 -0.64 1.11 0.06 -0.91 0.45 0 ...
$ self_concept : num -0.24 -0.47 0.59 0.28 0.03 0.9 0.03 -0.59 0.03 0.03 ...
$ motivation : chr "4" "3" "3" "3" ...
$ read : num 54.8 62.7 60.6 62.7 41.6 62.7 41.6 44.2 62.7 62.7 ...
$ write : num 64.5 43.7 56.7 56.7 46.3 64.5 39.1 39.1 51.5 64.5 ...
$ math : num 44.5 44.7 70.5 54.7 38.4 61.4 56.3 46.3 54.4 38.3 ...
$ science : num 52.6 52.6 58 58 36.3 58 45 36.3 49.8 55.8 ...
$ sex : chr "female" "female" "male" "male" ...
- attr(*, "academic")= chr "read" "write" "math" "science" ...
- attr(*, "psychology")= chr "locus_of_control" "self_concept" "motivation"
```

`(psych_variables <- attr(psychademic, "psychology"))`

`[1] "locus_of_control" "self_concept" "motivation" `

`(academic_variables <- attr(psychademic, "academic"))`

`[1] "read" "write" "math" "science" "sex" `

First, look at the within correlation using `ggpairs`

.

`ggpairs(psychademic, psych_variables, title = "Within Psychological Variables")`

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

`ggpairs(psychademic, academic_variables, title = "Within Academic Variables")`

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

Next, look at the between correlation using `ggduo`

.

```
ggduo(
psychademic, psych_variables, academic_variables,
types = list(continuous = "smooth_lm"),
title = "Between Academic and Psychological Variable Correlation",
xlab = "Psychological",
ylab = "Academic"
)
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

Since ggduo does not have a upper section to display the correlation values, we may use a custom function to add the information in the continuous plots. The strips may be removed as each group name may be recovered in the outer axis labels.

```
lm_with_cor <- function(data, mapping, ..., method = "pearson") {
x <- eval(mapping$x, data)
y <- eval(mapping$y, data)
cor <- cor(x, y, method = method)
ggally_smooth_lm(data, mapping, ...) +
ggplot2::geom_label(
data = data.frame(
x = min(x, na.rm = TRUE),
y = max(y, na.rm = TRUE),
lab = round(cor, digits = 3)
),
mapping = ggplot2::aes(x = x, y = y, label = lab),
hjust = 0, vjust = 1,
size = 5, fontface = "bold",
inherit.aes = FALSE # do not inherit anything from the ...
)
}
ggduo(
psychademic, rev(psych_variables), academic_variables,
mapping = aes(color = sex),
types = list(continuous = wrap(lm_with_cor, alpha = 0.25)),
showStrips = FALSE,
title = "Between Academic and Psychological Variable Correlation",
xlab = "Psychological",
ylab = "Academic",
legend = c(5,2)
) +
theme(legend.position = "bottom")
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

While displaying multiple time series vertically over time, such as `+ facet_grid(time ~ .)`

, `ggduo`

can handle both continuous and discrete data. `ggplot2`

does not mix discrete and continuous data on the same axis.

```
library(ggplot2)
data(pigs)
pigs_dt <- pigs[-(2:3)] # remove year and quarter
pigs_dt$profit_group <- as.numeric(pigs_dt$profit > mean(pigs_dt$profit))
qplot(
time, value,
data = reshape::melt.data.frame(pigs_dt, "time"),
geom = c("smooth", "point")
) +
facet_grid(variable ~ ., scales = "free_y")
```

``geom_smooth()` using method = 'loess'`

Instead, we may use `ggts`

to display the data. `ggts`

changes the default behavior of ggduo of `columnLabelsX`

to equal `NULL`

and allows for mixed variable types.

```
# make the profit group as a factor value
profit_groups <- c(
"1" = "high",
"0" = "low"
)
pigs_dt$profit_group <- factor(
profit_groups[as.character(pigs_dt$profit_group)],
levels = unname(profit_groups),
ordered = TRUE
)
ggts(pigs_dt, "time", 2:7)
```

``stat_bin()` using `bins = 30`. Pick better value with `binwidth`.`

```
# remove the binwidth warning
pigs_types <- list(
comboHorizontal = wrap(ggally_facethist, binwidth = 1)
)
ggts(pigs_dt, "time", 2:7, types = pigs_types)
```

```
# add color and legend
pigs_mapping <- aes(color = profit_group)
ggts(pigs_dt, pigs_mapping, "time", 2:7, types = pigs_types, legend = c(6,1))
```

Produce more meaningful labels, add a legend, and remove profit group strips.

```
pm <- ggts(
pigs_dt, pigs_mapping,
1, 2:7,
types = pigs_types,
legend = c(6,1),
columnLabelsY = c(
"number of\nfirst birth sows",
"sell price over\nfeed cost",
"sell count over\nheard size",
"meat head count",
"breading\nheard size",
"profit\ngroup"
),
showStrips = FALSE
) +
labs(fill = "profit group") +
theme(
legend.position = "bottom",
strip.background = element_rect(
fill = "transparent", color = "grey80"
)
)
pm
```

Since `ggduo`

may take custom functions just like `ggpairs`

, we will make a custom function that displays the residuals with a red line at 0 and all other y variables will receive a simple linear regression plot.

Note: the marginal residuals are calculated before plotting and the y_range is found to display all residuals on the same scale.

```
swiss <- datasets::swiss
# add a 'fake' column
swiss$Residual <- seq_len(nrow(swiss))
# calculate all residuals prior to display
residuals <- lapply(swiss[2:6], function(x) {
summary(lm(Fertility ~ x, data = swiss))$residuals
})
# calculate a consistent y range for all residuals
y_range <- range(unlist(residuals))
# custom function to display continuous data. If the y variable is "Residual", do custom work.
lm_or_resid <- function(data, mapping, ..., line_color = "red", line_size = 1) {
if (as.character(mapping$y) != "Residual") {
return(ggally_smooth_lm(data, mapping, ...))
}
# make residual data to display
resid_data <- data.frame(
x = data[[as.character(mapping$x)]],
y = residuals[[as.character(mapping$x)]]
)
ggplot(data = data, mapping = mapping) +
geom_hline(yintercept = 0, color = line_color, size = line_size) +
ylim(y_range) +
geom_point(data = resid_data, mapping = aes(x = x, y = y), ...)
}
# plot the data
ggduo(
swiss,
2:6, c(1,7),
types = list(continuous = lm_or_resid)
)
```

```
# change line to be thicker and blue and the points to be slightly transparent
ggduo(
swiss,
2:6, c(1,7),
types = list(
continuous = wrap(lm_or_resid,
alpha = 0.7,
line_color = "blue",
line_size = 3
)
)
)
```

This function rearranges data to be able to construct a glyph plot

```
library(ggplot2)
data(nasa)
temp.gly <- glyphs(nasa, "long", "day", "lat", "surftemp", height=2.5)
```

`Using width 2.38`

```
ggplot(temp.gly, ggplot2::aes(gx, gy, group = gid)) +
add_ref_lines(temp.gly, color = "grey90") +
add_ref_boxes(temp.gly, color = "grey90") +
geom_path() +
theme_bw() +
labs(x = "", y = "")
```

This shows a glyphplot of monthly surface temperature for 6 years over Central America. You can see differences from one location to another, that in large areas temperature doesn’t change much. There are large seasonal trends in the top left over land.

Rescaling in different ways puts emphasis on different components, see the examples in the referenced paper. And with ggplot2 you can make a map of the geographic area underlying the glyphs.

Wickham, H., Hofmann, H., Wickham, C. and Cook, D. (2012) Glyph-maps for Visually Exploring Temporal Patterns in Climate Data and Models, **Environmetrics**, *23*(5):151-182.

`ggmatrix`

is a function for managing multiple plots in a matrix-like layout. It was designed to adapt to any number of columns and rows. This allows for very customized plot matrices.

The examples below use plots labeled 1 to 6 to distinguish where the plots are being placed.

```
plotList <- list()
for (i in 1:6) {
plotList[[i]] <- ggally_text(paste("Plot #", i, sep = ""))
}
# bare minimum of plotList, nrow, and ncol
pm <- ggmatrix(plotList, 2, 3)
pm
```

```
# provide more information
pm <- ggmatrix(
plotList,
nrow = 2, ncol = 3,
xAxisLabels = c("A", "B", "C"),
yAxisLabels = c("D", "E"),
title = "Matrix Title"
)
pm
```

```
# display plots in column order
pm <- ggmatrix(
plotList,
nrow = 2, ncol = 3,
xAxisLabels = c("A", "B", "C"),
yAxisLabels = c("D", "E"),
title = "Matrix Title",
byrow = FALSE
)
pm
```

Individual plots may be retrieved from the plot matrix and can be placed in the plot matrix.

```
pm <- ggmatrix(
plotList,
nrow = 2, ncol = 3,
xAxisLabels = c("A", "B", "C"),
yAxisLabels = c("D", "E"),
title = "Matrix Title"
)
pm
```

```
p2 <- pm[1,2]
p3 <- pm[1,3]
p2
```

`p3`

```
pm[1,2] <- p3
pm[1,3] <- p2
pm
```

```
library(ggplot2)
pm <- ggmatrix(
plotList,
nrow = 2, ncol = 3,
xAxisLabels = c("A", "B", "C"),
yAxisLabels = c("D", "E"),
title = "Matrix Title",
byrow = FALSE
)
pm <- pm + theme_bw()
pm
```

The X and Y axis have booleans to turn on/off the individual plot’s axes on the bottom and left sides of the plot matrix. To save time, `showAxisPlotLabels`

can be set to override `showXAxisPlotLabels`

and `showYAxisPlotLabels`

.

```
pm <- ggmatrix(
plotList, nrow = 2, ncol = 3,
xAxisLabels = c("A", "B", "C"),
yAxisLabels = c("D", "E"),
title = "No Left Plot Axis",
showYAxisPlotLabels = FALSE
)
pm
```

```
pm <- ggmatrix(
plotList, nrow = 2, ncol = 3,
xAxisLabels = c("A", "B", "C"),
yAxisLabels = c("D", "E"),
title = "No Bottom Plot Axis",
showXAxisPlotLabels = FALSE
)
pm
```

```
pm <- ggmatrix(
plotList, nrow = 2, ncol = 3,
xAxisLabels = c("A", "B", "C"),
yAxisLabels = c("D", "E"),
title = "No Plot Axes",
showAxisPlotLabels = FALSE
)
pm
```

By default, the plots in the top row and the right most column will display top-side and right-side strips respectively (`showStrips = NULL`

). If all strips need to appear in each plot, `showStrips`

may be set to `TRUE`

. If all strips should not be displayed, `showStrips`

may be set to `FALSE`

.

```
data(tips, package = "reshape")
plotList <- list(
qplot(total_bill, tip, data = subset(tips, smoker == "No" & sex == "Female")) +
facet_grid(time ~ day),
qplot(total_bill, tip, data = subset(tips, smoker == "Yes" & sex == "Female")) +
facet_grid(time ~ day),
qplot(total_bill, tip, data = subset(tips, smoker == "No" & sex == "Male")) +
facet_grid(time ~ day),
qplot(total_bill, tip, data = subset(tips, smoker == "Yes" & sex == "Male")) +
facet_grid(time ~ day)
)
pm <- ggmatrix(
plotList, nrow = 2, ncol = 2,
yAxisLabels = c("Female", "Male"),
xAxisLabels = c("Non Smoker", "Smoker"),
title = "Total Bill vs Tip",
showStrips = NULL # default
)
pm
```

```
pm <- ggmatrix(
plotList, nrow = 2, ncol = 2,
yAxisLabels = c("Female", "Male"),
xAxisLabels = c("Non Smoker", "Smoker"),
title = "Total Bill vs Tip",
showStrips = TRUE
)
pm
```

```
pm <- ggmatrix(
plotList, nrow = 2, ncol = 2,
yAxisLabels = c("Female", "Male"),
xAxisLabels = c("Non Smoker", "Smoker"),
title = "Total Bill vs Tip",
showStrips = FALSE
)
pm
```

`ggnetworkmap`

is a function for plotting elegant maps using `ggplot2`

. It builds on `ggnet`

by allowing to draw a network over a map, and is particularly intended for use with `ggmap`

.

This example is based on a tutorial by Nathan Yau at Flowing Data.

```
suppressMessages(library(network))
suppressMessages(library(sna))
suppressMessages(library(maps))
suppressMessages(library(ggplot2))
airports <- read.csv("http://datasets.flowingdata.com/tuts/maparcs/airports.csv", header = TRUE)
rownames(airports) <- airports$iata
# select some random flights
set.seed(1234)
flights <- data.frame(
origin = sample(airports[200:400, ]$iata, 200, replace = TRUE),
destination = sample(airports[200:400, ]$iata, 200, replace = TRUE)
)
# convert to network
flights <- network(flights, directed = TRUE)
# add geographic coordinates
flights %v% "lat" <- airports[ network.vertex.names(flights), "lat" ]
flights %v% "lon" <- airports[ network.vertex.names(flights), "long" ]
# drop isolated airports
delete.vertices(flights, which(degree(flights) < 2))
# compute degree centrality
flights %v% "degree" <- degree(flights, gmode = "digraph")
# add random groups
flights %v% "mygroup" <- sample(letters[1:4], network.size(flights), replace = TRUE)
# create a map of the USA
usa <- ggplot(map_data("usa"), aes(x = long, y = lat)) +
geom_polygon(aes(group = group), color = "grey65",
fill = "#f9f9f9", size = 0.2)
delete.vertices(flights, which(flights %v% "lon" < min(usa$data$long)))
delete.vertices(flights, which(flights %v% "lon" > max(usa$data$long)))
delete.vertices(flights, which(flights %v% "lat" < min(usa$data$lat)))
delete.vertices(flights, which(flights %v% "lat" > max(usa$data$lat)))
# overlay network data to map
ggnetworkmap(usa, flights, size = 4, great.circles = TRUE,
node.group = mygroup, segment.color = "steelblue",
ring.group = degree, weight = degree)
```

`Loading required package: geosphere`

`Loading required package: sp`

`Loading required package: methods`

This next example uses data from a Twitter spam community identified while exploring and trying to clear-up a group of tweets. After coloring the nodes based on their centrality, the odd structure stood out clearly.

`data(twitter_spambots)`

```
# create a world map
world <- fortify(map("world", plot = FALSE, fill = TRUE))
world <- ggplot(world, aes(x = long, y = lat)) +
geom_polygon(aes(group = group), color = "grey65",
fill = "#f9f9f9", size = 0.2)
# view global structure
ggnetworkmap(world, twitter_spambots)
```

Is the network really concentrated in the U.S.? Probably not. One of the odd things about the network, is a much higher proportion of the users gave locations that could be geocoded, than Twitter users generally.

Let’s see the network topology

`ggnetworkmap(net = twitter_spambots, arrow.size = 0.5)`

Coloring nodes according to degree centrality can highlight network structures.

```
# compute indegree and outdegree centrality
twitter_spambots %v% "indegree" <- degree(twitter_spambots, cmode = "indegree")
twitter_spambots %v% "outdegree" <- degree(twitter_spambots, cmode = "outdegree")
ggnetworkmap(net = twitter_spambots,
arrow.size = 0.5,
node.group = indegree,
ring.group = outdegree, size = 4) +
scale_fill_continuous("Indegree", high = "red", low = "yellow") +
labs(color = "Outdegree")
```

Some Twitter attributes have been included as vertex attributes.

```
# show some vertex attributes associated with each account
ggnetworkmap(net = twitter_spambots,
arrow.size = 0.5,
node.group = followers,
ring.group = friends,
size = 4,
weight = indegree,
label.nodes = TRUE, vjust = -1.5) +
scale_fill_continuous("Followers", high = "red", low = "yellow") +
labs(color = "Friends") +
scale_color_continuous(low = "lightgreen", high = "darkgreen")
```

`ggnostic`

is a display wrapper to `ggduo`

that displays full model diagnostics for each given explanatory variable. By default, `ggduo`

displays the residuals, leave-one-out model sigma value, leverage points, and Cook’s distance against each explanatory variable. The rows of the plot matrix can be expanded to include fitted values, standard error of the fitted values, standardized residuals, and any of the response variables. If the model is a linear model, stars are added according to the anova significance of each explanatory variable.

Most diagnostic plots contain reference line(s) to help determine if the model is fitting properly

**residuals:**- Type key:
`".resid"`

- A solid line is located at the expected value of 0 with dashed lines at the 95% confidence interval. (0 ± 1.96 *
*σ*) - Plot function:
`ggally_nostic_resid`

. - See also
`stats::residuals`

- Type key:
**standardized residuals:**- Type key:
`".std.resid"`

- Same as residuals, except the standardized residuals equal the regular residuals divided by sigma. The dashed lines are located at 0 ± 1.96 * 1.
- Plot function:
`ggally_nostic_std_resid`

. - See also
`stats::rstandard`

- Type key:
**leave-one-out model sigma:**- Type key:
`".sigma"`

- A solid line is located at the full model’s sigma value.
- Plot function:
`ggally_nostic_sigma`

. - See also
`stats::influence`

’s value on`sigma`

- Type key:
**leverage points:**- Type key:
`".hat"`

- The expected value for the diagonal of a hat matrix is
*p*/*n*. Points are considered leverage points if they are large than 2 **p*/*n*, where the higher line is drawn. - Plot function:
`ggally_nostic_hat`

. - See also
`stats::influence`

’s value on`hat`

- Type key:
**Cook’s distance:**- Type key:
`".cooksd"`

- Points that are larger than 4/
*n*line are considered highly influential points. Plot function:`ggally_nostic_cooksd`

. See also`stats::cooks.distance`

- Type key:
**fitted points:**- Type key:
`".fitted"`

- No reference lines by default.
- Default plot function:
`ggally_points`

. - See also
`stats::predict`

- Type key:
**standard error of fitted points**:- Type key:
`".se.fit"`

- No reference lines by default.
- Plot function:
`ggally_nostic_se_fit`

. - See also
`stats::fitted`

- Type key:
**response variables:**- Type key: (response name in data.frame)
- No reference lines by default.
- Default plot function:
`ggally_points`

.

Looking at the dataset `datasets::state.x77`

, we will fit a multiple regression model for Life Expectancy.

```
# make a data.frame and fix column names
state <- as.data.frame(state.x77)
colnames(state)[c(4, 6)] <- c("Life.Exp", "HS.Grad")
str(state)
```

```
'data.frame': 50 obs. of 8 variables:
$ Population: num 3615 365 2212 2110 21198 ...
$ Income : num 3624 6315 4530 3378 5114 ...
$ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
$ Life.Exp : num 69 69.3 70.5 70.7 71.7 ...
$ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
$ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
$ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
$ Area : num 50708 566432 113417 51945 156361 ...
```

```
# fit full model
model <- lm(Life.Exp ~ ., data = state)
# reduce to "best fit" model with
model <- step(model, trace = FALSE)
summary(model)
```

```
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
data = state)
Residuals:
Min 1Q Median 3Q Max
-1.47095 -0.53464 -0.03701 0.57621 1.50683
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
Population 5.014e-05 2.512e-05 1.996 0.05201 .
Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
```

Next, we look at the variables for any high (|value| > 0.8) correlation values and general interaction behavior.

```
# look at variables for high correlation (none)
ggscatmat(state, columns = c("Population", "Murder", "HS.Grad", "Frost"))
```

All variables appear to be ok. Next, we look at the model diagnostics.

```
# look at model diagnostics
ggnostic(model)
```

```
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
```

- The residuals appear to be normally distributed. There are a couple residual outliers, but 2.5 outliers are expected.
- There are 5 leverage points according the diagonal of the hat matrix
- There are 2 leverage points according to Cook’s distance. One is
**much**larger than the other.

Let’s remove the largest data point first to try and define a better model.

```
# very high life expectancy
state[11, ]
```

```
Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
Hawaii 868 4963 1.9 73.6 6.2 61.9 0 6425
```

```
state_no_hawaii <- state[-11, ]
model_no_hawaii <- lm(Life.Exp ~ Population + Murder + HS.Grad + Frost, data = state_no_hawaii)
ggnostic(model_no_hawaii)
```

```
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
```

There are no more outrageous Cook’s distance values. The model without Hawaii appears to be a good fitting model.

`summary(model)`

```
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
data = state)
Residuals:
Min 1Q Median 3Q Max
-1.47095 -0.53464 -0.03701 0.57621 1.50683
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.103e+01 9.529e-01 74.542 < 2e-16 ***
Population 5.014e-05 2.512e-05 1.996 0.05201 .
Murder -3.001e-01 3.661e-02 -8.199 1.77e-10 ***
HS.Grad 4.658e-02 1.483e-02 3.142 0.00297 **
Frost -5.943e-03 2.421e-03 -2.455 0.01802 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7197 on 45 degrees of freedom
Multiple R-squared: 0.736, Adjusted R-squared: 0.7126
F-statistic: 31.37 on 4 and 45 DF, p-value: 1.696e-12
```

`summary(model_no_hawaii)`

```
Call:
lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
data = state_no_hawaii)
Residuals:
Min 1Q Median 3Q Max
-1.48967 -0.50158 0.01999 0.54355 1.11810
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.106e+01 8.998e-01 78.966 < 2e-16 ***
Population 6.363e-05 2.431e-05 2.618 0.0121 *
Murder -2.906e-01 3.477e-02 -8.357 1.24e-10 ***
HS.Grad 3.728e-02 1.447e-02 2.576 0.0134 *
Frost -3.099e-03 2.545e-03 -1.218 0.2297
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6796 on 44 degrees of freedom
Multiple R-squared: 0.7483, Adjusted R-squared: 0.7254
F-statistic: 32.71 on 4 and 44 DF, p-value: 1.15e-12
```

Since there is only a marginal improvement by removing Hawaii, the original model should be used to explain life expectancy.

The following lines of code will display different modle diagnostic plot matrices for the same statistical model. The first one is of the default settings. The second adds color according to the `species`

. Finally, the third displays all possible columns and uses `ggally_smooth`

to display the fitted points and response variables.

```
flea_model <- step(lm(head ~ ., data = flea), trace = FALSE)
summary(flea_model)
```

```
Call:
lm(formula = head ~ species + tars1 + tars2 + aede1, data = flea)
Residuals:
Min 1Q Median 3Q Max
-3.0152 -1.2315 -0.0048 1.1490 3.6068
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.63693 6.09162 0.105 0.917034
speciesHeikert. 1.13651 1.25452 0.906 0.368171
speciesHeptapot. 5.22207 0.92110 5.669 3.18e-07 ***
tars1 0.06815 0.01989 3.426 0.001043 **
tars2 0.10160 0.03297 3.081 0.002974 **
aede1 0.17069 0.04275 3.993 0.000163 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.6 on 68 degrees of freedom
Multiple R-squared: 0.685, Adjusted R-squared: 0.6618
F-statistic: 29.57 on 5 and 68 DF, p-value: 7.997e-16
```

```
# default output
ggnostic(flea_model)
```

```
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
```

```
# color'ed output
ggnostic(flea_model, mapping = ggplot2::aes(color = species))
```

```
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
```

```
# full color'ed output
ggnostic(
flea_model,
mapping = ggplot2::aes(color = species),
columnsY = c("head", ".fitted", ".se.fit", ".resid", ".std.resid", ".hat", ".sigma", ".cooksd"),
continuous = list(default = ggally_smooth, .fitted = ggally_smooth)
)
```

```
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
`geom_smooth()` using method = 'loess'
```

`ggpairs`

is a special form of a `ggmatrix`

that produces a pairwise comparison of multivariate data. By default, `ggpairs`

provides two different comparisons of each pair of columns and displays either the density or count of the respective variable along the diagonal. With different parameter settings, the diagonal can be replaced with the axis values and variable labels.

There are many hidden features within ggpairs. Please take a look at the examples below to get the most out of ggpairs.

The `columns`

displayed default to all columns of the provided `data`

. To subset to only a few columns, use the `columns`

parameter.

```
data(tips, package = "reshape")
pm <- ggpairs(tips)
pm
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

```
## too many plots for this example.
## reduce the columns being displayed
## these two lines of code produce the same plot matrix
pm <- ggpairs(tips, columns = c(1, 6, 2))
pm <- ggpairs(tips, columns = c("total_bill", "time", "tip"), columnLabels = c("Total Bill", "Time of Day", "Tip"))
pm
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

Aesthetics can be applied to every subplot with the `mapping`

parameter.

```
library(ggplot2)
pm <- ggpairs(tips, mapping = aes(color = sex), columns = c("total_bill", "time", "tip"))
pm
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

Since the plots are default plots (or are helper functions from GGally), the aesthetic color is altered to be appropriate. Looking at the example above, ‘tip’ vs ‘total_bill’ (pm[3,1]) needs the `color`

aesthetic, while ‘time’ vs ‘total_bill’ needs the `fill`

aesthetic. If custom functions are supplied, no aesthetic alterations will be done.

There are three major sections of the pairwise matrix: `lower`

, `upper`

, and `diag`

. The `lower`

and `upper`

may contain three plot types: `continuous`

, `combo`

, and `discrete`

. The ‘diag’ only contains either `continuous`

or `discrete`

.

`continuous`

: both X and Y are continuous variables`combo`

: one X and Y variable is discrete while the other is continuous`discrete`

: both X and Y are discrete variables

To make adjustments to each section, a list of information may be supplied. The list can be comprised of the following elements:

`continuous`

:- a character string representing the tail end of a
`ggally_NAME`

function, or a custom function - current valid
`upper$continuous`

and`lower$continuous`

character strings:`'points'`

,`'smooth'`

,`'density'`

,`'cor'`

,`'blank'`

- current valid
`diag$continuous`

character strings:`'densityDiag'`

,`'barDiag'`

,`'blankDiag'`

- a character string representing the tail end of a
`combo`

:- a character string representing the tail end of a
`ggally_NAME`

function, or a custom function. (not applicable for a`diag`

list) - current valid
`upper$combo`

and`lower$combo`

character strings:`'box'`

,`'dot'`

,`'facethist'`

,`'facetdensity'`

,`'denstrip'`

,`'blank'`

- a character string representing the tail end of a
`discrete`

:- a character string representing the tail end of a
`ggally_NAME`

function, or a custom function - current valid
`upper$discrete`

and`lower$discrete`

character strings:`'ratio'`

,`'facetbar'`

,`'blank'`

- current valid
`diag$discrete`

character strings:`'barDiag'`

,`'blankDiag'`

- a character string representing the tail end of a
`mapping`

: if mapping is provided, only the section’s mapping will be overwritten

```
library(ggplot2)
pm <- ggpairs(
tips, columns = c("total_bill", "time", "tip"),
lower = list(
continuous = "smooth",
combo = "facetdensity",
mapping = aes(color = time)
)
)
pm
```

A section list may be set to the character string `"blank"`

or `NULL`

if the section should be skipped when printed.

```
pm <- ggpairs(
tips, columns = c("total_bill", "time", "tip"),
upper = "blank",
diag = NULL
)
pm
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

The `ggally_NAME`

functions do not provide all graphical options. Instead of supplying a character string to a `continuous`

, `combo`

, or `discrete`

element within `upper`

, `lower`

, or `diag`

, a custom function may be given.

The custom function should follow the api of

```
custom_function <- function(data, mapping, ...){
# produce ggplot2 object here
}
```

There is no requirement to what happens within the function, as long as a ggplot2 object is returned.

```
my_bin <- function(data, mapping, ..., low = "#132B43", high = "#56B1F7") {
ggplot(data = data, mapping = mapping) +
geom_bin2d(...) +
scale_fill_gradient(low = low, high = high)
}
pm <- ggpairs(
tips, columns = c("total_bill", "time", "tip"),
lower = list(
continuous = my_bin
)
)
pm
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

The examples above use default parameters to each of the subplots. One of the immediate parameters to be set it `binwidth`

. This parameters is only needed in the lower, combination plots where one variable is continuous while the other variable is discrete.

To change the default parameter `binwidth`

setting, we will `wrap`

the function. `wrap`

first parameter should be a character string or a custom function. The remaining parameters supplied to wrap will be supplied to the function at run time.

```
pm <- ggpairs(
tips, columns = c("total_bill", "time", "tip"),
lower = list(
combo = wrap("facethist", binwidth = 1),
continuous = wrap(my_bin, binwidth = c(5, 0.5), high = "red")
)
)
pm
```

To get finer control over parameters, please look into custom functions.

Please look at the vignette for ggmatrix on plot matrix manipulations.

Small ggpairs example:

```
pm <- ggpairs(tips, columns = c("total_bill", "time", "tip"))
# retrieve the third row, first column plot
p <- pm[3,1]
p <- p + aes(color = time)
p
```

```
pm[3,1] <- p
pm
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

Please look at the vignette for ggmatrix on plot matrix manipulations.

Small ggpairs example:

```
pmBW <- pm + theme_bw()
pmBW
```

```
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

John W Emerson, Walton A Green, Barret Schloerke, Jason Crowley, Dianne Cook, Heike Hofmann, Hadley Wickham. **The Generalized Pairs Plot**. *Journal of Computational and Graphical Statistics*, vol. 22, no. 1, pp. 79-91, 2012.

The primary function is `ggscatmat`

. It is similar to `ggpairs`

but only works for purely numeric multivariate data. It is faster than ggpairs, because less choices need to be made. It creates a matrix with scatterplots in the lower diagonal, densities on the diagonal and correlations written in the upper diagonal. Syntax is to enter the dataset, the columns that you want to plot, a color column, and an alpha level.

```
data(flea)
ggscatmat(flea, columns = 2:4, color="species", alpha=0.8)
```

In this plot, you can see that the three different species vary a little from each other in these three variables. Heptapot (blue) has smaller values on the variable “tars1” than the other two. The correlation between the three variables is similar for all species.

John W Emerson, Walton A Green, Barret Schloerke, Jason Crowley, Dianne Cook, Heike Hofmann, Hadley Wickham. **The Generalized Pairs Plot**. *Journal of Computational and Graphical Statistics*, vol. 22, no. 1, pp. 79-91, 2012.

This function produces Kaplan-Meier plots using `ggplot2`

. As a first argument, `ggsurv`

needs a `survfit`

object, created by the `survival`

package. Default settings differ for single stratum and multiple strata objects.

```
require(ggplot2)
require(survival)
```

`Loading required package: survival`

`require(scales)`

`Loading required package: scales`

```
data(lung, package = "survival")
sf.lung <- survival::survfit(Surv(time, status) ~ 1, data = lung)
ggsurv(sf.lung)
```

The legend color positions matches the survival order or each stratum, where the stratums that end at a lower value or time have a position that is lower in the legend.

```
sf.sex <- survival::survfit(Surv(time, status) ~ sex, data = lung)
pl.sex <- ggsurv(sf.sex)
pl.sex
```

Since a ggplot2 object is returned, plot objects may be altered after the original creation.

```
pl.sex +
ggplot2::guides(linetype = FALSE) +
ggplot2::scale_colour_discrete(
name = 'Sex',
breaks = c(1, 2),
labels = c('Male', 'Female')
)
```

```
Scale for 'colour' is already present. Adding another scale for
'colour', which will replace the existing scale.
```

```
data(kidney, package = "survival")
sf.kid <- survival::survfit(Surv(time, status) ~ disease, data = kidney)
pl.kid <- ggsurv(sf.kid, plot.cens = FALSE)
pl.kid
```

```
# Zoom in to first 80 days
pl.kid + ggplot2::coord_cartesian(xlim = c(0, 80), ylim = c(0.45, 1))
```

```
pl.kid +
ggplot2::annotate(
"text",
label = c("PKD", "Other", "GN", "AN"),
x = c(90, 125, 5, 60),
y = c(0.8, 0.65, 0.55, 0.30),
size = 5,
colour = scales::hue_pal(
h = c(0, 360) + 15,
c = 100,
l = 65,
h.start = 0,
direction = 1
)(4)
) +
ggplot2::guides(color = FALSE, linetype = FALSE)
```