ggstatsplot
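Introduction
ggstatsplot is an extension of the ggplot2 package. It produces polished plots that carry the results of the relevant statistical analysis, including the details of the test, directly on the figure. This makes it especially useful for researchers who run statistical analyses routinely.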
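Data visualization and statistical modeling are usually treated as two separate stages. The core idea of ggstatsplot is simple: merge the two stages by producing plots that carry the statistical details, which makes data exploration simpler and faster.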
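On the statistics side, ggstatsplot currently supports the most common types of tests: t-test/ANOVA, nonparametric tests, correlation analysis, contingency table analysis, and regression analysis. On the plotting side, it provides:
(1) violin plots (for comparing continuous data across groups);
(2) pie charts (for testing the distribution of categorical data);
(3) bar charts (for testing the distribution of categorical data);
(4) scatterplots (for the correlation between two variables);
(5) correlation matrices (for the correlations among several variables);
(6) histograms and dot plots/charts (for hypothesis tests about a distribution);
(7) dot-and-whisker plots (for regression models).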
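Commonly used functions in the ggstatsplot package
The ggbetweenstats function creates violin plots, box plots, or a mix of the two. It is mainly used to compare continuous data between groups or conditions. The simplest call looks like this: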
# loading needed libraries
library(ggstatsplot)

# for reproducibility
set.seed(123)

# plot
ggstatsplot::ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  messages = FALSE
) + # further modification outside of ggstatsplot
  ggplot2::coord_cartesian(ylim = c(3, 8)) +
  ggplot2::scale_y_continuous(breaks = seq(3, 8, by = 1))
The plot shows that Sepal.Length differs significantly among the iris species. The arguments can also be adjusted to make the plot more informative.
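The significance claim can be cross-checked in base R. The following is a minimal sketch that assumes the parametric test behind the subtitle is a one-way ANOVA with Welch's correction. The exact test depends on the ggstatsplot version and on the `type` argument, so the choice of `oneway.test()` here is an assumption, not a description of the package internals.

```r
# Welch's one-way ANOVA on the built-in iris data (base R only).
# Assumption: this mirrors the parametric test summarized in the plot subtitle.
welch_anova <- stats::oneway.test(
  formula = Sepal.Length ~ Species,
  data = iris,
  var.equal = FALSE
)
print(welch_anova)

# Classical (equal-variance) one-way ANOVA, shown for comparison.
classic_anova <- stats::aov(Sepal.Length ~ Species, data = iris)
print(summary(classic_anova))
```

Both tests are omnibus tests: they indicate that at least one species differs from the others, not which pairs differ.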
library(ggplot2)

# for reproducibility
set.seed(123)

# let's leave out one of the factor levels and see if instead of anova, a t-test will be run
iris2 <- dplyr::filter(.data = iris, Species != "setosa")

# let's change the levels of our factors, a common routine in data analysis
# pipeline, to see if this function respects the new factor levels
iris2$Species <-
  base::factor(
    x = iris2$Species,
    levels = c("virginica", "versicolor")
  )

# plot
ggstatsplot::ggbetweenstats(
  data = iris2,
  x = Species,
  y = Sepal.Length,
  notch = TRUE, # show notched box plot
  mean.plotting = TRUE, # whether mean for each group is to be displayed
  mean.ci = TRUE, # whether to display confidence interval for means
  mean.label.size = 2.5, # size of the label for mean
  type = "p", # which type of test is to be run
  k = 3, # number of decimal places for statistical results
  outlier.tagging = TRUE, # whether outliers need to be tagged
  outlier.label = Sepal.Width, # variable to be used for the outlier tag
  outlier.label.color = "darkgreen", # changing the color for the text label
  xlab = "Type of Species", # label for the x-axis variable
  ylab = "Attribute: Sepal Length", # label for the y-axis variable
  title = "Dataset: Iris flower data set", # title text for the plot
  ggtheme = ggthemes::theme_fivethirtyeight(), # choosing a different theme
  ggstatsplot.layer = FALSE, # turn off ggstatsplot theme layer
  package = "wesanderson", # package from which color palette is to be taken
  palette = "Darjeeling1", # choosing a different color palette
  messages = FALSE
)
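The comment in the call above asks whether a t-test replaces the ANOVA once only two groups remain. A base R sketch of that two-group comparison is shown here. It assumes Welch's unequal-variance t-test as the parametric choice, and the name `iris_two` is introduced only for this illustration.

```r
# Keep two species and drop the now-unused factor level (base R only).
iris_two <- droplevels(subset(iris, Species != "setosa"))

# Welch's two-sample t-test on sepal length.
# Assumption: this corresponds to the parametric (type = "p") two-group test.
welch_t <- stats::t.test(
  formula = Sepal.Length ~ Species,
  data = iris_two,
  var.equal = FALSE
)
print(welch_t)
```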
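The ggwithinstats function works almost identically to ggbetweenstats, but it is intended for within-subjects (repeated measures) designs.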
# for reproducibility and data
set.seed(123)
data("iris")

ggstatsplot::ggwithinstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  messages = FALSE
)
# plot
ggstatsplot::ggwithinstats(
  data = iris,
  x = Species,
  y = Sepal.Length,
  sort = "descending", # ordering groups along the x-axis based on
  sort.fun = median, # values of `y` variable
  pairwise.comparisons = TRUE,
  pairwise.display = "s",
  pairwise.annotation = "p",
  title = "iris",
  caption = "Data from: iris",
  ggtheme = ggthemes::theme_fivethirtyeight(),
  ggstatsplot.layer = FALSE,
  messages = FALSE
)
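The ggscatterstats function creates a scatterplot with marginal histograms, box plots, density plots, violin plots, or densigrams from ggExtra::ggMarginal, and displays the results of the statistical analysis in the subtitle: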
ggstatsplot::ggscatterstats(
  data = ggplot2::msleep,
  x = sleep_rem,
  y = awake,
  xlab = "REM sleep (in hours)",
  ylab = "Amount of time spent awake (in hours)",
  title = "Understanding mammalian sleep",
  messages = FALSE
)
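The plot shows that sleep_rem and awake are correlated, with sleep_rem on the x-axis and awake on the y-axis. The histograms along the top and right edges show the distribution of each variable: the more observations fall in an interval, the taller its bar.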
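The correlation described above can be checked directly with `stats::cor.test()`. This sketch assumes that the default analysis is a Pearson correlation, which may differ between ggstatsplot versions.

```r
# Pearson correlation between REM sleep and time awake in ggplot2::msleep.
# cor.test() drops incomplete pairs itself, so the missing values in
# sleep_rem need no special handling.
# Assumption: Pearson's r is the parametric default reported in the subtitle.
sleep <- ggplot2::msleep

rem_awake <- stats::cor.test(
  x = sleep$sleep_rem,
  y = sleep$awake,
  method = "pearson"
)
print(rem_awake)
```

The sign of the estimate gives the direction of the relationship, and the confidence interval indicates how precisely it is estimated.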
# for reproducibility
set.seed(123)

# plot
ggstatsplot::ggscatterstats(
  data = dplyr::filter(.data = ggstatsplot::movies_long, genre == "Action"),
  x = budget,
  y = rating,
  type = "robust", # type of test that needs to be run
  conf.level = 0.99, # confidence level
  xlab = "Movie budget (in million US$)", # label for x axis
  ylab = "IMDB rating", # label for y axis
  label.var = "title", # variable for labeling data points
  label.expression = "rating < 5 & budget > 100", # expression that decides which points to label
  line.color = "yellow", # changing regression line color
  title = "Movie budget and IMDB rating (action)", # title text for the plot
  caption = expression( # caption text for the plot
    paste(italic("Note"), ": IMDB stands for Internet Movie DataBase")
  ),
  ggtheme = ggplot2::theme_bw(), # choosing a different theme
  ggstatsplot.layer = FALSE, # turn off ggstatsplot theme layer
  marginal.type = "density", # type of marginal distribution to be displayed
  xfill = "#0072B2", # color fill for x-axis marginal distribution
  yfill = "#009E73", # color fill for y-axis marginal distribution
  xalpha = 0.6, # transparency for x-axis marginal distribution
  yalpha = 0.6, # transparency for y-axis marginal distribution
  centrality.para = "median", # central tendency lines to be displayed
  messages = FALSE # turn off messages and notes
)
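The ggbarstats function shows how categorical data are distributed across groups. For example, it can show whether the ratio of men to women among the patients in group A differs from the ratio in group B.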
# for reproducibility
set.seed(123)

# plot
ggstatsplot::ggbarstats(
  data = ggstatsplot::movies_long,
  main = mpaa,
  condition = genre,
  sampling.plan = "jointMulti",
  title = "MPAA Ratings by Genre",
  xlab = "movie genre",
  perc.k = 1,
  x.axis.orientation = "slant",
  ggtheme = hrbrthemes::theme_modern_rc(),
  ggstatsplot.layer = FALSE,
  ggplot.component = ggplot2::theme(axis.text.x = ggplot2::element_text(face = "italic")),
  palette = "Set2",
  messages = FALSE
)
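This plot compares whether the distribution of a categorical variable differs between groups. As before, the arguments can be adjusted to make the plot richer and more attractive.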
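The patient example mentioned for ggbarstats (sex ratio in group A versus group B) corresponds to a test of independence on a 2 x 2 contingency table. Here is a base R sketch using Pearson's chi-squared test. The counts are hypothetical and invented purely for illustration, and the choice of `chisq.test()` is an assumption about the kind of test involved, not a statement about what ggbarstats computes internally.

```r
# Hypothetical counts, invented for illustration only (not real patient data).
counts <- matrix(
  c(30, 20,
    15, 35),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(group = c("A", "B"), sex = c("male", "female"))
)

# Pearson's chi-squared test of independence, without continuity correction.
independence_test <- stats::chisq.test(counts, correct = FALSE)
print(independence_test)

# Row-wise proportions: the sex ratio within each group.
print(prop.table(counts, margin = 1))
```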
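The gghistostats function is for looking at the distribution of a single variable and checking, with a one-sample test, whether it differs significantly from a specified value.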
ggstatsplot::gghistostats(
  data = ToothGrowth, # dataframe from which variable is to be taken
  x = len, # numeric variable whose distribution is of interest
  title = "Distribution of tooth length", # title for the plot
  fill.gradient = TRUE, # use color gradient
  test.value = 10, # the comparison value for t-test
  test.value.line = TRUE, # display a vertical line at test value
  type = "bf", # bayes factor for one sample t-test
  bf.prior = 0.8, # prior width for calculating the bayes factor
  messages = FALSE # turn off the messages
)
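The call above requests a Bayes factor. As a simpler point of comparison, the frequentist counterpart is a one-sample t-test against the same test value of 10, sketched here in base R. This is a substitute technique shown for orientation, not the Bayesian analysis itself.

```r
# One-sample t-test: does mean tooth length differ from the test value 10?
# Note: this is the frequentist counterpart, not the Bayes factor (type = "bf").
one_sample <- stats::t.test(x = ToothGrowth$len, mu = 10)
print(one_sample)

# The sample mean that is being compared with the test value.
print(mean(ToothGrowth$len))
```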
The ggdotplotstats function is similar to gghistostats, but it is used when each value of the numeric variable also has a label.
# for reproducibility
set.seed(123)

# plot
ggdotplotstats(
  data = dplyr::filter(.data = gapminder::gapminder, continent == "Asia"),
  y = country,
  x = lifeExp,
  test.value = 55,
  test.value.line = TRUE,
  test.line.labeller = TRUE,
  test.value.color = "red",
  centrality.para = "median",
  centrality.k = 0,
  title = "Distribution of life expectancy in Asian continent",
  xlab = "Life expectancy",
  messages = FALSE,
  caption = substitute(
    paste(
      italic("Source"),
      ": Gapminder dataset from https://www.gapminder.org/"
    )
  )
)
The ggcorrmat function is mainly used to analyze the correlations among several variables.
# for reproducibility
set.seed(123)

# as a default this function outputs a correlogram plot
ggstatsplot::ggcorrmat(
  data = ggplot2::msleep,
  corr.method = "robust", # correlation method
  sig.level = 0.001, # threshold of significance
  p.adjust.method = "holm", # p-value adjustment method for multiple comparisons
  cor.vars = c(sleep_rem, awake:bodywt), # a range of variables can be selected
  cor.vars.names = c(
    "REM sleep", # variable names
    "time awake",
    "brain weight",
    "body weight"
  ),
  matrix.type = "upper", # type of visualization matrix
  colors = c("#B2182B", "white", "#4D4D4D"),
  title = "Correlogram for mammals sleep dataset",
  subtitle = "sleep units: hours; weight units: kilograms"
)
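To make the roles of the correlation matrix and the `"holm"` adjustment concrete, the sketch below computes pairwise correlations for the same four variables in base R and adjusts the p-values with `stats::p.adjust()`. It deliberately uses Pearson correlations rather than the robust method requested in the call above, so the numbers are not expected to match the plot.

```r
# The four variables selected by cor.vars = c(sleep_rem, awake:bodywt).
vars <- c("sleep_rem", "awake", "brainwt", "bodywt")
sleep <- as.data.frame(ggplot2::msleep)[, vars]

# Correlation matrix from pairwise-complete observations.
# Note: Pearson is used here as a simpler stand-in for the robust method.
cor_matrix <- stats::cor(sleep, use = "pairwise.complete.obs", method = "pearson")

# One raw p-value per unique pair of variables.
pairs <- utils::combn(vars, 2)
p_raw <- apply(pairs, 2, function(pair) {
  stats::cor.test(sleep[[pair[1]]], sleep[[pair[2]]])$p.value
})
names(p_raw) <- apply(pairs, 2, paste, collapse = " ~ ")

# Holm adjustment for multiple comparisons.
p_holm <- stats::p.adjust(p_raw, method = "holm")

print(round(cor_matrix, 2))
print(p_holm)
```

With four variables there are six unique pairs, so six tests are adjusted together. A Holm-adjusted p-value is never smaller than the raw one.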
The ggcoefstats function plots the point estimates of the regression coefficients as dots with confidence intervals.
# for reproducibility
set.seed(123)

# model
mod <- stats::lm(
  formula = mpg ~ am * cyl,
  data = mtcars
)

# plot
ggstatsplot::ggcoefstats(x = mod)
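The quantities that a dot-and-whisker plot displays, namely the point estimates and their confidence intervals, can be read from the fitted model with base R. This is a minimal sketch. The 95% level and the column names of `coef_table` are choices made for this illustration.

```r
# Same model as above: mpg regressed on am, cyl and their interaction.
mod <- stats::lm(formula = mpg ~ am * cyl, data = mtcars)

# Point estimates with standard errors, t statistics and p-values.
estimates <- coef(summary(mod))

# Confidence intervals for each coefficient (95% is an illustrative choice).
intervals <- stats::confint(mod, level = 0.95)

# One row per term: the dot (estimate) and the whiskers (interval).
coef_table <- data.frame(
  term = rownames(estimates),
  estimate = estimates[, "Estimate"],
  conf.low = intervals[, 1],
  conf.high = intervals[, 2],
  p.value = estimates[, "Pr(>|t|)"],
  row.names = NULL
)
print(coef_table)
```

A coefficient whose interval excludes zero is significant at the corresponding level.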
# for reproducibility
set.seed(123)

# loading the needed libraries
library(yarrr)
library(ggstatsplot)

# using `ggstatsplot` to get call with statistical results
stats_results <-
  ggstatsplot::ggbetweenstats(
    data = ChickWeight,
    x = Time,
    y = weight,
    return = "subtitle",
    messages = FALSE
  )

# using `yarrr` to create plot
yarrr::pirateplot(
  formula = weight ~ Time,
  data = ChickWeight,
  theme = 1,
  main = stats_results
)
As shown, the figure is drawn with the yarrr package, while the stats_results shown as its title come from the ggstatsplot package.