skimr
The motivation for this project was to create a frictionless way to view summary statistics quickly as part of a pipeline. There are many existing summary functions, but we found each lacking in some way: they can be too generic, they don’t always return data structures that are easy to operate on, and they are not pipeable.
So at rOpenSci #unconf17, we set out to build a new package that would let you quickly skim useful, tidy summary statistics directly from a pipe.
And so we created skimr.
In a nutshell, skimr will create a skimr object that can either be operated on further or printed as a human-readable summary in the console. It presents reasonable default summary statistics for numerics, factors, etc., and lists counts along with missing and unique values.
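To make the idea concrete, here is a minimal base-R sketch of a type-aware, tidy summary. It is not skimr's implementation: the helpers `summarise_column()` and `tidy_summary()` are hypothetical names invented for this illustration, and the choice of statistics is an assumption. It only shows the shape of the approach: pick statistics by column type, and return a plain data frame that can be filtered or passed along.

```r
# Hypothetical helper (NOT part of skimr): summarise one column as long
# (variable, type, stat, value) rows, choosing statistics by column type.
summarise_column <- function(x, name) {
  # Statistics that make sense for every column type
  stats <- c(
    n       = length(x),
    missing = sum(is.na(x)),
    unique  = length(unique(x[!is.na(x)]))
  )
  # Numeric columns additionally get location and spread
  if (is.numeric(x)) {
    stats <- c(
      stats,
      mean   = mean(x, na.rm = TRUE),
      sd     = stats::sd(x, na.rm = TRUE),
      median = stats::median(x, na.rm = TRUE)
    )
  }
  data.frame(
    variable = name,
    type     = class(x)[1],
    stat     = names(stats),
    value    = unname(stats),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper (NOT part of skimr): apply the per-column summary
# to every column and stack the results into one tidy data frame.
tidy_summary <- function(df) {
  rows <- Map(summarise_column, df, names(df))
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

# Treat cyl as a factor so it gets counts rather than a mean
cars <- mtcars
cars$cyl <- factor(cars$cyl)

result <- tidy_summary(cars)

# The result is an ordinary data frame, so it can be operated on further
subset(result, variable == "cyl")
subset(result, variable == "mpg" & stat %in% c("mean", "sd"))
```

Because the data frame is the first argument, a function shaped like this can sit in a pipeline, for example `cars %>% tidy_summary()` with magrittr loaded.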
Amelia McNamara
Job Title: Visiting Assistant Professor of Statistical & Data Sciences at Smith College
Project Contributions: Coder
Eduardo Arino de la Rubia
Job Title: Chief Data Scientist at Domino Data Lab
Project Contributions: Coder
Hao Zhu
Job Title: Programmer Analyst at the Institute for Aging Research
Project Contributions: Coder
Julia Lowndes
Job Title: Marine Data Scientist at the National Center for Ecological Analysis and Synthesis
Project Contributions: Documentation and test scripts
Shannon Ellis
Job Title: Postdoctoral fellow in the Biostatistics Department at the Johns Hopkins Bloomberg School of Public Health
Project Contributions: Test Scripts
Elin Waring
Job Title: Professor at Lehman College Sociology Department, City University of New York
Project Contributions: Coder
Michael Quinn
Job Title: Quantitative Analyst at Google
Project Contributions: Coder
Hope McLeod
Job Title: Data Engineer at Kobalt Music
Project Contributions: Documentation
We started off by brainstorming what we liked about existing summary packages and what other features we wanted. We began by looking at an example dataset, mtcars.
str(mtcars)
summary(mtcars)

# "I like what we get here because mpg is numeric so these stats make sense:"
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   15.42   19.20   20.09   22.80   33.90

# "But I don’t like this because cyl should really be a factor and shouldn't have these stats:"
summary(mtcars$cyl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   6.000   6.188   8.000   8.000

# "This is OK, but not descriptive enough. It could be clearer what I'm looking at."
mosaic::tally(~cyl, data=mtcars) # install.packages('mosaic')
## cyl
##  4  6  8 
## 11  7 14

# "But this output isn't labeled, not ideal."
table(mtcars$cyl, mtcars$vs)
##    
##      0  1
##   4  1 10
##   6  3  4
##   8 14  0

# "I like this because it returns 'sd', 'n' and 'missing'":
mosaic::favstats(~mpg, data=mtcars)
##   min     Q1 median   Q3  max     mean       sd  n missing
##  10.4 15.425   19.2 22.8 33.9 20.09062 6.026948 32       0

skimr