Preamble
Over the past couple of years, I’ve had the privilege of advancing my R skills and picking up functions that should aid anyone using R for behavioral science. This list is not exhaustive; it is simply a collection of my most used functions, packages, and tips!
Pipe %>% Operator
The pipe operator, written as %>%, takes the output of one function and passes it into another function as an argument. This allows us to link a sequence of analysis steps. For a mathematical analogy, f(x) can be rewritten as x %>% f.
## compute the logarithm of `x`
x <- 1
log(x)
## [1] 0
## compute the logarithm of `x` with the pipe
x %>% log()
## [1] 0
Why is this useful though? R is a functional language, which means that your code often contains a lot of parentheses, ( and ). When you have complex code, this often means nesting those parentheses together, which makes your R code hard to read and understand.
# Initialize `x`
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)
# Compute the logarithm of `x`, return suitably lagged and
# iterated differences,
# compute the exponential function and round the result
round(exp(diff(log(x))), 1)
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
# The same computation as above, but with the pipe operator
x %>% log() %>%
diff() %>%
exp() %>%
round(1)
## [1] 3.3 1.8 1.6 0.5 0.3 0.1 48.8 1.1
In short, here are four reasons why you should be using pipes in R:
You’ll structure the sequence of your data operations from left to right, as opposed to from the inside out;
You’ll avoid nested function calls;
You’ll minimize the need for local variables and function definitions;
You’ll make it easy to add steps anywhere in the sequence of operations.
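As a concrete illustration of the third point, here is the same computation as above written with throwaway intermediate variables; the piped version expresses it without creating any of them (the variable names below are just for illustration).
# the same computation, written with intermediate variables instead of pipes
logged <- log(x)
differenced <- diff(logged)
exponentiated <- exp(differenced)
round(exponentiated, 1)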
dplyr package
By far my most used package in R is dplyr (see the documentation here). dplyr is part of the tidyverse collection of R packages for data science. At its core, there are six functions which I use (typically chained with the pipe operator %>%) for every single analysis:
mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.
group_by() allows for group operations in the “split-apply-combine” concept.
I’ll demonstrate below, using strictly dplyr functions, with the PlantGrowth dataset, which contains the results of an experiment on plant growth under three conditions, and mtcars, which contains fuel consumption and 10 aspects of automobile design and performance for 32 automobiles.
library(dplyr)
summary(PlantGrowth)
## weight group
## Min. :3.590 ctrl:10
## 1st Qu.:4.550 trt1:10
## Median :5.155 trt2:10
## Mean :5.073
## 3rd Qu.:5.530
## Max. :6.310
# calculate the average weight of the plants by condition
PlantGrowth %>%
group_by(group) %>%
summarise(mean_growth = mean(weight))
## # A tibble: 3 x 2
## group mean_growth
## <fct> <dbl>
## 1 ctrl 5.03
## 2 trt1 4.66
## 3 trt2 5.53
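summarise() isn’t limited to one statistic; as a quick sketch on the same data, several summaries can be computed per group in a single call (the new column names are just illustrative).
# mean, standard deviation, and sample size per condition
PlantGrowth %>%
  group_by(group) %>%
  summarise(mean_growth = mean(weight),
            sd_growth = sd(weight),
            n = n())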
# create a new weight column converted from lbs to kg (1 lb = 0.453 kg; wt is recorded in 1000s of lbs)
# filter 6-cylinder cars only
# keep mpg and weight in kg (model names are kept as row names)
# arrange the data from lightest to heaviest
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>%
mutate(wt_kg = (wt*1000)*0.453) %>%
filter(cyl == 6) %>%
select(mpg, wt_kg) %>%
arrange(wt_kg)
## mpg wt_kg
## Mazda RX4 21.0 1186.860
## Ferrari Dino 19.7 1254.810
## Mazda RX4 Wag 21.0 1302.375
## Hornet 4 Drive 21.4 1456.395
## Merc 280 19.2 1558.320
## Merc 280C 17.8 1558.320
## Valiant 18.1 1567.380
As you can see, with only a few lines of code, we can chain various cleaning commands together and produce a desirable output. I highly recommend the dplyr package for all data cleaning purposes. Here’s a very nice cheat sheet that you should bookmark.
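The same verbs combine in other useful ways; for example, group_by() followed by mutate() adds a per-group summary as a new column without collapsing rows. Here is a small sketch on mtcars (the new column names are just illustrative).
# compare each car's mpg to the average mpg for its cylinder count
mtcars %>%
  group_by(cyl) %>%
  mutate(mean_mpg_cyl = mean(mpg),
         mpg_vs_group = mpg - mean_mpg_cyl) %>%
  ungroup()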
tidy data
I believe most of the time spent doing data analysis is actually spent doing data cleaning. While data cleaning is usually the first step, it often must be repeated over the course of an analysis as new problems come to light or new data is collected. To this end, tidying data is a way to structure datasets to facilitate analysis.
A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation. A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units. An observation contains all values measured on the same unit (like a person, or a day, or a race) across attributes.
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:
Every column is a variable.
Every row is an observation.
Every cell is a single value.
While these are the main principles behind tidy data, there are a lot of nuances, and plenty of datasets break these rules. Practice is the best teacher here, and you’ll find that once you have assembled a tidy dataset, statistical analysis and visualization become considerably easier. I’ll provide two examples of non-tidy data, each followed by its tidied form.
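Before diving into the real examples below, here is a tiny, purely hypothetical illustration (the participants and scores are made up) of a dataset that stores a variable, week, in its column names, alongside its tidy form.
# a made-up untidy dataset: one column per week, so the column names hold data
untidy <- tibble::tribble(
  ~participant, ~week1, ~week2,
  "p01",            10,     12,
  "p02",             9,     11
)

# tidy form: every column is a variable, every row is one observation
untidy %>%
  tidyr::pivot_longer(c(week1, week2), names_to = "week", values_to = "score")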
# FIRST EXAMPLE
library(tidyr)   # provides pivot_longer() and the relig_income / billboard datasets
head(relig_income)
## # A tibble: 6 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t kn~ 15 14 15 11 10 35 21
## 6 Evangeli~ 575 869 1064 982 881 1486 949
## # ... with 3 more variables: $100-150k <dbl>, >150k <dbl>,
## # Don't know/refused <dbl>
# notice the column names, let's fix that
relig_income %>%
pivot_longer(-religion, names_to = "income", values_to = "frequency")
## # A tibble: 180 x 3
## religion income frequency
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don't know/refused 96
## # ... with 170 more rows
This dataset has three variables: religion, income, and frequency. To tidy it, we needed to pivot the non-variable columns into a two-column key-value pair. This action is often described as making a wide dataset longer.
When pivoting, we need to provide the names of the new key and value columns to create. After defining the columns to pivot (every column except for religion), the names_to argument gives the name of the key column, which is the variable defined by the values of the column headings; in this case, it’s income. The values_to argument gives the name of the value column, frequency.
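One payoff of this longer shape is that the dplyr verbs from earlier now apply directly; as a rough sketch (assuming dplyr and tidyr are loaded as above):
# once tidy, group-wise summaries become one-liners
relig_income %>%
  pivot_longer(-religion, names_to = "income", values_to = "frequency") %>%
  group_by(religion) %>%
  summarise(total = sum(frequency))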
# SECOND EXAMPLE
head(billboard)
## # A tibble: 6 x 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby Do~ 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The Har~ 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors~ Krypton~ 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors~ Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobble ~ 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give Me~ 2000-08-19 51 39 34 26 26 19 2 2
## # ... with 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, ...
The above dataset records the date a song first entered the Billboard top 100. It has variables for artist, track, date.entered, rank and week. The rank in each week after it enters the top 100 is recorded in 76 columns, wk1 to wk76. This form of storage is not tidy, but it is useful for data entry. It reduces duplication, since otherwise each song in each week would need its own row, and song metadata like title and artist would need to be repeated.
billboard %>%
pivot_longer(
wk1:wk76,
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
) %>%
mutate(week = as.integer(gsub("wk", "", week)),
date = as.Date(as.Date(date.entered) + 7 * (week - 1)),
date.entered = NULL)
## # A tibble: 5,307 x 5
## artist track week rank date
## <chr> <chr> <int> <dbl> <date>
## 1 2 Pac Baby Don't Cry (Keep... 1 87 2000-02-26
## 2 2 Pac Baby Don't Cry (Keep... 2 82 2000-03-04
## 3 2 Pac Baby Don't Cry (Keep... 3 72 2000-03-11
## 4 2 Pac Baby Don't Cry (Keep... 4 77 2000-03-18
## 5 2 Pac Baby Don't Cry (Keep... 5 87 2000-03-25
## 6 2 Pac Baby Don't Cry (Keep... 6 94 2000-04-01
## 7 2 Pac Baby Don't Cry (Keep... 7 99 2000-04-08
## 8 2Ge+her The Hardest Part Of ... 1 91 2000-09-02
## 9 2Ge+her The Hardest Part Of ... 2 87 2000-09-09
## 10 2Ge+her The Hardest Part Of ... 3 92 2000-09-16
## # ... with 5,297 more rows
To tidy this dataset, we first used pivot_longer() to make the dataset longer. We transformed the columns from wk1 to wk76, moving their names into a new week column and their values into a new rank column. Next, we used values_drop_na = TRUE to drop any missing values from the rank column. In this data, missing values represent weeks that the song wasn’t in the charts, so they can be safely dropped.
In this case it’s also nice to do a little cleaning: converting the week variable to a number and figuring out the date corresponding to each week on the charts.
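With the data in this shape, downstream questions become simple dplyr chains. As a sketch, assuming the tidied result is stored in a new object I’m calling billboard_tidy, each song’s best chart position could be found like this:
# store the tidied data, then find each song's peak chart position
billboard_tidy <- billboard %>%
  pivot_longer(wk1:wk76, names_to = "week", values_to = "rank",
               values_drop_na = TRUE)

billboard_tidy %>%
  group_by(artist, track) %>%
  summarise(peak_rank = min(rank), .groups = "drop") %>%
  arrange(peak_rank)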
These are just a couple of examples of how to tidy data; having worked with hundreds of datasets from different sources, I can say there will always be unique challenges that require creative thinking and patience.