Hands-on dplyr tutorial for faster data manipulation in R

Watch the follow-up tutorial: http://youtu.be/2mh1PqfsXVI
View the R Markdown document: http://rpubs.com/justmarkham/dplyr-tutorial
Download the source document: https://github.com/justmarkham/dplyr-tutorial
Read about why I love dplyr: http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/

dplyr is a new R package for data manipulation. Using a series of examples on a dataset you can download, this tutorial covers the five basic dplyr “verbs” as well as a dozen other dplyr functions.

Tutorial contents:
1. Introduction to dplyr (starts at 0:00)
2. Loading dplyr and the example dataset (starts at 2:29)
3. Understanding “local data frames” (starts at 3:23)
4. Verb #1: `filter` (starts at 5:17)
5. Verb #2: `select`, plus `contains`, `starts_with`, `ends_with`, `matches` (starts at 7:54)
6. Using chaining syntax for more readable code (starts at 9:34)
7. Verb #3: `arrange` (starts at 12:53)
8. Verb #4: `mutate` (starts at 13:55)
9. Verb #5: `summarise`, plus `group_by`, `summarise_each`, `n`, `n_distinct`, `tally` (starts at 15:31)
10. Window functions: `min_rank`, `top_n`, `lag` (starts at 26:47)
11. Convenience functions: `sample_n`, `sample_frac`, `glimpse` (starts at 32:44)
12. Connecting to databases (starts at 34:21)
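
The five verbs in the outline above can be sketched in one chained pipeline. This is a minimal illustration, not the code from the video: it assumes the built-in mtcars data frame in place of the tutorial's downloadable dataset, and the derived column name wt_kg is invented for the example.

```r
library(dplyr)

# Assumption: mtcars (shipped with R) stands in for the tutorial's dataset.
result <- mtcars %>%
  filter(cyl %in% c(4, 6)) %>%                # Verb 1: keep rows matching a condition
  select(mpg, cyl, wt, starts_with("d")) %>%  # Verb 2: keep columns by name or helper
  arrange(desc(mpg)) %>%                      # Verb 3: sort rows, highest mpg first
  mutate(wt_kg = wt * 453.592) %>%            # Verb 4: add a column (wt is in 1000 lbs)
  group_by(cyl) %>%                           # split into groups by cylinder count
  summarise(avg_mpg = mean(mpg),              # Verb 5: one summary row per group
            n_cars = n())

print(result)
```

Each step receives the result of the previous one through %>%, so the code reads top to bottom in the order the operations are applied.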

== RESOURCES ==

Reference manual and vignettes: http://cran.r-project.org/web/packages/dplyr/index.html
July 2014 webinar: http://pages.rstudio.net/Webinar-Series-Recording-Essential-Tools-for-R.html
July 2014 webinar code: https://github.com/rstudio/webinars/tree/master/2014-01
Tutorial by Hadley Wickham: https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a
GitHub repo: https://github.com/hadley/dplyr
List of releases: https://github.com/hadley/dplyr/releases

== LET’S CONNECT! ==

Blog: http://www.dataschool.io
Newsletter: http://www.dataschool.io/subscribe/
Twitter: https://twitter.com/justmarkham
GitHub: https://github.com/justmarkham

Comments

Rob van Mechelen says:

Excellent! Thank you

Jaded Hackneyed says:

Your tutorial is meticulous, clear, and useful for those who are used to the base R approach but feel a need to learn the dplyr package. This fairly short video helps me write R code in an efficient and convenient manner. Thanks.

oregono says:

So question on the group_by:

I ran the following on data from a referenced CSV file and it gives me aggregates, but they are all the exact same number 😐

exceldata %>%
group_by(Ability.I,Type.I) %>%
summarize(test1=mean(exceldata$HP,na.rm=TRUE))
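
A likely cause of the identical numbers, sketched on made-up data: inside summarize(), exceldata$HP refers to the entire column rather than the current group's rows, so every group receives the overall mean. Using the bare column name HP lets dplyr evaluate it per group. The column names come from the snippet above; the data values are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the CSV; the values are invented for illustration.
exceldata <- data.frame(
  Ability.I = c("Overgrow", "Overgrow", "Blaze", "Blaze"),
  Type.I    = c("Grass", "Grass", "Fire", "Fire"),
  HP        = c(45, 60, 39, NA)
)

# exceldata$HP is the whole column, so each group gets the same overall mean.
same_everywhere <- exceldata %>%
  group_by(Ability.I, Type.I) %>%
  summarize(test1 = mean(exceldata$HP, na.rm = TRUE))

# A bare column name is evaluated within each group.
per_group <- exceldata %>%
  group_by(Ability.I, Type.I) %>%
  summarize(test1 = mean(HP, na.rm = TRUE))

print(same_everywhere)
print(per_group)
```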

Punit kaur says:

38 mins well spent, thanks for an awesome tutorial!!!

Dheeru Kura says:

The explanation was awesome. It’s changed and improved my perception of RStudio.

Kumar Siddhartha says:

Hey, can you make a video on a reporting package (or multiple packages) that can display a database, or simply a table, however I want? For example, column names containing formulas and things like that. In short, I want a reporting package that lets me manipulate the table and its contents as much as possible before I report them. Thanks.

anto cdt says:

THKS !

Leore Lavin says:

Really clear and informative, thank you!

Jiawei Hugo Zhou says:

The video uses summarise_each(), which has been deprecated as of 2017. You can use the code below instead:

flights %>%
group_by(UniqueCarrier) %>%
summarise_at(.vars = c("Cancelled", "Diverted"), .funs = mean)

jordan Ndetcho says:

Very easy to understand, straight to the point and useful for beginners like me ^^
thank you very much !

AK2016 says:

Very good! Thank you.

John K says:

very helpful

Brenden Morley says:

Thank you... very detailed and informative

AliendaroN says:

Great Video, thank you a lot 😉

David Juarez says:

Was dplyr replaced by another package?

Bryan Wu says:

This is great! I really love the fact that you show the “base R” approach as a comparison. Looking forward to more vids.

Papercraftfreak3 says:

This tutorial is the most helpful resource I’ve found thus far for my Data Science project. Thank you so much for posting!!

HP says:

Thank you! Very clear and helpful.

Prasad Vittala says:

Hi, is it necessary to create a local data frame from the original data frame?

Robert Noble says:

Is there an AND/OR condition that you can use?
,|
Would that work?

I merged two huge data frames. Some of them have the same name for a variable even though they are different. They are H.x (hits for batting), H.y (Hits for pitch