Hands-on dplyr tutorial for faster data manipulation in R

Watch the follow-up tutorial: http://youtu.be/2mh1PqfsXVI
View the R Markdown document: http://rpubs.com/justmarkham/dplyr-tutorial
Download the source document: https://github.com/justmarkham/dplyr-tutorial
Read about why I love dplyr: http://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/

dplyr is a new R package for data manipulation. Using a series of examples on a dataset you can download, this tutorial covers the five basic dplyr “verbs” as well as a dozen other dplyr functions.

Tutorial contents:
1. Introduction to dplyr (starts at 0:00)
2. Loading dplyr and the example dataset (starts at 2:29)
3. Understanding “local data frames” (starts at 3:23)
4. Verb #1: `filter` (starts at 5:17)
5. Verb #2: `select`, plus `contains`, `starts_with`, `ends_with`, `matches` (starts at 7:54)
6. Using chaining syntax for more readable code (starts at 9:34)
7. Verb #3: `arrange` (starts at 12:53)
8. Verb #4: `mutate` (starts at 13:55)
9. Verb #5: `summarise`, plus `group_by`, `summarise_each`, `n`, `n_distinct`, `tally` (starts at 15:31)
10. Window functions: `min_rank`, `top_n`, `lag` (starts at 26:47)
11. Convenience functions: `sample_n`, `sample_frac`, `glimpse` (starts at 32:44)
12. Connecting to databases (starts at 34:21)
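
As a rough sketch of how the five verbs and the chaining syntax listed above fit together, the example below runs them on a small made-up data frame. The name `flights_toy`, its columns, and its values are hypothetical stand-ins, not the dataset used in the video.

```r
library(dplyr)

# Hypothetical toy data (assumption: not the tutorial's downloadable dataset)
flights_toy <- data.frame(
  carrier  = c("AA", "AA", "UA", "UA", "DL"),
  distance = c(200, 400, 300, 900, 500),   # miles
  air_time = c(40, 60, 50, 120, 75)        # minutes
)

result <- flights_toy %>%
  filter(distance > 250) %>%                     # verb 1: keep matching rows
  select(carrier, distance, air_time) %>%        # verb 2: keep named columns
  mutate(speed = distance / air_time * 60) %>%   # verb 4: add a column (mph)
  group_by(carrier) %>%                          # split rows by carrier
  summarise(avg_speed = mean(speed),             # verb 5: one row per group
            flights = n()) %>%
  arrange(desc(avg_speed))                       # verb 3: sort the rows

print(result)
```

Each step receives the output of the previous one through `%>%`, so the pipeline reads top to bottom instead of as nested function calls.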


Reference manual and vignettes: http://cran.r-project.org/web/packages/dplyr/index.html
July 2014 webinar: http://pages.rstudio.net/Webinar-Series-Recording-Essential-Tools-for-R.html
July 2014 webinar code: https://github.com/rstudio/webinars/tree/master/2014-01
Tutorial by Hadley Wickham: https://www.dropbox.com/sh/i8qnluwmuieicxc/AAAgt9tIKoIm7WZKIyK25lh6a
GitHub repo: https://github.com/hadley/dplyr
List of releases: https://github.com/hadley/dplyr/releases


Blog: http://www.dataschool.io
Newsletter: http://www.dataschool.io/subscribe/
Twitter: https://twitter.com/justmarkham
GitHub: https://github.com/justmarkham


Rob van Mechelen says:

Excellent! Thank you

Jaded Hackneyed says:

Your tutorial is meticulous, clear, and useful for those who are used to the base R approach but feel a need to learn the dplyr package. This short video helps me write R code in an efficient and convenient manner. Thanks.

oregono says:

So question on the group_by:

I ran the following on data read from the referenced CSV file, and it gives me aggregates, but they are all the exact same number 😐

exceldata %>%
group_by(Ability.I,Type.I) %>%

Punit kaur says:

38 mins well spent, thanks for an awesome tutorial!!!

Dheeru Kura says:

The explanation was awesome. It has changed and improved my perception of RStudio.

Kumar Siddhartha says:

Hey, can you make a video on a reporting package (or several packages) that can display a database, or simply a table, however I want? For example, column names containing formulas and things like that. In short, I want a reporting package that lets me manipulate the table and the contents I report as much as possible. Thanks.

Leore Lavin says:

Really clear and informative, thank you!

Jiawei Hugo Zhou says:

The summarise_each() function used in the video has been deprecated as of 2017. You can use the code below instead:

flights %>%
group_by(UniqueCarrier) %>%
summarise_at(.vars = c("Cancelled", "Diverted"), .funs = mean)
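
In dplyr 1.0.0 and later, the scoped variants such as summarise_at() are themselves superseded by across(). A minimal sketch of the same per-carrier means follows, assuming a recent dplyr and using a small made-up stand-in for the flights data (`flights_toy` and its values are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for the flights data used in the video
flights_toy <- data.frame(
  UniqueCarrier = c("AA", "AA", "UA", "UA"),
  Cancelled     = c(0, 1, 0, 0),
  Diverted      = c(0, 0, 1, 1)
)

# across() applies mean() to each selected column within every carrier group
carrier_means <- flights_toy %>%
  group_by(UniqueCarrier) %>%
  summarise(across(c(Cancelled, Diverted), mean))

print(carrier_means)
```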

jordan Ndetcho says:

Very easy to understand, straight to the point and useful for beginners like me ^^
thank you very much !

AK2016 says:

Very good! Thank you.

John K says:

very helpful

Brenden Morley says:

Thank you... very detailed and informative.

AliendaroN says:

Great Video, thank you a lot 😉

David Juarez says:

Was dplyr replaced by another package?

Bryan Wu says:

This is great! I really love the fact that you show the “base R” approach as a comparison. Looking forward to more vids.

Papercraftfreak3 says:

This tutorial is the most helpful resource I’ve found thus far for my Data Science project. Thank you so much for posting!!

HP says:

Thank you! Very clear and helpful.

Prasad Vittala says:

Hi, is it necessary to create a local data frame from the original data frame?

Robert Noble says:

Is there an AND/OR condition that you can use?
Would that work?

I merged two huge data frames. Some variables share a name even though they measure different things, for instance H.x (hits for batting) and H.y (hits for pitching). Other variables are exactly the same, for example Year.x (the year a player played as a hitter) and Year.y (the year a player pitched). When working with that huge data frame, I would not want to confuse hits off a pitcher with hits that a batter made, because they are two distinct variables. But for the years that a player played, I would not want to miss any rows (players) for any given year.

Would that be And, Or, And/Or (if that exists), or wouldn’t it matter? I think Or could work as a useful way, but is that the best choice? It seems as though it should matter. Thanks.

Nitin sethi says:

Need this dataset…

nakul menon says:

That was a well-spent 40 minutes. Very neat, precise, and easy to understand. You have my additional gratitude for comparing dplyr to the base R functions, which helped in visualizing why dplyr is practical. Thank you!

Amey n says:

Excellent and very easy to understand

Enzo C. says:

Great, helps a lot.. Thanks!

christopher guth says:

Awesome tutorial, man. Very helpful and easy to understand!

Ashok Anumandla says:

Great video on dplyr. Helps a lot for data manipulations

John Sheehan says:

Thank you for putting these tutorials together. They are FANTASTIC for the R newbie. And I particularly love that you have the R Markdown version that we can keep for reference.
