Introduction to Data Science with R – Data Analysis Part 1

Part 1 of an in-depth, hands-on tutorial introducing the viewer to Data Science with R programming. The video series provides end-to-end data science training, including data exploration, data wrangling, data analysis, data visualization, feature engineering, and machine learning. All source code from the videos is available on GitHub.

NOTE – The data for the competition has changed since this video series was started. You can find the applicable .CSVs in the GitHub repo.

Blog: http://daveondata.com
GitHub: https://github.com/EasyD/IntroToDataScience

I do Data Science training as a Bootcamp: https://goo.gl/OhIHSc

Comments

Free Music Downloader - All Music says:

Good Video

Pritom Kumar says:

Can I get the data “test” and “train”. I want to practice those codes.

Achin says:

i beg you i touch your feet please tell about following in the big data : Introduction to “R”, analyzing and exploring data with “R”, statistics for model building and evaluation.

Lakshmi Pusuluri Pusuluri says:

Hi I am new to data science and your explanation was very helpful for the beginners, Thank you very much!!!

I have one question: you added a new column to test and created a new data object. Why shouldn’t we add the new column to test itself? Is the reason not to change the original object? And which is the faster operation: adding a new column, or creating a new object with the new column added?

jinay gala says:

You Sir are one of the best instructors I have ever met. The detailed explanation of each concept clears all doubt as soon as it arrives. And probably because of your video I might switch into this career path. So essentially, thanks in advance.

Dilshan Ravisara says:

Trying to combine 2 data.frames in R with rbind. The line of code is

data.combined <- rbind(train, test.survived)

and I am getting an error:

Error in match.names(clabs, names(xi)) : names do not match previous names

How to fix this error ???
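Note on the rbind error above: for data frames, rbind matches columns by name, so it stops with this message when the two data frames do not have the same set of column names (for example `survived` vs. `Survived`, or no `Survived` column in the test set). A minimal sketch, assuming toy data rather than the real Kaggle files; the values below are made up for illustration:

```r
# Toy stand-ins for the Kaggle files (values are made up for illustration).
train <- data.frame(Survived = c(0, 1), Pclass = c(3, 1), Name = c("A", "B"),
                    stringsAsFactors = FALSE)
test <- data.frame(Pclass = c(2, 3), Name = c("C", "D"),
                   stringsAsFactors = FALSE)

# List the columns that exist in train but not in test before calling rbind().
only.in.train <- setdiff(names(train), names(test))

# Add the missing column using exactly the same name and capitalisation.
test.survived <- data.frame(Survived = rep("None", nrow(test)), test,
                            stringsAsFactors = FALSE)

# The name sets now match, so rbind() succeeds.
data.combined <- rbind(train, test.survived)
```

Running `names(train)` and `names(test.survived)` side by side on the real data is usually the quickest way to spot a capitalisation mismatch.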

anupam dewan says:

Tip1: In the last ggplot as the data to be plotted is restricted to 1:891 rows, the aesthetics parameter(x and fill) should also be restricted to 891 rows essentially.

Correct:
ggplot(data.combined[1:891,],aes(x=titles[1:891],fill=Survived[1:891]))+
geom_bar(width=0.4)+
facet_wrap(~Pclass)+
ggtitle("Pclass")+
xlab("Title")+
ylab("count")+
labs("title wise survivals and deaths")

Tip 2: All places where “name” has been used please take care of the case sensitivity as per new kaggle data.

@David Thanks alot for the video.:)
Luv you!

Nilay Mohit says:

I have just started with R, and while watching the video and practicing along, I ran into a doubt.

In the video, the syntax for duplicate names is as below:

dup.names <- data.combined[which(duplicated(data.combined$Name)),"Name"]

Instead of putting [ ] around the "which" command, I used ( ) and got this error:

> dup.names <- data.combined(which(duplicated(data.combined$Name)),"Name")
Error in data.combined(which(duplicated(data.combined$Name)), "Name") : could not find function "data.combined"

It made me more confused about when to use [ ]. Thanks in advance.
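Note on [ ] versus ( ): parentheses call a function, while square brackets index into an object such as a vector or data frame. The sketch below uses a small made-up data frame in place of the real `data.combined` to show both, and captures the error that the parenthesised form produces:

```r
# Toy data frame standing in for data.combined (values are illustrative).
data.combined <- data.frame(
  Name = c("Kelly, Mr. James", "Braund, Mr. Owen Harris", "Kelly, Mr. James"),
  Pclass = c(3, 3, 3),
  stringsAsFactors = FALSE
)

# Parentheses CALL a function: duplicated() and which() are functions.
dup.rows <- which(duplicated(data.combined$Name))

# Square brackets INDEX an object: data.combined is a data frame, not a function.
dup.names <- data.combined[dup.rows, "Name"]

# data.combined(...) asks R to call a function named data.combined. No such
# function exists, so R raises an error; tryCatch() keeps the message as text.
result <- tryCatch(data.combined(dup.rows, "Name"),
                   error = function(e) conditionMessage(e))
```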

Cecil Chams says:

hi david thanks for the video, i want to find out if you can help me apply the theil-sen equation in data analysis using R, i want to analyse landsat images and rainfall data to analyse the trends

Suraj Rasaq says:

WOW , This is the best Tutorial have seen in a long time, i have just been granted a place to study Master in Machine Learning and Data Mining, and i was looking for something to get me started before stumbling upon this stuff. Can you please do more stuff like this on Algorithms? This is great thanks a lot.

Xuwen Shen says:

Thanks so much! But I don’t see any predictions of the classification. The video only uses the train data to plot histograms.

Madeline Yau says:

I’m trying to get R but I am not able to on my Mac. I’m getting this error –> Unable to locate R binary by scanning standard locations.

Chuck Becker says:

Way, way too much about machine learning and way, way too little about what R is, how it handles data, and how to do that. This video is really, truly a waste of time for someone who wants to get to the destination of becoming proficient at analyzing data using R. Quit the philosophizing, get to the point … the point … the point!

Oluwasanmi Adenaiye says:

I love you.

Gopinath Puthumana says:

i am getting an error as below.

Error: Mapping must be created by `aes()` or `aes_()`

after running the below code , can any one help pls..

ggplot(train,aes(x = Pclass,fill=factor(Survived)))+
geom_histogram((width=0.5))+
xlab("Pclass")+ylab("Total count")+
labs(fill ="survived")
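Note on the error above: one likely cause is the doubled parentheses in `geom_histogram((width=0.5))`. With the extra pair, `width=0.5` is no longer a named argument; it becomes an expression whose value (0.5) is passed as the first positional argument, and for ggplot2 geoms the first argument is `mapping`. The sketch below shows the same effect in base R with `fake_geom`, a hypothetical stand-in function that is not part of ggplot2:

```r
# fake_geom is a hypothetical stand-in that mimics the argument order of a
# ggplot2 geom: 'mapping' comes first, followed by other named arguments.
fake_geom <- function(mapping = NULL, width = NULL) {
  list(mapping = mapping, width = width)
}

# Named argument: width receives 0.5 and mapping stays NULL.
ok <- fake_geom(width = 0.5)

# Extra parentheses: the value 0.5 is passed by position and lands in 'mapping'.
bad <- fake_geom((width = 0.5))
```

Writing `geom_histogram(width = 0.5)` with a single pair of parentheses should avoid the problem.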

Elisabeth Bird says:

Your common names, James Kelly and Kate Connelly, are Irish names originally. Ireland had extensive migrations, especially during the potato famine in the 1840s, so yes, they are also now common in the US and UK.

nitindec06 says:

This is philanthropy in the field of education. Thanks for sharing your knowledge. I have gone through quite a few videos but YOUR WORK IS SIMPLY OUTSTANDING. Thank you.

blazni says:

Hi, I have a sort of an unusual problem – When making a vector of repeated names using the duplicated function, I get an empty vector; it will not find any duplicate names. I made sure the code was spelled right, even copied it from the CSV file, and also made sure the capitalizations were correct. My data.combined has 1309 observations, and the length of my vector of unique names is 1307, so there are indeed two names repeated (I found them manually in data.combined – they are the same as in the video). I have no idea what to try next. Help would be greatly appreciated. Thank you

Sidharth Moorthy says:

Why do we change the code for figuring out if the same pattern exists for males at 1:09:19?

I tried using the same code as that for Mrs. and Master for Mr. but that returned a table which also consisted of female data as in
> mr <- data.combined[which(str_detect(data.combined$Sex, "male")),]
> mr[1:5,]
PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
Name Sex
1 Braund, Mr. Owen Harris male
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female
3 Heikkinen, Miss. Laina female
4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
5 Allen, Mr. William Henry male
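Note on the output above: pattern matching looks for the text anywhere in the string, and "female" contains "male", so a substring match on Sex selects every passenger. stringr's `str_detect` behaves like base R's `grepl` in this respect. A small base R sketch using the five Sex values from the printed rows:

```r
sex <- c("male", "female", "female", "female", "male")

# Substring matching: "female" contains "male", so every element matches.
substring.match <- grepl("male", sex)

# Exact comparison selects only the men.
exact.match <- sex == "male"

# An anchored regular expression also works if pattern matching is preferred.
anchored.match <- grepl("^male$", sex)
```

This is a plausible reason for switching to an equality test such as `Sex == "male"` for the male passengers.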

M JIANG says:

excellent tutorial video. very clear very detailed with backup documents learn a lot!

Lynette Cornell says:

I found this video while searching for info on R and was so happy to see how well it is done. Really enjoyed the step-by-step approach along with explanations for each component. I am tech-savvy but R is new to me, and this made the program very approachable. Thanks for such a great video! Also, your tone of voice and personality are perfect for these videos!

David Laguardia says:

I’ve got an error in the last graph:

Error in geom_bar(binwidth = 0.5) + facet_wrap(~Pclass) :
non-numeric argument to binary operator
In addition: Warning message:
`geom_bar()` no longer has a `binwidth` parameter. Please use `geom_histogram()` instead.

Maxwell Redeker says:

Just one quick question…I got to the end of the video, and after putting in the code to graph survival by title,
ggplot(data.combine [1:891,], aes(x=title, fill=Survived))+
geom_histogram(stat_count=0.5)+
facet_wrap(~Pclass)+
ggtitle("Pclass")+
xlab("title")+
ylab("Total Count")+
labs(fill="Survived")

I got the error saying, StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat=”count”?

So i Added stat=”count” after my xlab, and I got the error saying
Error in “count” + ylab(“Total Count”) :
non-numeric argument to binary operator

Please help me fix this
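Note on the errors above: "non-numeric argument to binary operator" usually appears when `+` ends up joining things that are not a plot and a layer, for example a bare `stat="count"` on its own line, or layers that follow a `ggplot(...)` line which does not end in `+`. In current ggplot2, `geom_bar` counts discrete values by default and takes `width` rather than `binwidth`. The sketch below assumes the ggplot2 package is installed and uses a small made-up data frame instead of the tutorial's data:

```r
library(ggplot2)

# Toy data (made up for illustration); title, Pclass and Survived are factors.
passengers <- data.frame(
  title = factor(c("Mr.", "Mrs.", "Miss.", "Mr.", "Master.", "Mr.")),
  Pclass = factor(c(1, 1, 2, 3, 3, 3)),
  Survived = factor(c(0, 1, 1, 0, 1, 0))
)

# Every line except the last ends in "+", and options such as width go
# inside the geom call instead of standing on a line of their own.
p <- ggplot(passengers, aes(x = title, fill = Survived)) +
  geom_bar(width = 0.5) +
  facet_wrap(~Pclass) +
  ggtitle("Pclass") +
  xlab("Title") +
  ylab("Total Count") +
  labs(fill = "Survived")
```

Calling `print(p)` draws the chart.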

B Doogii says:

Hi Dear. Can you give me that file? I tried it so many times. But it doesn’t work. Please answer me 🙂

MUNIKUMAR N M says:

Hello David..
I’m following your method for TitanicDataAnalysis and i’m getting below error, can you please help me to solve this.
data.combined <- rbind(train,test.survived)
Error in match.names(clabs, names(xi)) : names do not match previous names

Thanks

eurasian :) says:

When I run the “data.combined[which(data.combined$Name %in% dup.names),]” part I just get:

# A tibble: 0 x 12
# … with 12 variables: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp,
#   Parch, Ticket, Fare, Cabin, Embarked

Anyone know what’s gone wrong?
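Note on the empty result above: a possible explanation, assuming the CSV files were loaded as tibbles (for example with readr's `read_csv`), is that `tbl[rows, "Name"]` stays a one-column tibble instead of becoming a character vector, so `%in%` ends up comparing each name against the column as a whole and finds nothing. The sketch below imitates that in base R with `drop = FALSE`; the data is made up:

```r
# Toy data (illustrative names) standing in for data.combined.
passengers <- data.frame(
  Name = c("Kelly, Mr. James", "Connolly, Miss. Kate", "Braund, Mr. Owen Harris",
           "Kelly, Mr. James", "Connolly, Miss. Kate"),
  stringsAsFactors = FALSE
)

# drop = FALSE keeps a one-column data frame, which is what a tibble always does.
dup.as.frame <- passengers[which(duplicated(passengers$Name)), "Name", drop = FALSE]

# Pulling the column out with $ gives a plain character vector instead.
dup.names <- passengers$Name[duplicated(passengers$Name)]

# %in% against the character vector finds all four rows with repeated names.
rows.found <- passengers[passengers$Name %in% dup.names, , drop = FALSE]
```

If this is the cause, building `dup.names` from `data.combined$Name` as above, or converting the data with `as.data.frame()` after loading, should bring the rows back.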

Vasishta Polisetty says:

instead of grep can’t we just use str_detect?

Sanket Gadhave says:

Gone through all the video…only one word -> awesome
Thanks for sharing the knowledge.

Haider Imam says:

I keep getting the following error when I try combining train and test.survived “Error in match.names(clabs, names(xi)) :
names do not match previous names”
do you know why that would be ?

this is what I am writing “data.combined <- rbind(train, test.survived)"

Aashna Mohapatra says:

while writing the data.combined line, the console is showing an error of “names donot match the previous names” ,though i have written the names properly

Michael Rupp says:

Thanks for all your hard work on this. This is my first R tutorial.

21 52 says:

lol @ 14 yo mrs

Bryce Max says:

Best R intro I’ve seen to date. Thank you, David!

Saransh Rana says:

Error in aes(x = Pclass, fill = factor(Survived)) + geom_bar(width = 0.5) :
non-numeric argument to binary operator can any one help with this?

RichArts says:

Hi there, does anyone know how to get rid of the error: Mapping must be created using an aes () or aes_(). I typed everything as showed in the video…

And: Video is great, nice explanation, thanks a lot!!

ferenc feher says:

So, maybe this doesnt matter too much but I got a little confused when he was looking at titles, specifically males.

males <- data.combined[which(train$Sex == "male"),]

Why does he initially reference the data.combined dataframe and then reference the train dataframe? What effect does this have, and why did he break from referencing data.combined both times like he did with the female titles?

misses <- data.combined[which(str_detect(data.combined$Name, "Miss.")),]
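Note on mixing the two data frames: this works under the assumption that `data.combined` was built by stacking train on top of the test set, so its first rows are the rows of train in the same order. Row positions computed on train then point at the same passengers in `data.combined`, and the result is limited to training rows. A sketch with made-up data:

```r
# Toy data (made up); data.combined is train stacked on top of test.
train <- data.frame(Name = c("A", "B", "C"), Sex = c("male", "female", "male"),
                    stringsAsFactors = FALSE)
test <- data.frame(Name = c("D", "E"), Sex = c("male", "female"),
                   stringsAsFactors = FALSE)
data.combined <- rbind(train, test)

# Positions computed on train (1 and 3) select the same rows of data.combined,
# so only training-set males are returned.
males.train <- data.combined[which(train$Sex == "male"), ]

# Computing the condition on data.combined also picks up the test-set male.
males.all <- data.combined[which(data.combined$Sex == "male"), ]
```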

Nitesh Shelole says:

Thanks, David. Wonderful session. its really help me to understand R programming

Sidharth Moorthy says:

Also when we run the code to obtain the final plot for the Pclass, my plot doesn’t show the Mrs. bar. What could be the reason behind this?

Scott Wall says:

I’m running R Version 3.4.0 so maybe it’s an update issue but…When I run:
data.combined[which(data.combined$Name %in% dup.names),]
I get:
# A tibble: 0 × 12
# … with 12 variables: PassengerId, Survived, Pclass, Name, Sex,
#   Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
Does anyone know how to correct this?

Sunil Kumar says:

i know i am late to put a query on this video, but i spend lots of time to correct my code for loop: can anyone help the same….i run the loop code as mentioned in video to for Titles and found below error related “i in 1:nrow” = Error in 1:nrow(data.combined) : argument of length 0

> extractTitle <- function(name){
+ name <- as.character(name)
+
+ if(length(grep("Miss.", name)) > 0){
+ return("Miss.")
+ }else if (length(grep("Mr.", name)) > 0){
+ return("Mr.")
+ }else if (length(grep("Mrs.", name)) >0){
+ return("Mrs.")
+ }else if (length(grep("Master.", name)) >0){
+ retrun("Master.")
+ }else{
+ return("Other")
+ }
+ }
>
> titles <- NULL
> for(i in 1 : nrow(data.combined)) {
+ titles <- c(titles, extractTitle(data.combined[i, "name"]))
+ }
Error in 1:nrow(data.combined) : argument of length 0
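Note on the session above: `argument of length 0` in `1:nrow(data.combined)` means `nrow(data.combined)` returned NULL, which happens when `data.combined` is not a data frame at that point (for example NULL or a plain vector), so it is worth re-running the line that creates it. Separately, `retrun` is a typo for `return`; the pattern "Mr." is a regular expression in which `.` matches any character, so it also matches "Mrs."; and the column is called `Name` in the current Kaggle files. The sketch below is one way to write the function, not the video's version: it uses `fixed = TRUE`, which is my own choice, and runs on a few illustrative names:

```r
extractTitle <- function(name) {
  name <- as.character(name)
  # fixed = TRUE treats the pattern as literal text, so "Mr." cannot match "Mrs.".
  if (grepl("Miss.", name, fixed = TRUE)) {
    "Miss."
  } else if (grepl("Master.", name, fixed = TRUE)) {
    "Master."
  } else if (grepl("Mrs.", name, fixed = TRUE)) {
    "Mrs."
  } else if (grepl("Mr.", name, fixed = TRUE)) {
    "Mr."
  } else {
    "Other"
  }
}

# Toy data frame with illustrative names; note the capitalised column "Name".
data.combined <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
           "Heikkinen, Miss. Laina",
           "Palsson, Master. Gosta Leonard",
           "Uruchurtu, Don. Manuel E"),
  stringsAsFactors = FALSE
)

# vapply() applies the function to each name and returns a character vector.
titles <- vapply(data.combined$Name, extractTitle, character(1), USE.NAMES = FALSE)
```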

David Jackson says:

Hey David, When I added "sex" and ran rf.8.preds <- predict(rf.8, test.submit.df) I get the following error:

Error in predict.randomForest(rf.8, test.submit.df) : variables in the training data missing in newdata

======================================================================================
### Add Sex DJJ #######
# Train a Random Forest using pclass, title, parch, & family.size
rf.train.8 <- data.combined[1:891, c("pclass", "title", "sibsp", "family.size","sex")]

set.seed(1234)
rf.8 <- randomForest(x = rf.train.8, y = rf.label, importance = TRUE, ntree = 1000)
rf.8
varImpPlot(rf.8)

#==============================================================================
#
# Video #5
#
#==============================================================================

# Before we jump into features engineering we need to establish a methodology
# for estimating our error rate on the test set (i.e., unseen data). This is
# critical, for without this we are more likely to overfit. Let's start with a
# submission of rf.5 to Kaggle to see if our OOB error estimate is accurate.

# Subset our test records and features
test.submit.df <- data.combined[892:1309, c("pclass", "title", "family.size","sex")]

# Make predictions
rf.8.preds <- predict(rf.8, test.submit.df)
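Note on the predict error above: the training subset includes "sibsp" but the test subset does not, and the model needs every training column to be present in the new data. A quick check with `setdiff` on the two column lists from the comment shows the gap. The sketch below works on the column names only, so it runs without the randomForest package:

```r
# Column lists copied from the comment above.
train.features <- c("pclass", "title", "sibsp", "family.size", "sex")
test.features <- c("pclass", "title", "family.size", "sex")

# Columns used for training that are absent from the data passed to predict().
missing.in.test <- setdiff(train.features, test.features)

# Defining the feature list once and reusing it for both subsets avoids drift.
features <- train.features
still.missing <- setdiff(features, features)
```

Using the same `features` vector in both `data.combined[1:891, features]` and `data.combined[892:1309, features]` keeps the two subsets aligned.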
