Part 1 of an in-depth, hands-on tutorial introducing the viewer to Data Science with R programming. The video provides end-to-end data science training, including data exploration, data wrangling, data analysis, data visualization, feature engineering, and machine learning. All source code from the videos is available on GitHub.
NOTE – The data for the competition has changed since this video series was started. You can find the applicable .CSVs in the GitHub repo.
Blog: http://daveondata.com
GitHub: https://github.com/EasyD/IntroToDataScience
I do Data Science training as a Bootcamp: https://goo.gl/OhIHSc
Good Video
Can I get the data “test” and “train”. I want to practice those codes.
i beg you i touch your feet please tell about following in the big data : Introduction to “R”, analyzing and exploring data with “R”, statistics for model building and evaluation.
Hi I am new to data science and your explanation was very helpful for the beginners, Thank you very much!!!
I have one question: you added a new column to test by creating a new data object. Why shouldn't we add the new column to test itself? Is the reason to avoid changing the original object? And which operation is faster, adding a new column or creating a new object with the new column added?
You Sir are one of the best instructors I have ever met. The detailed explanation of each concept clears all doubt as soon as it arrives. And probably because of your video I might switch into this career path. So essentially, thanks in advance.
Trying to combine 2 data frames in R with rbind. The line of code is
data.combined <- rbind(train, test.survived)
and I am getting this error:
Error in match.names(clabs, names(xi)) : names do not match previous names
How do I fix this error?
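A note on the rbind error above: for data frames, rbind matches columns by name, so both frames need the same column names, including letter case. The sketch below uses two small made-up data frames (not the Kaggle files) to show one way to spot the mismatch and line the names up. The column names and values are assumptions for illustration only, and the way the Survived column is added only approximates what the video does.

```r
# Toy stand-ins for train and test (hypothetical values, not the Kaggle data).
train <- data.frame(Survived = c(0, 1), Pclass = c(3, 1), Name = c("A", "B"),
                    stringsAsFactors = FALSE)
test <- data.frame(pclass = c(2, 3), name = c("C", "D"),
                   stringsAsFactors = FALSE)

# Add the label column that the test set lacks.
test.survived <- data.frame(Survived = rep("None", nrow(test)), test,
                            stringsAsFactors = FALSE)

# Columns of train that have no same-named partner in test.survived.
mismatch <- setdiff(names(train), names(test.survived))   # "Pclass" "Name"

# rbind(train, test.survived) would fail at this point because the names differ.
# One fix, valid only when the columns are already in the same order:
names(test.survived) <- names(train)
data.combined <- rbind(train, test.survived)
```

If setdiff returns an empty result in both directions, the names already agree and the error is likely coming from somewhere else.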
Tip1: In the last ggplot as the data to be plotted is restricted to 1:891 rows, the aesthetics parameter(x and fill) should also be restricted to 891 rows essentially.
Correct: ggplot(data.combined[1:891,],aes(x=titles[1:891],fill=Survived[1:891]))+
geom_bar(width=0.4)+
facet_wrap(~Pclass)+
ggtitle("Pclass")+
xlab("Title")+
ylab("count")+
labs("title wise survivals and deaths")
Tip 2: All places where “name” has been used please take care of the case sensitivity as per new kaggle data.
@David Thanks alot for the video.:)
Luv you!
I have just started with R, and while watching the video and practicing along I ran into a doubt.
In the video, the syntax for duplicate names is as below:
dup.names <- data.combined[which(duplicated(data.combined$Name)),"Name"]
Instead of putting [ ] around the "which" command, I used ( ) and got this error:
> dup.names <- data.combined(which(duplicated(data.combined$Name)),"Name")
Error in data.combined(which(duplicated(data.combined$Name)), "Name") : could not find function "data.combined"
It made me more confused about when to use [ ]. Thanks in advance.
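On the [ ] versus ( ) question: parentheses call a function, while square brackets pick rows and columns out of an object. A minimal sketch on a made-up data frame (the values are hypothetical):

```r
# Toy data frame with one repeated name (hypothetical values).
data.combined <- data.frame(Name = c("Kelly", "Smith", "Kelly"),
                            Age = c(34, 20, 44),
                            stringsAsFactors = FALSE)

# Parentheses call functions: duplicated() and which() are both functions.
dup.rows <- which(duplicated(data.combined$Name))   # 3

# Square brackets index an object: data.combined[rows, columns].
dup.names <- data.combined[dup.rows, "Name"]        # "Kelly"

# data.combined(...) would ask R to call a function named data.combined.
# No such function exists, hence "could not find function".
```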
hi david thanks for the video, i want to find out if you can help me apply the theil-sen equation in data analysis using R, i want to analyse landsat images and rainfall data to analyse the trends
WOW , This is the best Tutorial have seen in a long time, i have just been granted a place to study Master in Machine Learning and Data Mining, and i was looking for something to get me started before stumbling upon this stuff. Can you please do more stuff like this on Algorithms? This is great thanks a lot.
Thanks so much! But I don’t see any predictions of the classification. The video only uses the train data to plot histograms.
I’m trying to get R but I am not able to on my Mac. I’m getting this error –> Unable to locate R binary by scanning standard locations.
Way, way too much about machine learning and way, way too little about what R is, how it handles data, and how to do that. This video is really, truly a waste of time for someone who wants to get to the destination of becoming proficient at analyzing data using R. Quit the philosophizing, get to the point … the point … the point!
I love you.
i am getting an error as below.
Error: Mapping must be created by `aes()` or `aes_()`
after running the below code , can any one help pls..
ggplot(train,aes(x = Pclass,fill=factor(Survived)))+
geom_histogram((width=0.5))+
xlab("Pclass")+ylab("Total count")+
labs(fill ="survived")
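A likely cause of the "Mapping must be created by aes()" error in the code above is the doubled parentheses in geom_histogram((width=0.5)). With the extra pair, R evaluates width=0.5 as an ordinary expression and passes the value 0.5 as the first unnamed argument, which for ggplot2 geoms is mapping. The sketch below shows that behaviour with a hypothetical stand-in function, then builds a corrected plot on toy data. Swapping in geom_bar on a factor is an assumption about what suits current ggplot2 versions, not the code from the video.

```r
# Hypothetical stand-in whose first argument is `mapping`, as in ggplot2 geoms.
show_args <- function(mapping = NULL, width = NULL) {
  list(mapping = mapping, width = width)
}

named   <- show_args(width = 0.5)    # matched by name: width is 0.5
wrapped <- show_args((width = 0.5))  # extra parentheses: 0.5 lands in mapping

# Corrected plot on toy data, built only if ggplot2 is installed.
train <- data.frame(Pclass = c(1, 2, 3, 3), Survived = c(1, 0, 0, 1))
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(train, aes(x = factor(Pclass), fill = factor(Survived))) +
    geom_bar(width = 0.5) +
    xlab("Pclass") + ylab("Total count") +
    labs(fill = "Survived")
}
```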
Your common names, James Kelly and Kate Connelly, are originally Irish names. Ireland saw extensive emigration, especially during the potato famine in the 1840s, so yes, they are now also common in the US and UK.
This is philanthropy in the field of education. Thanks for sharing your knowledge. I have gone through quite a few videos but YOUR WORK IS SIMPLY OUTSTANDING. Thank you.
Hi, I have a sort of unusual problem. When making a vector of repeated names using the duplicated function, I get an empty vector; it will not find any duplicate names. I made sure the code was spelled right, even copied it from the CSV file, and also made sure the capitalizations were correct. My data.combined has 1309 observations, and the length of my vector of unique names is 1307, so there are indeed two names repeated (I found them manually in data.combined; they are the same as in the video). I have no idea what to try next. Help would be greatly appreciated. Thank you
Why do we change the code for figuring out if the same pattern exists for males at 1:09:19?
I tried using the same code as that for Mrs. and Master for Mr. but that returned a table which also consisted of female data as in
> mr <- data.combined[which(str_detect(data.combined$Sex, "male")),]
> mr[1:5,]
PassengerId Survived Pclass
1 1 0 3
2 2 1 1
3 3 1 3
4 4 1 1
5 5 0 3
Name Sex
1 Braund, Mr. Owen Harris male
2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female
3 Heikkinen, Miss. Laina female
4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female
5 Allen, Mr. William Henry male
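One probable reason the output above still contains female rows: pattern matching looks for "male" anywhere in the string, and "female" contains "male". The sketch below uses base R's grepl as a stand-in for str_detect so it needs no packages. An exact comparison with == avoids the problem.

```r
sex <- c("male", "female", "female", "male")

# Substring match: every value contains "male".
substring.match <- grepl("male", sex)     # TRUE TRUE TRUE TRUE

# Exact comparison keeps only the males.
exact.match <- sex == "male"              # TRUE FALSE FALSE TRUE

# An anchored pattern is another option if pattern matching is preferred.
anchored.match <- grepl("^male$", sex)    # TRUE FALSE FALSE TRUE
```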
excellent tutorial video. very clear very detailed with backup documents learn a lot!
I found this video while searching for info on R and was so happy to see how well it is done. Really enjoyed the step-by-step approach along with explanations for each component. I am tech-savvy but R is new to me, and this made the program very approachable. Thanks for such a great video! Also, your tone of voice and personality are perfect for these videos!
I’ve got an error in the last graph:
Error in geom_bar(binwidth = 0.5) + facet_wrap(~Pclass) :
non-numeric argument to binary operator
In addition: Warning message:
`geom_bar()` no longer has a `binwidth` parameter. Please use `geom_histogram()` instead.
Just one quick question…I got to the end of the video, and after putting in the code to graph survival by title,
ggplot(data.combine [1:891,], aes(x=title, fill=Survived))+
geom_histogram(stat_count=0.5)+
facet_wrap(~Pclass)+
ggtitle(“Pclass”)+
xlab(“title”)+
ylab(“Total Count”)+
labs(fill=”Survived”)
I got the error saying, StatBin requires a continuous x variable the x variable is discrete. Perhaps you want stat=”count”?
So i Added stat=”count” after my xlab, and I got the error saying
Error in “count” + ylab(“Total Count”) :
non-numeric argument to binary operator
Please help me fix this
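About the "non-numeric argument to binary operator" messages in these comments: they tend to appear when + ends up joining something that is not a plot and a layer, for example a bare "count" string placed in the chain. stat = "count" is an argument of the geom, not a separate step. A hedged sketch on made-up data follows. It uses geom_bar, which counts discrete values by default, rather than the geom_histogram call from the comment.

```r
# Adding to a bare string triggers the same kind of error as in the comment.
msg <- tryCatch("count" + 1, error = function(e) conditionMessage(e))

# Toy data (hypothetical values) standing in for data.combined[1:891, ].
titles.df <- data.frame(title = c("Mr.", "Mrs.", "Mr.", "Miss."),
                        Survived = factor(c(0, 1, 0, 1)),
                        Pclass = c(3, 1, 3, 2))

# Built only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(titles.df, aes(x = title, fill = Survived)) +
    geom_bar(stat = "count", width = 0.5) +   # stat goes inside the geom
    facet_wrap(~Pclass) +
    ggtitle("Pclass") +
    xlab("Title") +
    ylab("Total Count") +
    labs(fill = "Survived")
}
```

Each line except the last ends with +, so R knows the expression continues; a line that starts a fresh expression with geom_bar(...) + ... produces the same error.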
Hi Dear. Can you give me that file? I tried it so many times. But it doesn’t work. Please answer me 🙂
Hello David..
I’m following your method for TitanicDataAnalysis and i’m getting below error, can you please help me to solve this.
data.combined <- rbind(train,test.survived)
Error in match.names(clabs, names(xi)) : names do not match previous names
Thanks
When I run the "data.combined[which(data.combined$Name %in% dup.names),]" part I just get:
# A tibble: 0 x 12
# … with 12 variables: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
Anyone know what's gone wrong?
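A possible explanation for the empty result, assuming the data was read in as a tibble (the output says "A tibble"): indexing a tibble with [rows, "Name"] keeps a one-column table instead of returning a character vector, so dup.names is not a vector of names and %in% finds nothing. The sketch below mimics that in base R with drop = FALSE on made-up rows, then shows one way to pull the names out as a plain vector.

```r
# Toy data with two repeated names (hypothetical rows).
df <- data.frame(Name = c("Kelly, Mr. James", "Connolly, Miss. Kate",
                          "Braund, Mr. Owen Harris",
                          "Kelly, Mr. James", "Connolly, Miss. Kate"),
                 stringsAsFactors = FALSE)

# drop = FALSE keeps a one-column data frame, imitating what a tibble returns.
dup.as.frame <- df[which(duplicated(df$Name)), "Name", drop = FALSE]
n.frame <- sum(df$Name %in% dup.as.frame)   # compares against a column, not names

# Extracting the column with $ gives a plain character vector.
dup.names <- df$Name[duplicated(df$Name)]
n.vector <- sum(df$Name %in% dup.names)     # 4
```

With the real data, wrapping the data in as.data.frame() after reading it is another option.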
instead of grep can’t we just use str_detect?
Gone through all the video…only one word -> awesome
Thanks for sharing the knowledge.
I keep getting the following error when I try combining train and test.survived “Error in match.names(clabs, names(xi)) :
names do not match previous names”
do you know why that would be ?
this is what I am writing “data.combined <- rbind(train, test.survived)"
While writing the data.combined line, the console shows the error "names do not match previous names", though I have written the names properly.
Thanks for all your hard work on this. This is my first R tutorial.
lol @ 14 yo mrs
Best R intro I’ve seen to date. Thank you, David!
Error in aes(x = Pclass, fill = factor(Survived)) + geom_bar(width = 0.5) :
non-numeric argument to binary operator can any one help with this?
Hi there, does anyone know how to get rid of the error: Mapping must be created using an aes () or aes_(). I typed everything as showed in the video…
And: Video is great, nice explanation, thanks a lot!!
So, maybe this doesnt matter too much but I got a little confused when he was looking at titles, specifically males.
males <- data.combined[which(train$Sex == "male"),]
Why does he initially reference the data.combined dataframe and then reference the train dataframe? What effect does this have, and why did he break from referencing data.combined both times like he did with the female titles?
misses <- data.combined[which(str_detect(data.combined$Name, "Miss.")),]
Thanks, David. Wonderful session. its really help me to understand R programming
Also when we run the code to obtain the final plot for the Pclass, my plot doesn’t show the Mrs. bar. What could be the reason behind this?
I'm running R version 3.4.0 so maybe it's an update issue but... When I run:
data.combined[which(data.combined$Name %in% dup.names),]
I get:
# A tibble: 0 × 12
# … with 12 variables: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked
Does anyone know how to correct this?
i know i am late to put a query on this video, but i spend lots of time to correct my code for loop: can anyone help the same….i run the loop code as mentioned in video to for Titles and found below error related “i in 1:nrow” = Error in 1:nrow(data.combined) : argument of length 0
> extractTitle <- function(name){
+ name <- as.character(name)
+
+ if(length(grep("Miss.", name)) > 0){
+ return("Miss.")
+ }else if (length(grep("Mr.", name)) > 0){
+ return("Mr.")
+ }else if (length(grep("Mrs.", name)) >0){
+ return("Mrs.")
+ }else if (length(grep("Master.", name)) >0){
+ retrun("Master.")
+ }else{
+ return("Other")
+ }
+ }
>
> titles <- NULL
> for(i in 1 : nrow(data.combined)) {
+ titles <- c(titles, extractTitle(data.combined[i, "name"]))
+ }
Error in 1:nrow(data.combined) : argument of length 0
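Some things worth checking in the code above, offered as a sketch rather than the video's exact code. "argument of length 0" means nrow(data.combined) returned NULL, which happens when data.combined is not a data frame at that point (for example, it was never created or was overwritten). Separately, retrun is a typo for return, the column may be "Name" rather than "name" depending on the CSV version, and the pattern "Mr." also matches "Mrs." because . is a regex wildcard. The version below uses fixed = TRUE for literal matching and swaps the for loop for vapply. The sample names are illustrative.

```r
extractTitle <- function(name) {
  name <- as.character(name)
  # fixed = TRUE treats "." as a literal dot, so "Mr." no longer matches "Mrs."
  if (grepl("Miss.", name, fixed = TRUE)) {
    return("Miss.")
  } else if (grepl("Master.", name, fixed = TRUE)) {
    return("Master.")
  } else if (grepl("Mrs.", name, fixed = TRUE)) {
    return("Mrs.")
  } else if (grepl("Mr.", name, fixed = TRUE)) {
    return("Mr.")
  } else {
    return("Other")
  }
}

# Toy data frame (illustrative names only).
data.combined <- data.frame(
  Name = c("Braund, Mr. Owen Harris",
           "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
           "Heikkinen, Miss. Laina",
           "Palsson, Master. Gosta Leonard",
           "Smith, Dr. John"),
  stringsAsFactors = FALSE)

# nrow() returns NULL for anything that is not a data frame or matrix.
not.a.frame <- nrow(c("a", "b"))   # NULL

# Apply the function to every name; each call must return one string.
titles <- vapply(data.combined$Name, extractTitle, character(1),
                 USE.NAMES = FALSE)
```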
Hey David, when I added "sex" and ran
rf.8.preds <- predict(rf.8, test.submit.df)
I get the following error:
Error in predict.randomForest(rf.8, test.submit.df) : variables in the training data missing in newdata
======================================================================================
### Add Sex DJJ #######
# Train a Random Forest using pclass, title, parch, & family.size
rf.train.8 <- data.combined[1:891, c("pclass", "title", "sibsp", "family.size","sex")]
set.seed(1234)
rf.8 <- randomForest(x = rf.train.8, y = rf.label, importance = TRUE, ntree = 1000)
rf.8
varImpPlot(rf.8)
#==============================================================================
#
# Video #5
#
#==============================================================================
# Before we jump into features engineering we need to establish a methodology
# for estimating our error rate on the test set (i.e., unseen data). This is
# critical, for without this we are more likely to overfit. Let's start with a
# submission of rf.5 to Kaggle to see if our OOB error estimate is accurate.
# Subset our test records and features
test.submit.df <- data.combined[892:1309, c("pclass", "title", "family.size","sex")]
# Make predictions
rf.8.preds <- predict(rf.8, test.submit.df)
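The error message in the comment above says the test frame lacks a column the model was trained on. Comparing the two column lists from the pasted code points at "sibsp", which is in the training subset but not in the test subset. A small base R check, with the column vectors copied from the comment:

```r
# Column lists as they appear in the comment above.
train.cols <- c("pclass", "title", "sibsp", "family.size", "sex")
test.cols  <- c("pclass", "title", "family.size", "sex")

# Columns used in training that the test subset does not include.
missing.cols <- setdiff(train.cols, test.cols)   # "sibsp"

# Reusing a single vector for both subsets keeps them in sync, e.g.:
features <- train.cols
# rf.train.8     <- data.combined[1:891, features]
# test.submit.df <- data.combined[892:1309, features]
```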