Discover the power of the data frame in R!
By now, you’ve already learned quite a few things in R. Data structures such as vectors, matrices and lists hold no secrets for you anymore. However, R is a statistical programming language, and in statistics you’ll often be working with data sets. Such data sets typically consist of observations, or instances, and each observation has some variables associated with it. For example, you could have a data set of 5 people. Each person is an instance, and the properties of these people, such as their name, their age and whether they have children, are the variables.
How could you store such information in R? In a matrix? Not really: the name would be a character and the age would be a numeric, and a matrix can only hold elements of a single type. In a list, maybe? This could work, because you can put practically anything in a list. You could create a list of lists, where each sublist is a person, with a name, an age and so on. However, the structure of such a list is not really convenient to work with. What if you want to know all the ages, for example? You’d have to write a lot of R code just to get what you want. So what data structure could we use instead?
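To make the list-of-lists idea concrete, here is a minimal sketch. The names, ages and the `people` variable are illustrative assumptions, not data from the tutorial. It shows that collecting all the ages means visiting every sublist in turn.

```r
# Illustrative values only: these people are made up for the example.
people <- list(
  list(name = "Anne",  age = 28, child = FALSE),
  list(name = "Pete",  age = 30, child = TRUE),
  list(name = "Frank", age = 21, child = TRUE)
)

# To get all the ages, we have to loop over the sublists
# and pull the age element out of each one.
ages <- sapply(people, function(person) person$age)
print(ages)
```

It works, but every question about a single variable needs this kind of helper code.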
Meet the data frame. It’s the fundamental data structure to store typical data sets. It’s pretty similar to a matrix, because it also has rows and columns. In a data frame, the rows correspond to the observations (the persons in our example), while the columns correspond to the variables, or the properties of each of these persons. The big difference with matrices is that a data frame can contain elements of different types. One column can contain characters, another one numerics and yet another one logicals. That’s exactly what we need to store the information about our persons, right? We could have a column for the name, which is character, one for the age, which is numeric, and one logical column to denote whether the person has children.
There is still a restriction on the data types, though: elements in the same column should be of the same type. That’s not really a problem, because in one column, the age column for example, you’ll always want a numeric, because an age is always a number, regardless of the observation.
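The type restriction can be seen directly in a small sketch. The values are made up for illustration. When R is asked to mix types inside one vector or one matrix, it silently converts everything to a common type, which is why a matrix is a poor fit for names and ages together.

```r
# Mixing a number and a string in one vector: R coerces both to character.
mixed <- c(28, "thirty")
print(class(mixed))

# The same happens in a matrix: binding a character column and a numeric
# column gives a matrix in which every element is a character string.
m <- cbind(name = c("Anne", "Pete"), age = c(28, 30))
print(typeof(m))
print(m)
```

In the matrix the ages are no longer numbers, so you could not, say, take their mean without converting them back first.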
So, for the practical part now: creating a data frame. In most cases, you don’t create a data frame yourself. Instead, you typically import data from another source. This could be a CSV file or a relational database, but the data could also come from other software packages like Excel or SPSS.
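As a rough sketch of the import route, the snippet below writes a tiny CSV file to a temporary location and reads it back with base R’s `read.csv()`. The file contents and the `csv_path` variable are assumptions made up for this example; in practice the file would already exist somewhere on disk.

```r
# Create a small throwaway CSV file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("name,age,child",
    "Anne,28,FALSE",
    "Pete,30,TRUE"),
  csv_path
)

# read.csv() returns a data frame, using the first line as column names.
people <- read.csv(csv_path)
print(people)

# Clean up the temporary file.
unlink(csv_path)
```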
Of course, R provides ways to manually create data frames as well. You use the `data.frame()` function for this. To create our people data frame that has 5 observations and 3 variables, we’ll have to pass the `data.frame()` function 3 vectors that are all of length 5. The vectors you pass correspond to the columns. Let’s create these three vectors first: `name`, `age` and `child`.
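The three vectors could look something like this. The actual values are not given in the transcript, so the ones below are assumptions chosen for illustration.

```r
# Illustrative values: one entry per person, five people in total.
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# All three vectors must have the same length, one element per observation.
print(sapply(list(name, age, child), length))
```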
Now, calling the data frame function is simple:
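A minimal sketch of that call, with the three vectors repeated so the snippet runs on its own (the values are illustrative assumptions):

```r
# Illustrative column vectors, one element per person.
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Each vector becomes one column of the data frame.
df <- data.frame(name, age, child)
print(df)

# The column names are taken from the variable names.
print(names(df))
```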
The printout of the data frame already shows very clearly that we’re dealing with a data set. Notice how the data frame function inferred the names of the columns from the variable names you passed it. To specify the names explicitly, you can use the same techniques as for vectors and lists. You can use the names function, …, or use equals signs inside the data frame function to name the data frame columns right away.
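Both naming techniques can be sketched as follows. The capitalised column names and the example values are assumptions picked for illustration.

```r
# Illustrative column vectors.
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# Option 1: create the data frame first, then assign names with names().
df <- data.frame(name, age, child)
names(df) <- c("Name", "Age", "Child")

# Option 2: name the columns right away using equals signs.
df2 <- data.frame(Name = name, Age = age, Child = child)

print(names(df))
print(names(df2))
```

Both routes end up with the same column names; the second one saves a step.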
As with matrices, it’s also possible to name the rows of the data frame, but that’s generally not a good idea, so I won’t go into detail on that here.
Before you head over to some exercises, let me briefly discuss the structure of a data frame some more.
If you look at this structure, …, there are two things you can see here. First, the printout looks suspiciously similar to that of a list. That’s because, under the hood, the data frame actually is a list. In this case, it’s a list with three elements, corresponding to each of the columns in the data frame. Each list element is a vector of length 5, corresponding to the number of observations. A requirement that is not present for lists is that the vectors you put in the list all have to be of equal length. If you try to create a data frame with 3 vectors that are not all of the same length, you’ll get an error, unless the shorter vectors can be recycled a whole number of times to match the longest one.
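Here is a small sketch of these points, assuming the structure is inspected with `str()` and using made-up example values. It checks that the data frame is a list with one element per column, and then deliberately passes a vector of the wrong length to show the error.

```r
# Illustrative column vectors.
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

df <- data.frame(name, age, child)

# Compactly display the structure: one line per column.
str(df)

# Under the hood, a data frame is a list with one element per column.
print(is.list(df))
print(length(df))

# A vector of length 4 cannot be recycled to fill 5 rows, so this fails.
# tryCatch() captures the error message instead of stopping the script.
result <- tryCatch(
  data.frame(name, age = c(28, 30, 21, 39), child),
  error = function(e) conditionMessage(e)
)
print(result)
```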
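Second, the name column, which you expect to be a character vector, may actually be a factor. That’s because versions of R before 4.0.0 store strings as factors by default when creating a data frame; from R 4.0.0 onwards, strings stay character vectors by default. To suppress the conversion explicitly, you can set the `stringsAsFactors` argument of the `data.frame()` function to `FALSE`.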
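The effect of the argument can be sketched like this, with illustrative values. Setting `stringsAsFactors` explicitly in both calls makes the result independent of the R version you happen to be running.

```r
# Illustrative column vectors.
name  <- c("Anne", "Pete", "Frank", "Julia", "Cath")
age   <- c(28, 30, 21, 39, 35)
child <- c(FALSE, TRUE, TRUE, FALSE, TRUE)

# With stringsAsFactors = TRUE, the name column is converted to a factor.
df_factor <- data.frame(name, age, child, stringsAsFactors = TRUE)
print(class(df_factor$name))

# With stringsAsFactors = FALSE, the name column stays a character vector.
df_char <- data.frame(name, age, child, stringsAsFactors = FALSE)
print(class(df_char$name))

str(df_char)
```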
Now, the name column actually contains characters.
With this new knowledge, you’re ready for some first exercises on this extremely useful and powerful data structure.