Adding R to the Data Toolkit

I’ve officially jumped on the R bandwagon. I worked on a project last year for which R turned out to be the best way to tackle a lot of messy data (OpenRefine was not reproducible enough and let’s not even talk about the disaster that was Access). Since then, I’ve thrown other data at R and now consider it part of my regular suite of data tools.

I want to emphasize that last point: R is just one piece in the data toolkit. Software like R has a steep learning curve if you’ve never programmed before. There are other tools, like OpenRefine, that get the job done and are friendlier to the average user. But for processing large amounts of data in a reproducible way, R is definitely worth learning. (Here’s roughly how I break my data needs down: Excel is for everyday data work; OpenRefine is for one-off data cleaning; and R is for large-scale, reproducible data cleaning and processing.)

So if you find yourself with a lot of data to process, I have some tips for learning R:

Run R in RStudio.
- It takes a little effort to learn the RStudio interface, but it’s a better experience than the base R command line if you’re not used to working that way.
Have a problem to solve.
- Learning a programming language is always easier if you have a specific task to accomplish.
Take advantage of existing resources.
- I watched a lot of Lynda videos to learn R (though be aware that Lynda is working through some privacy issues).
- I really love the book “R for Data Science”.
- I also cannot recommend RStudio’s cheat sheets enough.

Finally, I should say that I’m a patron of the Tidyverse, which is a flavor of R that comes with its own tools and methods for data handling. The Tidyverse makes data cleaning easy, but you do have to organize your data in a particular way, with columns as variables and rows as individual observations. Tidy data is not compact: it usually means a few columns with rows and rows of data, but this formatting enables streamlined processing. You don’t have to use the Tidyverse to use R, but it can be quite useful.
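
To make the idea of tidy data a bit more concrete, here is a minimal sketch. It assumes the tidyr and dplyr packages are installed, and the data frame, its column names, and its values are all made up for illustration. It takes a “wide” table with one column per year and reshapes it so that each row is a single observation:

```r
library(tidyr)
library(dplyr)

# Hypothetical "wide" data: one row per site, one column per year.
wide <- data.frame(
  site   = c("A", "B"),
  `2019` = c(10, 7),
  `2020` = c(12, 9),
  check.names = FALSE  # keep the year column names exactly as written
)

# Reshape to tidy form: every column except `site` becomes a
# (year, count) pair, so each row is one observation.
tidy <- pivot_longer(
  wide,
  cols      = -site,
  names_to  = "year",
  values_to = "count"
)

# Once the data is tidy, a grouped summary takes only a few lines.
totals <- tidy %>%
  group_by(site) %>%
  summarise(total = sum(count))
```

The tidy version has more rows and fewer columns than the original, which is the trade-off described above: the table gets longer, but every later step can treat “year” and “count” as ordinary variables.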

R is not the most efficient way to solve every data problem and it takes time to learn, but I think there is an advantage to learning a language like R (or Python or…) if you have serious data manipulation needs. Does it have a place in your data toolkit?

Data Ab Initio

Managing research data right, from the start
