Key information

Tutor: Dr Liam Gao
Duration: 2 x 2-hour sessions
Delivery: Live (In-Person)
Course Credit (PGR only): 1 credit 
Audience: Research Degree Students, Postdocs, Research Fellows

Dates

  • 15 & 18 December 2025
    11:00-12:30, South Kensington

Current scientific problems often involve processing big data. One of the most popular statistical tools, R, is widely used by students and researchers as a platform for such data processing. However, as you work with increasingly large datasets, you may run into scalability problems. The standard data processing tools available in R often struggle to handle such datasets on a local desktop or laptop computer, owing to software and hardware limitations. On the software side, the default data structures are not optimised for big data. On the hardware side, the main memory required can far exceed what conventional hardware provides.

Fortunately, you can greatly reduce these problems by making use of some specific R packages. Three R packages - data.table, dtplyr, and sparklyr - expand your capability to work with big data. These packages provide efficient data structures and methods that boost both the capacity and the computational performance of a local computer when dealing with huge datasets stored as CSV files or in a relational database. You will learn the basic skills of manipulating big datasets in R/RStudio for prototyping data analysis.
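As a taste of the kind of gain these packages offer, here is a minimal sketch using data.table's fread(), a multi-threaded replacement for base read.csv(); the file, column names and filter condition are invented for the example:

```r
# fread() from data.table reads CSV files in parallel and returns a
# data.table, which is also a data.frame. The data below are illustrative.
library(data.table)

# Write a small example CSV so the snippet is self-contained
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, value = rnorm(1000)), tmp, row.names = FALSE)

dt <- fread(tmp)   # drop-in, typically much faster than read.csv()

# data.table syntax dt[i, j, by]: filter, compute and group in one call
dt[value > 0, .(n = .N, mean_value = mean(value))]
```

On files of a few gigabytes the difference between fread() and read.csv() is usually dramatic; the workshop compares the two directly.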

The workshop will be delivered through a combination of slides, live demonstrations and hands-on practice.

Syllabus

  • What is big data?
  • The data.table package and performance comparison with the widely used data.frame
  • Dataframe manipulation with dtplyr
  • Interfacing with Apache Spark in R (sparklyr)
  • Leveraging the pipe operator in data analysis
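The pipe operator in the last syllabus item can be illustrated with base R alone (version 4.1 or later; the data and column names here are invented): the native pipe |> passes each step's result as the first argument of the next call.

```r
# Native pipe |> (base R >= 4.1): each step's result becomes the
# first argument of the next call. The data are illustrative.
df <- data.frame(g = rep(c("a", "b"), each = 5), x = 1:10)

totals <- df |>
  subset(x > 2) |>           # keep rows where x > 2
  with(tapply(x, g, sum))    # sum x within each group g

totals  # named vector: a = 12, b = 40
```

The same style carries over to data.table, dtplyr and sparklyr pipelines, which is why the course treats the pipe as a prerequisite.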

Learning Outcomes

On completion of this workshop, you will be better able to:

  • Use the data.table, dtplyr and sparklyr packages in RStudio or R to load big datasets more efficiently
  • Manage data types by choosing the appropriate functionality from big data processing packages
  • Compose code for cleaning and analysing big data using the packages
  • Integrate external tools such as a relational database system or Apache Spark into your analysis
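As a sketch of the dtplyr outcome above (assuming the dtplyr, dplyr and data.table packages are installed; the data and names are invented): familiar dplyr verbs are recorded lazily and translated into data.table code, which only runs when you collect the result.

```r
# dtplyr translates dplyr verbs into data.table operations.
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

df <- data.frame(g = rep(c("a", "b"), each = 3), x = 1:6)

res <- lazy_dt(df) |>          # wrap as a lazy data.table
  filter(x > 1) |>             # verbs are recorded, not yet executed
  group_by(g) |>
  summarise(total = sum(x)) |>
  as_tibble()                  # forces the data.table translation to run
```

Calling show_query() on the lazy pipeline (before as_tibble()) prints the generated data.table code, which is a useful way to learn the dt[i, j, by] idiom.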

Pre-requisites

You will not be able to follow the course without these essential prerequisites:

  1. Basic knowledge of the R programming language
  2. Familiarity with the pipe operator (learning materials are available at https://github.com/ImperialCollegeLondon/RCDS-data-processing-with-r/blob/master/data_processing_with_R_2.Rmd)
  3. Basic SQL (learning materials are available at http://swcarpentry.github.io/sql-novice-survey/)

How to book


Please ensure you have read and understood ECRI’s cancellation policy before booking.