Key information
Tutor: Dr Liam Gao
Duration: 2 x 2-hour sessions
Delivery: Live (In-Person)
Course Credit (PGR only): 1 credit
Audience: Research Degree Students, Postdocs, Research Fellows
Dates
- 15 & 18 December 2025
11:00-12:30, South Kensington
Course Resources
Current scientific problems often involve processing big data. R, one of the most popular statistical tools, is widely used by students and researchers as a platform for such data processing. However, as your datasets grow, you may run into scalability problems. The standard data-processing tools available in R often struggle with such datasets on a local desktop or laptop computer, owing to software and hardware limitations: on the software side, the default data structures are not optimised for big data; on the hardware side, main-memory requirements can far exceed the capacity of conventional hardware.
Fortunately, you can greatly reduce these problems by making use of some specific R packages. Three R packages - data.table, dtplyr, and sparklyr - expand your capability to work with big data. These packages also provide new methods to boost the capacity and computational performance of local computers when dealing with huge datasets in CSV or relational-database format. You will learn the basic skills of manipulating big datasets in R/RStudio for prototyping data analysis.
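As a taste of what the course covers, here is a minimal sketch of loading and summarising a large CSV with data.table (the file name and column names below are hypothetical):

```r
library(data.table)

# fread() is data.table's fast, multi-threaded CSV reader; it is
# typically much quicker than base R's read.csv() on large files
# and returns a memory-efficient data.table rather than a data.frame.
dt <- fread("measurements.csv")

# Aggregation uses data.table's dt[i, j, by] syntax:
# mean of a 'value' column grouped by a 'site' column.
summary_dt <- dt[, .(mean_value = mean(value)), by = site]
```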
The workshop will be delivered through a combination of slides, live demonstrations and hands-on practice.
Syllabus
- What is big data?
- The data.table package and performance comparison with the widely used data.frame
- Dataframe manipulation with dtplyr
- Interfacing with Apache Spark in R (sparklyr)
- Leveraging the pipe operator in data analysis
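The last two syllabus items can be sketched together: dtplyr lets you chain familiar dplyr verbs with the pipe operator while translating the work to fast data.table code under the hood (the toy data and column names here are hypothetical):

```r
library(dplyr)
library(dtplyr)
library(data.table)

# Wrap a data.table in a "lazy" object so dplyr verbs are
# translated into data.table operations rather than run directly.
dt <- lazy_dt(data.table(site  = c("A", "A", "B"),
                         value = c(1.0, 2.0, 3.0)))

# Chain transformations with the pipe operator, then collect
# the result back into an ordinary in-memory tibble.
result <- dt |>
  filter(value > 1) |>
  group_by(site) |>
  summarise(mean_value = mean(value)) |>
  as_tibble()
```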
Learning Outcomes
On completion of this workshop, you will be better able to:
- Use the data.table, dtplyr and sparklyr packages in RStudio or R to load big datasets more efficiently
- Manage data types by choosing the appropriate functionality from big data processing packages
- Compose code for cleaning and analysing big data using the packages
- Integrate external tools such as a relational database system or Apache Spark into your analysis
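For the last outcome, a minimal sketch of the sparklyr workflow looks like this (the file name and column names are hypothetical; a local Spark installation is assumed):

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance, suitable for prototyping.
sc <- spark_connect(master = "local")

# Read a CSV directly into Spark rather than into R's memory.
tbl <- spark_read_csv(sc, name = "measurements",
                      path = "measurements.csv")

# dplyr verbs are translated to Spark SQL and executed remotely;
# collect() brings only the (small) result back into R.
result <- tbl |>
  group_by(site) |>
  summarise(mean_value = mean(value)) |>
  collect()

spark_disconnect(sc)
```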
Pre-requisites
You will not be able to follow the course without these essential prerequisites:
- Basic knowledge of the R programming language
- Familiarity with the pipe operator (learning materials are available at https://github.com/ImperialCollegeLondon/RCDS-data-processing-with-r/blob/master/data_processing_with_R_2.Rmd)
- Basic SQL (learning materials are available at http://swcarpentry.github.io/sql-novice-survey/)
How to book
- Early Career Researchers (Research Degree Students, Postdocs, Research Fellows) should book via Inkpath using your Imperial Single-Sign-On.
- All other members of the Imperial community should book here.
Please ensure you have read and understood ECRI’s cancellation policy before booking.