Course: INFOMDWR
Data wrangling and data analysis
Course information
Course code: INFOMDWR
Credits (EC): 14
Course goals
In this course, you will learn to:
  1. Know, explain, and apply data retrieval from existing relational and nonrelational databases, including text, using queries built from primitives such as select, subset, and join, both directly in, e.g., SQL and through a rjson interface.
  2. Know, explain, and apply common data clean-up procedures, including missing data and the appropriate imputation methods and feature selection.
  3. Know, explain, and apply methodology to properly set up data analysis experiments, such as train/validate/test splits and the bias/variance trade-off.
  4. Know, explain, and apply supervised machine learning algorithms, both for classification and regression purposes as well as their related quality measures, such as AUC and the confusion matrix.
  5. Know, explain, and apply unsupervised learning algorithms, such as clustering and other techniques that result in lower-dimensional data representations.
  6. Be able to choose between the different techniques learned in the course and be able to explain why the chosen technique fits both the data and the research question best.
Assessment
Your final grade in the course will be a weighted average of the following parts:
P1: weekly assignments (pass/fail): for each week there will be assignments containing both theoretical and programming questions. Timely and satisfactory completion of these assignments is required to pass the course.
P2: weekly exams (30%): each week, there will be a short exam of around 20 questions that should be answered in 20 minutes.
P3: final exam (70%): The final exam will include both theoretical and programming questions that cover all the material of the course.

To pass the course, a student's final grade must be greater than or equal to 5.5, provided that the grades for P2 and P3 are each greater than 4.

A repair test requires at least a 4 for the original test.
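
As an illustration of the grading rules above, the following is a minimal sketch of how the final grade and the pass condition could be computed. The function names are hypothetical, and two points are assumptions not stated in the course description: that "greater than 4" applies to P2 and P3 individually, and that no rounding is applied before comparing against 5.5.

```python
def final_grade(p2: float, p3: float) -> float:
    """Weighted average of the weekly exams (30%) and the final exam (70%).

    Integer weights divided by 10 are used instead of 0.3 and 0.7 so that
    boundary values such as 5.5 are not distorted by floating-point error.
    """
    return (3 * p2 + 7 * p3) / 10


def passes_course(p1_passed: bool, p2: float, p3: float) -> bool:
    """Apply the pass rule as read from the course description.

    Assumptions (not confirmed by the source): the minimum of "greater
    than 4" holds for P2 and P3 separately, and no rounding is applied.
    """
    return (
        p1_passed                        # P1: weekly assignments, pass/fail
        and p2 > 4                       # P2: weekly exams
        and p3 > 4                       # P3: final exam
        and final_grade(p2, p3) >= 5.5   # weighted final grade
    )


if __name__ == "__main__":
    # Example: weekly exams 5.0, final exam 6.0, assignments passed.
    print(final_grade(5.0, 6.0))
    print(passes_course(True, 5.0, 6.0))
```

Under this reading, a student with a high final exam grade but a weekly exam average of exactly 4 would not pass, because the minimum is strict.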

 
Content

Data do not fall from heaven; they are created, manipulated, transformed, and cleaned. In any data analysis, therefore, the treatment of the data is just as important as the modeling techniques applied to them.
In this course, you will learn to perform predictive data analysis to gain insights for science and business applications, while simultaneously keeping track of where these data originated and handling them yourself.
The course consists of two parts, data wrangling and data analysis, which are intertwined.
Each week, you will attend lectures and do a series of computer exercises, with a weekly exam to test your progress.

Course form
Lectures, tutorials, practicals.
Each week there are theoretical lectures that present the theory and give a general overview of the available systems. Laboratory exercises and tutorial sessions then give hands-on experience in which students practice the theory on real-world applications. These laboratory and tutorial sessions are held with the assistance of teaching assistants or the professor.
The practical work in these labs is drawn from real-life situations, which allows students to experience first-hand how to work on data science problems. Due to the current situation, the lectures and the tutorials will be delivered online.

Literature
Tentative (can be changed during the course):

  1. James et al., "An Introduction to Statistical Learning" http://www-bcf.usc.edu/~gareth/ISL/
  2. Grolemund & Wickham, "R for Data Science" https://r4ds.had.co.nz/
  3. Janssens, "Data Science at the Command Line" https://www.datascienceatthecommandline.com/
  4. Abraham Silberschatz, Henry F. Korth, S. Sudarshan, "Database System Concepts"
  5. Wes McKinney, "Python for Data Analysis"
  6. Raghu Ramakrishnan, Johannes Gehrke, "Database Management Systems"
  7. Tobias Bleifuß, Sebastian Kruse, Felix Naumann, "Efficient Denial Constraint Discovery with Hydra", Proceedings of the VLDB Endowment (PVLDB), 11(3):311-323, 2017
  8. Loukides, M., "What is data science? The future belongs to the companies and people that turn data into products"
  9. Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques"
  10. Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques"
  11. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, "Introduction to Information Retrieval" https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf
  12. Jure Leskovec, Anand Rajaraman, Jeff Ullman, "Mining Massive Datasets" http://www.mmds.org
  13. Stef van Buuren, "Flexible Imputation of Missing Data" https://stefvanbuuren.name/fimd
  14. DL Oberski, "Mixture models: latent profile and latent class analysis" https://daob.nl/wp-content/uploads/2015/06/oberski-LCA.pdf
  15. Jurafsky, D., Martin, J.H., "Speech and Language Processing", third edition, online chapters: https://web.stanford.edu/~jurafsky/slp3/