Course: INFOMDWR
Data wrangling and data analysis
Course information
Course code: INFOMDWR
Credits (EC): 14
Course objectives
In this course, you will learn to:
  1. Know, explain, and apply data retrieval from existing relational and nonrelational databases, including text, using queries built from primitives such as select, subset, and join, both directly in, e.g., SQL and in Python Pandas.
  2. Know, explain, and apply common data clean-up procedures, including the handling of missing data with appropriate imputation methods, and feature selection.
  3. Know, explain, and apply methodology to properly set up data analysis experiments, such as train/validate/test splits and the bias/variance trade-off.
  4. Know, explain, and apply supervised machine learning algorithms, both for classification and regression purposes as well as their related quality measures, such as AUC and the confusion matrix.
  5. Know, explain, and apply unsupervised learning algorithms, such as clustering and other techniques that result in lower-dimensional data representations.
  6. Be able to choose between the different techniques learned in the course and be able to explain why the chosen technique fits both the data and the research question best.
Assessment
Your final grade in the course will be a weighted average of the following parts:
  • P1. Weekly assignments (pass/fail): each week there will be an assignment containing both theoretical and programming questions. Timely and satisfactory completion of 7 out of 8 assignments is required to pass the course. Students who pass all 8 assignments will receive a 0.5-point bonus on the final grade.
  • P2. Weekly exams (30%): each week, there will be a short exam of about 30 minutes.
  • P3. Final exam (70%): the final exam will include both theoretical and programming questions that cover all the material of the course.
To pass the course, a student’s final grade must be greater than or equal to 5.5, with the precondition that the grade for each of P2 and P3 is greater than 4.
In addition, P1 should be satisfied.
Students with final grades between 4.0 and 5.4 will have a chance to attend the resit exam, which will be considered as a replacement for their final exam.
To be able to attend the resit exam, your grades for each of P2 and P3 should be greater than or equal to 4.
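
As a rough illustration of the grading rule above, the following Python sketch computes the weighted final grade and the pass condition. The function names are hypothetical, and two points are assumptions because the text does not state them: the 0.5 bonus is added before the 5.5 threshold is checked, and no rounding rule is applied beyond suppressing floating-point noise.

```python
def final_grade(p2: float, p3: float, assignments_passed: int) -> float:
    """Weighted average of P2 (30%) and P3 (70%), plus the bonus for 8 passed assignments."""
    grade = 0.3 * p2 + 0.7 * p3
    if assignments_passed == 8:
        grade += 0.5  # assumption: the bonus is added before the 5.5 check
    # Rounding to two decimals only suppresses floating-point noise;
    # the course's actual rounding rule is not stated.
    return round(grade, 2)


def passes_course(p2: float, p3: float, assignments_passed: int) -> bool:
    """Apply the pass conditions: P1 satisfied, P2 and P3 above 4, final grade at least 5.5."""
    p1_satisfied = assignments_passed >= 7  # 7 out of 8 assignments required
    return (
        p1_satisfied
        and p2 > 4
        and p3 > 4
        and final_grade(p2, p3, assignments_passed) >= 5.5
    )


if __name__ == "__main__":
    # Example: weekly exams 5.0, final exam 6.0, 7 of 8 assignments passed.
    print(final_grade(5.0, 6.0, 7), passes_course(5.0, 6.0, 7))
```

Under these assumptions, a student with a 5.0 on both P2 and P3 would pass only if all 8 assignments were completed, since the bonus lifts the weighted average from 5.0 to 5.5.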

Prerequisites

This course is for students in the Applied Data Science master's programme only.
Content

Data do not fall from heaven, but are created, manipulated, transformed, and cleaned. In any data analysis, therefore, the treatment of the data itself is just as important as the modeling techniques applied to them.
In this course, you will get acquainted with and implement a variety of techniques to go from raw data to analyses, visualizations and insights for science and business applications.
This is an overview course designed to give you the tools and skills to use and evaluate data science methods.

Course form
Each week there are lectures that present the theories and give a general overview of the systems that are available.
The laboratory exercises and tutorial sessions then give hands-on experience, where students can practice the theory on real-world applications.
These laboratories and tutorial sessions are performed with the assistance of the teaching team.
The practical work done in these labs is drawn from real-life situations that allow the students to experience how to solve data science problems.

Literature
Tentative (can be changed during the course):

  1. James et al., "An Introduction to Statistical Learning", http://www-bcf.usc.edu/~gareth/ISL/
  2. Grolemund & Wickham, "R for Data Science", https://r4ds.had.co.nz/
  3. Janssens, "Data Science at the Command Line", https://www.datascienceatthecommandline.com/
  4. Abraham Silberschatz, Henry F. Korth, S. Sudarshan, "Database System Concepts"
  5. Wes McKinney, "Python for Data Analysis"
  6. Raghu Ramakrishnan, Johannes Gehrke, "Database Management Systems"
  7. Tobias Bleifuß, Sebastian Kruse, Felix Naumann, "Efficient Denial Constraint Discovery with Hydra", Proceedings of the VLDB Endowment (PVLDB), 11(3):311-323, 2017
  8. Loukides, M., "What is data science? The future belongs to the companies and people that turn data into products"
  9. Jiawei Han, Micheline Kamber, Jian Pei, "Data Mining: Concepts and Techniques"
  10. Ian H. Witten, Eibe Frank, "Data Mining: Practical Machine Learning Tools and Techniques"
  11. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, "Introduction to Information Retrieval", https://nlp.stanford.edu/IR-book/pdf/irbookprint.pdf
  12. Jure Leskovec, Anand Rajaraman, Jeff Ullman, "Mining Massive Datasets", http://www.mmds.org
  13. Stef van Buuren, "Flexible Imputation of Missing Data", https://stefvanbuuren.name/fimd
  14. D.L. Oberski, "Mixture models: latent profile and latent class analysis", https://daob.nl/wp-content/uploads/2015/06/oberski-LCA.pdf
  15. Jurafsky, D., Martin, J.H., "Speech and Language Processing", third edition, online chapters: https://web.stanford.edu/~jurafsky/slp3/