Data Mining

Data Mining is a multidisciplinary field touching on important topics across machine learning, natural language processing, information retrieval, and optimisation. This module is delivered in the second semester at the Department of Computer Science, The University of Liverpool, as a master's-level module. It requires undergraduate-level knowledge of linear algebra, calculus and probability theory as mathematical preliminaries. All assignments must be submitted in Python 3.0, and the lab sessions require prior knowledge of the Python programming language.

Days and Locations

A Q&A forum is available on Vital.

Video streams are available on stream.liv.

Lecture schedule and slides

# Date Title Slides
1. Jan 28 Introduction to Data Mining
2. Jan 30 Mathematical Preliminaries
3. Jan 31 Mathematical Preliminaries
4. Feb 4 Data Representation
5. Feb 6 Perceptron
6. Feb 7 Missing value handling, labelling and noisy data
7. Feb 11 k-NN classifier
8. Feb 13 Classifier Evaluation
9. Feb 14 Naive Bayes classifier
10. Feb 18 Problem Set 1
11. Feb 20 Decision Tree Learner
12. Feb 21 Logistic regression
13. Feb 25 Problem Set 2
14. Feb 27 Text mining. Part 1 (Bag-of-Words Model)
15. Feb 28 Text mining. Part 2 (POS tagging, NER)
16. March 4 Text mining. Part 3 (Relation Extraction)
17. March 6 k-means clustering
18. March 7 Cluster evaluation measures [above]
19. March 11 Dimensionality reduction (SVD)
20. March 13 Dimensionality reduction (PCA) [above]
21. March 14 Problem Set 3
22. March 18 Information Retrieval
23. March 20 Graph mining
24. March 21 Neural networks
25. April 1 Deep Learning
26. April 3 Word Representations
27. April 4 Word Representations [above]
28. April 29 Sequential data
29. May 1 Revision-1
30. May 2 Revision-2

Assignment 1 (12% of course marks)

Assignment 2 (13% of course marks)

Final Exam (75% of course marks)

Resit Assignment 1

Resit Assignment 2

Resit Exam

Past Exams with Answers

Lab Sessions / Tutorials

The concepts covered in the lectures will be further developed through a series of programming tutorials. We will implement some of the algorithms we learn in the course using Python, and also use some of the freely available machine learning and data mining tools. The two lab sessions are identical, and you only need to attend one session per week. If your student number is even, attend the Monday lab session; otherwise attend the Thursday lab session. Attendance is marked for the lab sessions.

Location: Mondays 12:00-13:00 GHOLT-H105 (Lab 3)

Location: Thursdays 09:00-10:00 GHOLT-H105 (Lab 3)

Lab Tasks

Problem Sets

The following problem sets are for evaluating your understanding of the topics covered in the lectures. Try them by yourselves first. We will discuss the solutions during the lectures and lab sessions later. You are not required to submit your solutions, and they will not be marked or count towards your final mark for the module. The problem sets are for self-assessment only.

References

Here is a list of useful references.

  1. Mathematics for Machine Learning. Read Part I (in particular Chapters 2, 3, 4, 5 and 6).

  2. Pattern Recognition and Machine Learning, by Chris Bishop. For machine learning related topics.

  3. A Course in Machine Learning, by Hal Daume III. Excellent introductory material on various machine learning topics.

  4. Data Mining: Practical Machine Learning Tools and Techniques, by Ian Witten. For decision tree learners, association rule mining, and data pre-processing related topics.

  5. Foundations of Statistical Natural Language Processing, by Christopher Manning. For text processing and text mining related topics.

  6. Introduction to Linear Algebra, by Gilbert Strang. A good reference for brushing up on linear algebra. MIT video lectures based on the book are also available.

  7. An excellent reference for the maths required for data mining and machine learning, by Hal Daume III.

  8. numpy (Python numeric processing)

  9. scipy (MATLAB-like functions for Python)