Data Mining is a multidisciplinary field touching important topics across machine learning, natural language processing, information retrieval, and optimization. This module is delivered in the second semester at the Department of Computer Science, The University of Liverpool, as a master's-level module.
Mondays 09:00-10:00 BROD-406/406a (building 233, Map C8)
Wednesdays 11:00-12:00 LIFS-SR2
Fridays 11:00-12:00 SCTH-SR6 (building 120, Map E2)
| # | Date | Title | Slides |
|---|------|-------|--------|
| 1 | Jan 30 | Introduction to Data Mining | |
| 2 | Feb 1 | Data representation | |
| 3 | Feb 3 | Missing value handling, labelling and noisy data | |
| 4 | Feb 6 | k-NN classifier | |
| 5 | Feb 8 | Perceptron | |
| 6 | Feb 10 | Classifier Evaluation | |
| 7 | Feb 13 | Decision Tree Learner | |
| 8 | Feb 15 | Naive Bayes classifier | |
| 9 | Feb 17 | Logistic regression. Part 1 | |
| 10 | Feb 20 | Logistic regression. Part 2 | [above] |
| 11 | Feb 22 | Support Vector Machines. Part 1 | |
| 12 | Feb 24 | Support Vector Machines. Part 2 | [above] |
| 13 | Feb 27 | Support Vector Machines. Part 3 | [above] |
| 14 | March 1 | k-means clustering | |
| 15 | March 3 | Cluster evaluation measures | [above] |
| 16 | March 6 | Text mining. Part 1 | |
| 17 | March 8 | Text mining. Part 2 | [above] |
| 18 | March 10 | Information Retrieval | |
| 19 | March 13 | Graph mining. Part 1 | |
| 20 | March 15 | Graph mining. Part 2 | [code] |
| 21 | March 17 | Dimensionality reduction (SVD) | |
| 22 | March 20 | Dimensionality reduction (PCA) | [above] |
| 23 | March 22 | Data visualization | |
| 24 | March 24 | Data visualization | [above] |
| 25 | March 27 | Neural networks and Deep Learning. Part 1 | |
| 26 | March 29 | Neural networks and Deep Learning. Part 2 | |
| 27 | April 24 | Sequential data | |
| 28 | April 26 | Word Representations | |
| 29 | April 28 | Revision | |
| 30 | May 3 | Revision | |
Release date: February 8
Submission deadline: March 10 15:00 HRS
Release date: March 1
Implementing the k-means Clustering Algorithm (download CA2data.txt)
Submission deadline: March 31 15:00 HRS
Exam date: TBD
For your reference, 2016 Final and Resit Exam papers with answers are available here.
Release date: July 14
Submission deadline: August 4 15:00 HRS (via email to Danushka)
Release date: July 14
Implementing the Hierarchical Clustering Algorithm (download CA2data.txt)
Submission deadline: August 11 15:00 HRS (via email to Danushka)
Exam date: TBD
For your reference, 2016 Final and Resit Exam papers with answers are available here.
The concepts covered in the lectures will be further developed through a series of programming tutorials. We will implement some of the algorithms taught in the course in Python, and will also use some of the freely available machine learning and data mining tools. Pavithra Rajendran and Xia Cui will be your TAs.
Following the notebook, try out various data structures, functions and classes in Python. You can refer to the Official Tutorial, an excellent and brief overview, for further details. You can also use the IPython shell.
Task: Write a Python program to measure the similarity between two given sentences. First, compute the set of words in each sentence, and then use the Jaccard coefficient to measure the similarity between the two sets. You can see the solution here.
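As a rough sketch of one possible solution (not the released one), the task can be implemented along these lines:

```python
# Sketch: Jaccard similarity between the word sets of two sentences.
def jaccard(sent1, sent2):
    """Jaccard coefficient: |A ∩ B| / |A ∪ B| over the two word sets."""
    a = set(sent1.lower().split())
    b = set(sent2.lower().split())
    if not a and not b:
        return 0.0  # define similarity of two empty sentences as 0
    return len(a & b) / len(a | b)

print(jaccard("the cat sat on the mat", "the cat lay on the mat"))
```

Splitting on whitespace and lower-casing is the simplest tokenisation; later labs refine this.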
The sample code we used in the lab is here.
Numpy is a Python library that provides data structures useful for data mining, such as arrays, along with various functions on those data structures. Follow the official numpy tutorial and familiarize yourself with numpy.
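For instance, a few of the numpy operations that recur throughout these labs (illustrative values only):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

print(x + y)              # elementwise addition
print(x.dot(y))           # inner product: 1*4 + 2*5 + 3*6 = 32
print(np.linalg.norm(x))  # Euclidean (L2) norm of x
```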
Task: Using numpy, measure the cosine similarity between two given sentences. You can see the solution here [HTML] [notebook].
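One way the task can be approached (a sketch, not the released solution) is to build bag-of-words count vectors over the combined vocabulary and compare them with the cosine measure:

```python
import numpy as np

def cosine_similarity(sent1, sent2):
    """Cosine of the angle between the bag-of-words vectors of two sentences."""
    words1, words2 = sent1.lower().split(), sent2.lower().split()
    vocab = sorted(set(words1) | set(words2))  # shared vocabulary
    v1 = np.array([words1.count(w) for w in vocab], dtype=float)
    v2 = np.array([words2.count(w) for w in vocab], dtype=float)
    return v1.dot(v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

print(cosine_similarity("the cat sat on the mat", "the cat lay on the mat"))
```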
Let's implement a k-NN classifier to perform binary classification and evaluate its accuracy. You can download the [notebook] and [HTML] versions of this task.
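A minimal sketch of the idea behind the notebook (the toy data and the helper name `knn_predict` are made up for illustration):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Predict a binary label {0, 1} for x by majority vote of its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distance to each training point
    nearest = np.argsort(dists)[:k]              # indices of the k closest points
    return int(np.round(y_train[nearest].mean()))  # majority vote for labels in {0, 1}

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.05, 0.1]), k=3))  # → 0
```

Accuracy is then simply the fraction of test points whose predicted label matches the true one.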
Let's perform various preprocessing steps on a set of feature vectors, such as L1/L2 normalization, [0,1] scaling, and Gaussian scaling, using numpy. You can download the [notebook] and [HTML] versions of this task.
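The steps above can be sketched in numpy as follows (toy matrix; the released notebook is the reference):

```python
import numpy as np

# Rows are feature vectors (instances), columns are features.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 0.0, 4.0]])

# L2 normalisation: scale each row to unit Euclidean length.
X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

# L1 normalisation: scale each row so its absolute values sum to 1.
X_l1 = X / np.abs(X).sum(axis=1, keepdims=True)

# [0, 1] scaling: rescale each feature (column) to the range [0, 1].
X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Gaussian (z-score) scaling: zero mean and unit variance per feature.
X_gauss = (X - X.mean(axis=0)) / X.std(axis=0)
```

Note that the normalisations act per instance (row), while the two scalings act per feature (column).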
Let's implement a naive Bayes classifier to classify the data used in CA1. You can download a sample program here [Naive Bayes]. Uncompress the zip archive and run the Python program to obtain the classification accuracy on the test data. Modify the code and see what happens if we do not use Laplace smoothing.
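The role of Laplace smoothing can be illustrated with a toy sketch (this is not the released [Naive Bayes] program; the dataset and function names are invented):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Collect per-class word counts and class priors; alpha is the smoothing constant."""
    vocab = {w for d in docs for w in d}
    counts = defaultdict(Counter)
    priors = Counter(labels)
    for d, y in zip(docs, labels):
        counts[y].update(d)
    return vocab, counts, priors, alpha

def predict_nb(model, doc):
    vocab, counts, priors, alpha = model
    n = sum(priors.values())
    best, best_lp = None, -math.inf
    for y in priors:
        total = sum(counts[y].values())
        lp = math.log(priors[y] / n)  # log prior
        for w in doc:
            # Laplace smoothing: adding alpha to every count keeps unseen
            # words from zeroing out the whole class probability.
            lp += math.log((counts[y][w] + alpha) / (total + alpha * len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

docs = [["good", "great"], ["great", "fun"], ["bad", "boring"], ["bad", "awful"]]
labels = ["pos", "pos", "neg", "neg"]
model = train_nb(docs, labels)
print(predict_nb(model, ["great", "unseen"]))  # → pos
```

Setting `alpha=0` reproduces the unsmoothed behaviour the exercise asks you to observe.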
We will count the co-occurrences of words in a given collection of documents, and compute the pointwise mutual information, chi-squared measure, and log-likelihood ratio. See the lecture notes on text mining for the definitions of these word association measures. The sample program that implements pointwise mutual information can be downloaded from here. When you uncompress the archive you will find two files: corpus.txt (2000 sentences, one per line; we will treat each line as a co-occurrence window) and cooc.py (computes the co-occurrences between words in corpus.txt and prints the top-ranked word pairs by their pointwise mutual information values). You are required to implement the chi-squared measure and the log-likelihood ratio. A larger 100,000-sentence corpus (large_corpus.txt) is also provided if you wish to experiment with larger text datasets.
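As a hedged sketch of the PMI computation on a toy corpus (cooc.py in the archive is the reference implementation; the sentences below are made up):

```python
import math
from collections import Counter

# Each sentence is one co-occurrence window, as in corpus.txt.
sentences = [["data", "mining", "course"],
             ["data", "mining", "lab"],
             ["text", "mining", "lab"]]

unigrams, pairs = Counter(), Counter()
for sent in sentences:
    window = set(sent)  # count each word at most once per window
    unigrams.update(window)
    pairs.update(frozenset((a, b)) for a in window for b in window if a < b)

N = len(sentences)

def pmi(w1, w2):
    """PMI(w1, w2) = log( p(w1, w2) / (p(w1) * p(w2)) ), probabilities over windows."""
    p_xy = pairs[frozenset((w1, w2))] / N
    return math.log(p_xy / ((unigrams[w1] / N) * (unigrams[w2] / N)))

print(pmi("data", "course"))
```

The chi-squared and log-likelihood-ratio measures plug the same counts into their respective formulas from the lecture notes.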
Let us implement PCA and project some data points from a high-dimensional space to a low-dimensional space. You can download the [notebook] and [HTML] versions of this task.
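One standard way to implement PCA, sketched via an eigendecomposition of the covariance matrix (the released notebook is the reference; the random data here is made up):

```python
import numpy as np

def pca(X, n_components=2):
    """Project rows of X onto the top n_components principal directions."""
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # sort descending by explained variance
    W = eigvecs[:, order[:n_components]]    # top principal directions as columns
    return Xc @ W                           # low-dimensional projection

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = pca(X, n_components=2)
print(Z.shape)  # (100, 2)
```

Equivalently, the top right singular vectors of the centred data matrix (SVD, as in lecture 21) give the same directions.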
The following problem sets are for evaluating your understanding of the various topics covered in the lectures. Try them by yourselves first; we will discuss the solutions during the lectures and lab sessions later. You are not required to submit your solutions, and they will not be marked or count towards your final mark for the module. The problem sets are for self-assessment only.
Problem Set 0 is out.
Problem Set 1 is out.
Problem Set 2 is out.
Problem Set 3 is out.
The following summer MSc projects are available to CS students at UoL. If you are interested, please contact me.
Deep learning has received much attention lately due to its impressive performance in various real-world applications, and deep learning systems have already outperformed humans on several tasks. For example, in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), progressively deeper and more complex neural network models have surpassed human performance at recognising objects in images. Machine translation systems such as Google's machine translation and Microsoft's real-time voice-to-voice translation system are powered by bidirectional sequence-to-sequence models and neural language models. In this project, we will explore the frontiers of deep learning. Specifically, we will explore the power of a recently proposed deep learning architecture called the Generative Adversarial Network (GAN). A GAN consists of two components: a generative model and a discriminative model. The generative model (a counterfeit-money producer) tries to generate data that can fool the discriminator (the police), while the discriminator tries to separate actual data from the noise. By optimising the discriminator and the generator jointly, we can learn a highly accurate discriminative model and, at the same time, generate new training data using the generative model. An example of using a GAN to generate images can be seen here. In this MSc project, we will implement two variants of GANs, f-GAN and WGAN, and compare their performance on an NLP task.
Emotion detection, sentiment analysis and related tasks have emerged as a core part of the real-world work now expected of natural language processing (NLP) systems and services in production. However, such tasks effectively attempt to replicate human knowledge, which exposes an inherent flaw in these systems: they are only as good as the models of that knowledge we can build. While there have been various attempts to formalise such models (e.g. rule-based accounts), over the last decade data-oriented modelling approaches have emerged as more reliable and accurate, with wider coverage and better scalability. Within the field of Computational Semantics, vector space models (VSMs) of linguistic meaning have emerged as a robust data-oriented approach to building such models. A key bottleneck for all applications targeting phenomena such as emotion, mood and sentiment, however, is the quality of the data used to model the underlying human knowledge. Bringing both threads together, we aim to employ VSMs to synthesise this knowledge by bootstrapping from currently available, psycholinguistically validated collections of emotion and affective meanings for selected words. For example, beginning with a collection that is essentially a list of words and their affective meanings (presented as scores collected during psycholinguistic studies), we develop VSMs for these words via deep learning, generating a network of words related to the original list, and then propagate the scores from the original list throughout this newly formed network. This will give us a powerful means of synthesising the deeper, more cognitive dimensions of natural language semantics.
There is no official textbook for this course. The following is a recommended list of textbooks, papers and websites for the various topics covered in this course.
Pattern Recognition and Machine Learning, by Chris Bishop. For machine learning related topics.
A Course in Machine Learning, by Hal Daume III. Excellent introductory material on various machine learning topics.
Data Mining: Practical Machine Learning Tools and Techniques, by Ian Witten. For decision tree learners, association rule mining and data pre-processing related topics.
Foundations of Statistical Natural Language Processing, by Christopher Manning. For text processing/mining related topics.
numpy (Python numeric processing)
scipy (scientific computing for Python, with MATLAB-like functions)
LIBSVM (an SVM library written in C, with bindings for numerous languages including Python)
scikit-learn (Machine Learning in Python)