Data Mining

Data Mining is a multidisciplinary field touching important topics across machine learning, natural language processing, information retrieval, and optimization. This lecture is delivered in the second semester at the Department of Computer Science, The University of Liverpool, as a master's-level module.

Days and Locations

Lecture schedule and slides

# Date Title slides videos
1. Jan 30 Introduction to Data Mining (Problem Set 0)
2. Feb 1 Data representation
3. Feb 2 Perceptron
4. Feb 13 Missing value handling, labelling and noisy data
5. Feb 15 k-NN classifier
6. Feb 16 Problem Set 1
7. Feb 16 Classifier Evaluation
8. Feb 20 Naive Bayes classifier
9. Feb 22 Decision Tree Learner
10. Feb 23 Logistic regression
11. Feb 23 Problem Set 2
12. Feb 27 Text mining. Part 1
13. March 1 Text mining. Part 2
14. March 2 Text mining. Part 3
15. March 6 k-means clustering
16. March 8 Cluster evaluation measures [above]
17. March 9 Support Vector Machines. Part 1
18. March 13 Support Vector Machines. Part 2 [above]
19. March 15 Support Vector Machines. Part 3 [above]
20. March 16 Support Vector Machines. [problems] [above]
21. April 10 Dimensionality reduction (SVD)
22. April 12 Dimensionality reduction (PCA) [above]
23. April 13 Problem Set 3
24. April 17 Information Retrieval
25. April 19 Graph mining
26. April 20 Spectral clustering
27. April 26 Neural networks and Deep Learning. Part 1
28. April 27 Neural networks and Deep Learning. Part 2
29. May 1 Sequential data
30. May 3 Word Representations
31. May 4 Word Representations [above]
32. May 10 Revision
33. May 11 Revision

Resit Assignment 1

Resit Assignment 2

Resit Exam

  • Exam date: TBD

  • See below for past resit questions and answers. Please use these in your revision for the resit.

Past Exams with Answers

Lab Sessions / Tutorials

The concepts that we learn in the lectures will be further developed through a series of programming tutorials. We will both implement some of the algorithms from the course in Python and use some of the freely available machine learning and data mining tools. The two lab sessions are identical and you only need to attend one session per week. If your student number is even, attend the Thursday session; otherwise, attend the Friday session. Attendance is not marked for the lab sessions, which are optional.

Location: Thursdays 13:00-14:00 GHOLT-H105 (Lab 3)

Location: Fridays 09:00-10:00 GHOLT-H105 (Lab 3)

Lab Tasks

  • Python basics

    Following the notebook, try out various data structures, functions, and classes in Python. For further details, refer to the Official Tutorial, an excellent and brief overview. You can use the IPython shell to experiment interactively.

    Task: Write a Python program to measure the similarity between two given sentences. First, compute the set of words in each sentence, and then use the Jaccard coefficient to measure the similarity between the two sentences. You can see the solution here (a minimal sketch also appears at the end of this task).

    The sample code we used in the lab is here
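
    As a rough guide, here is one possible solution, assuming simple whitespace tokenisation (the lab solution may tokenise differently):

        def jaccard(sentence1, sentence2):
            """Jaccard coefficient between the word sets of two sentences."""
            words1 = set(sentence1.lower().split())
            words2 = set(sentence2.lower().split())
            union = words1 | words2
            if not union:  # both sentences are empty
                return 0.0
            return len(words1 & words2) / len(union)

        print(jaccard("data mining is fun", "text mining is fun too"))  # 0.5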

  • Numpy basics

    NumPy is a Python library that provides data structures useful for data mining, such as arrays, and various functions on those data structures. Follow the official numpy tutorial and familiarize yourself with numpy.

    Task: Using numpy measure the cosine similarity between two given sentences. You can see the solution here [HTML] [notebook].
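
    A minimal sketch of the idea, representing each sentence as a word-count vector over the joint vocabulary (the lab notebook may differ in detail):

        import numpy as np

        def cosine_similarity(sentence1, sentence2):
            """Cosine similarity between bag-of-words count vectors."""
            words1 = sentence1.lower().split()
            words2 = sentence2.lower().split()
            vocab = sorted(set(words1) | set(words2))
            v1 = np.array([words1.count(w) for w in vocab], dtype=float)
            v2 = np.array([words2.count(w) for w in vocab], dtype=float)
            return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))

        print(cosine_similarity("data mining is fun", "text mining is fun too"))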

  • Data preprocessing

    Let's perform various preprocessing steps on a set of feature vectors, such as L1/L2 normalization, [0,1] scaling, and Gaussian scaling, using numpy. You can download the [notebook] and the [HTML] versions for this task.
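
    For reference, these operations look roughly like this in numpy (a sketch on a toy matrix, not the lab notebook itself):

        import numpy as np

        X = np.array([[1.0, 200.0],
                      [2.0, 300.0],
                      [3.0, 500.0]])  # toy matrix: one feature vector per row

        # L2 normalization: scale each row to unit Euclidean length.
        X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

        # L1 normalization: scale each row so its absolute values sum to one.
        X_l1 = X / np.abs(X).sum(axis=1, keepdims=True)

        # [0,1] scaling: map each column (feature) into the [0, 1] range.
        X_01 = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

        # Gaussian (z-score) scaling: zero mean, unit variance per column.
        X_gauss = (X - X.mean(axis=0)) / X.std(axis=0)

        print(X_l2, X_l1, X_01, X_gauss, sep="\n\n")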

  • Implement a k-NN Classifier

    Let's implement a k-NN classifier to perform binary classification and evaluate its accuracy. You can download the [notebook] and the [HTML] versions for this task.
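
    The core prediction step can be written compactly; here is a minimal sketch on toy data (the lab notebook adds evaluation on a real dataset):

        import numpy as np

        def knn_predict(X_train, y_train, x, k=3):
            """Predict the label of x by majority vote among its k nearest neighbours."""
            dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
            nearest = np.argsort(dists)[:k]              # indices of the k closest points
            votes = y_train[nearest]
            return np.bincount(votes).argmax()           # majority label

        # Toy binary dataset: two clusters in 2D.
        X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
        y_train = np.array([0, 0, 0, 1, 1, 1])
        print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # expect 1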

  • Naive Bayes Classifier

    Let's implement a naive Bayes classifier to classify the data used in CA1. You can download a sample program here [Naive Bayes]. Uncompress the zip archive and run the Python program to obtain the classification accuracy on the test data. Modify the code and see what happens if we do not use Laplace smoothing.
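
    To illustrate the role of Laplace smoothing, here is a minimal categorical naive Bayes on toy data (not the CA1 program itself; the names train_naive_bayes and alpha are illustrative). Setting alpha=0 disables smoothing, and unseen feature values then get zero probability:

        import numpy as np

        def train_naive_bayes(X, y, n_values, alpha=1.0):
            """Categorical naive Bayes; alpha=1 gives Laplace smoothing, alpha=0 disables it."""
            classes = np.unique(y)
            priors = {c: np.mean(y == c) for c in classes}
            # likelihoods[c][j][v] = P(feature j takes value v | class c)
            likelihoods = {}
            for c in classes:
                Xc = X[y == c]
                likelihoods[c] = [
                    (np.bincount(Xc[:, j], minlength=n_values) + alpha)
                    / (len(Xc) + alpha * n_values)
                    for j in range(X.shape[1])
                ]
            return priors, likelihoods

        def predict(x, priors, likelihoods):
            """Pick the class maximising log prior + sum of log likelihoods."""
            scores = {
                c: np.log(priors[c]) + sum(np.log(likelihoods[c][j][v]) for j, v in enumerate(x))
                for c in priors
            }
            return max(scores, key=scores.get)

        X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
        y = np.array([0, 0, 1, 1])
        priors, likelihoods = train_naive_bayes(X, y, n_values=2)
        print(predict([0, 1], priors, likelihoods))  # expect 0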

  • Co-occurrence Measures

    We will count the co-occurrences of words in a given collection of documents, and compute the pointwise mutual information, the chi-squared measure, and the log-likelihood ratio. See the lecture notes on text mining for the definitions of these word association measures. The sample program that implements pointwise mutual information can be downloaded from here. When you uncompress the archive you will find two files: corpus.txt (contains 2,000 sentences, one per line; we will treat each line as a co-occurrence window) and cooc.py (computes the co-occurrences between words in corpus.txt and prints the top-ranked word pairs by their pointwise mutual information values). You are required to implement the chi-squared measure and the log-likelihood ratio. A larger 100,000-sentence corpus (large_corpus.txt) is also provided if you wish to experiment with larger text datasets.
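
    As a starting point, a compact sketch of the PMI computation, estimating probabilities per sentence window (the provided cooc.py will differ in detail):

        import math
        from collections import Counter
        from itertools import combinations

        def pmi_scores(sentences):
            """Pointwise mutual information for word pairs; each sentence is one window."""
            word_count, pair_count = Counter(), Counter()
            n = len(sentences)
            for sentence in sentences:
                words = set(sentence.lower().split())
                word_count.update(words)
                pair_count.update(frozenset(p) for p in combinations(sorted(words), 2))
            scores = {}
            for pair, c_xy in pair_count.items():
                x, y = tuple(pair)
                # PMI(x, y) = log( P(x, y) / (P(x) P(y)) )
                scores[pair] = math.log((c_xy / n) / ((word_count[x] / n) * (word_count[y] / n)))
            return scores

        sentences = ["data mining is fun", "data mining rocks", "pizza is fun"]
        for pair, s in sorted(pmi_scores(sentences).items(), key=lambda kv: -kv[1])[:3]:
            print(sorted(pair), round(s, 3))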

  • Principal Component Analysis

    Let us implement PCA and project some data points from a high-dimensional space to a low-dimensional space. You can download the [notebook] and the [HTML] versions for this task.
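
    The essential steps fit in a few lines of numpy; a minimal sketch on random data (the lab notebook applies this to a real dataset):

        import numpy as np

        def pca(X, n_components=2):
            """Project the rows of X onto the top principal components."""
            X_centred = X - X.mean(axis=0)            # centre the data
            cov = np.cov(X_centred, rowvar=False)     # covariance matrix of the features
            eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
            top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
            return X_centred @ top                    # low-dimensional projection

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 5))                 # 100 points in 5 dimensions
        print(pca(X, n_components=2).shape)           # (100, 2)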

  • Build your own search engine!

    Let us build an inverted index and process queries containing two keywords. Download and unzip this file and run the search.py script. It will index the 20 newsgroups dataset. We will assign integer ids to file names and sort the posting lists. Next, issue a query at the prompt containing two words separated by a space. We will then find the corresponding posting lists and compute their intersection. Once you have carefully studied the behaviour of the matching algorithm (uncomment the print statements inside the get_results function if needed), extend search.py to handle (a) queries containing an arbitrary number of keywords and (b) skip-pointers to speed up the lookup.
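
    The core of such a system is small. Here is a self-contained sketch of indexing and the classic sorted-list intersection (the provided search.py adds file-name handling and more):

        def build_index(docs):
            """Map each word to a sorted posting list of document ids."""
            index = {}
            for doc_id, text in enumerate(docs):
                for word in set(text.lower().split()):
                    index.setdefault(word, []).append(doc_id)
            return index  # ids are appended in increasing order, so lists stay sorted

        def intersect(p1, p2):
            """Merge two sorted posting lists with the two-pointer algorithm."""
            i = j = 0
            result = []
            while i < len(p1) and j < len(p2):
                if p1[i] == p2[j]:
                    result.append(p1[i])
                    i += 1
                    j += 1
                elif p1[i] < p2[j]:
                    i += 1
                else:
                    j += 1
            return result

        docs = ["data mining lecture", "graph mining", "data structures"]
        index = build_index(docs)
        print(intersect(index["data"], index["mining"]))  # [0]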

  • Train a multi-layer feed-forward neural network

    Download the source code and data and study NN.py. This script implements a multilayer neural network with a softmax output and a single hidden layer. Hyperbolic tangent (tanh) activation is used in the hidden layer. The NN classifies the Iris dataset that you used in CA1.
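
    For orientation, here is a compact from-scratch version of the same architecture (tanh hidden layer, softmax output) trained on toy data by gradient descent; NN.py itself will differ in structure and data handling:

        import numpy as np

        rng = np.random.default_rng(0)

        def softmax(z):
            e = np.exp(z - z.max(axis=1, keepdims=True))  # numerically stable
            return e / e.sum(axis=1, keepdims=True)

        # Toy 2-class data: two Gaussian blobs in 2D.
        X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
        y = np.array([0] * 50 + [1] * 50)
        Y = np.eye(2)[y]                                  # one-hot targets

        # Parameters: one tanh hidden layer (8 units), softmax output.
        W1 = rng.normal(0, 0.1, (2, 8)); b1 = np.zeros(8)
        W2 = rng.normal(0, 0.1, (8, 2)); b2 = np.zeros(2)
        lr = 0.1

        for epoch in range(200):
            H = np.tanh(X @ W1 + b1)            # forward pass: hidden activations
            P = softmax(H @ W2 + b2)            # forward pass: class probabilities
            dZ2 = (P - Y) / len(X)              # cross-entropy gradient w.r.t. output logits
            dW2 = H.T @ dZ2
            db2 = dZ2.sum(axis=0)
            dZ1 = (dZ2 @ W2.T) * (1 - H ** 2)   # backprop through the tanh layer
            dW1 = X.T @ dZ1
            db1 = dZ1.sum(axis=0)
            W1 -= lr * dW1; b1 -= lr * db1
            W2 -= lr * dW2; b2 -= lr * db2

        print("training accuracy:", (P.argmax(axis=1) == y).mean())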

Problem Sets

The following problem sets are for evaluating your understanding of the various topics that we have covered in the lectures. Try them by yourselves first. We will discuss the solutions during the lectures and lab sessions later. You are not required to submit your solutions; they will not be marked and will not count towards your final mark for the module. The problem sets are for self-assessment only.

MSc projects

The following summer MSc projects are available to CS students at UoL. If you are interested, please contact me.

  • Exploring the World of Deep Learning through GANs

    Deep learning has received much attention lately due to its impressive performance in various real-world applications. Deep learning systems have already outperformed humans in various tasks. For example, in the Large Scale Visual Recognition Challenge (ILSVRC), every year deeper and more complex neural network models have outperformed humans in recognising objects in images. Machine translation systems such as Google's machine translation and Microsoft's real-time voice-to-voice translation system are powered by bi-directional sequence-to-sequence models and neural language models. In this project, we will explore the frontiers of deep learning. Specifically, we will explore the power of a recently proposed deep learning architecture called Generative Adversarial Networks, or GANs. A GAN consists of two components: a discriminative model and a generative model. The generative model (a counterfeit money producer) tries to generate data that can fool the discriminator (the police), while the discriminator tries to separate actual data from the fakes produced by the generator. By optimising the discriminator and the generator jointly, we can learn a highly accurate discriminative model and, at the same time, generate new training data using the generative model. An example use of a GAN to generate images can be seen here. In this MSc project, we will implement two variants of GANs, f-GAN and WGAN, and compare their performance on an NLP task.

  • Cross-lingual Translation Quality Analysis

    The world is full of languages, and unfortunately the availability of information is unequal across them. Some information might be available only in a particular language, which non-speakers of that language cannot understand. Machine translation (MT) has emerged as an attractive solution to this language barrier to information access. MT systems that can accurately and efficiently translate documents across a wide range of languages have been developed, such as the Neural Machine Translation (NMT) system used in Google MT. Unfortunately, MT systems are not yet perfect, and humans still need to be in the translation loop. In this project, we will consider the problem of automatically evaluating the quality of a human-translated text using bilingual word embeddings. The project is related to an ongoing industrial collaboration here at UoL, and if successful, you will have the opportunity to contribute to a system that will be used by millions of users across the world!

References

There is no official textbook for this course. The following is a recommended list of textbooks, papers, and websites for the various topics covered in this course.

  1. Pattern Recognition and Machine Learning, by Chris Bishop. For machine learning related topics.

  2. A Course in Machine Learning, by Hal Daume III. Excellent introductory material on various topics on machine learning.

  3. Data Mining: Practical Machine Learning Tools and Techniques by Ian Witten. For decision tree learners, associative rule mining, data pre-processing related topics.

  4. Foundations of Statistical Natural Language Processing by Christopher Manning. For text processing/mining related topics.

  5. Introduction to Linear Algebra by Gilbert Strang is a good reference for brushing up on linear algebra. MIT video lectures based on the book are also available.

  6. An excellent reference on the maths required for data mining and machine learning, by Hal Daume III.

  7. numpy (Python numeric processing)

  8. scipy (MATLAB-like numerical functions for Python)

  9. LIBSVM (an SVM library written in C, with bindings for numerous languages including Python)

  10. scikit-learn (Machine Learning in Python)