Data Mining is a multidisciplinary field touching important topics across machine learning, natural language processing, information retrieval, and optimization. This lecture is delivered in the second semester at the Department of Computer Science, The University of Liverpool, as a master's-level module.
Release date: January 30th
Submission deadline: March 8, 15:00
Release date: March 1
Submission deadline: April 11, 15:00 HRS
Implementing the k-means Clustering Algorithm Download CA2data.txt
Submission deadline: April 11, 15:00 HRS
Exam date: 25th May, 10:00-12:30 6 Guild, Mountford Hall (18), 34 Chadwick Bdg, Barkla LT (29)
The concepts that we will be learning in the lectures will be further developed using a series of programming tutorials. We will implement some of the algorithms we learn in the course using Python, and we will also use some of the freely available machine learning and data mining tools. The two lab sessions are identical and you only need to attend one of the sessions per week. If your student number is even, attend the Thursday session; otherwise, attend the Friday session. The lab sessions are optional and attendance is not marked.
Following the notebook, try various data structures, functions and classes from Python. You can refer to the Official Tutorial or to an excellent and brief overview for further details. You can get IPython shell
Task: Write a Python program to measure the similarity between two given sentences. First, compute the set of words in each sentence, and then use the Jaccard coefficient to measure the similarity between the two sets. You can see the solution here.
The sample code we used in the lab is here
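For orientation, the Jaccard task above can be sketched as follows. This is a minimal illustration and not the lab solution: the function name jaccard_similarity, the lowercasing, and the whitespace tokenisation are all assumptions made for this example.

```python
def jaccard_similarity(sentence_a, sentence_b):
    """Jaccard coefficient between the word sets of two sentences."""
    # Assumption: words are obtained by lowercasing and splitting on whitespace.
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    union = words_a | words_b
    if not union:
        # Both sentences are empty; we define the similarity as 0 here.
        return 0.0
    return len(words_a & words_b) / len(union)


# The two sentences share {the, sat, on} out of 7 distinct words.
print(jaccard_similarity("the cat sat on the mat", "the dog sat on the log"))
```

The coefficient is the size of the intersection divided by the size of the union, so it is 1 for identical word sets and 0 for disjoint ones.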
Numpy is a Python library that provides data structures useful for data mining, such as arrays, and various functions on those data structures. Follow the official numpy tutorial and familiarize yourself with numpy.
Task: Using numpy, measure the cosine similarity between two given sentences. You can see the solution here [HTML] [notebook].
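One possible way to approach the cosine similarity task is sketched below. It is not the linked solution: the function name cosine_similarity, the whitespace tokenisation, and the use of raw word counts as vector entries are assumptions made for this example.

```python
import numpy as np


def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity between bag-of-words count vectors of two sentences."""
    # Assumption: lowercase and split on whitespace to obtain tokens.
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()
    # Build a shared vocabulary so both vectors live in the same space.
    vocab = sorted(set(tokens_a) | set(tokens_b))
    index = {word: i for i, word in enumerate(vocab)}
    vec_a = np.zeros(len(vocab))
    vec_b = np.zeros(len(vocab))
    for word in tokens_a:
        vec_a[index[word]] += 1
    for word in tokens_b:
        vec_b[index[word]] += 1
    norm = np.linalg.norm(vec_a) * np.linalg.norm(vec_b)
    if norm == 0:
        # At least one sentence is empty; we define the similarity as 0 here.
        return 0.0
    return float(vec_a @ vec_b / norm)


print(cosine_similarity("data mining is fun", "text mining is fun"))
```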
Let us perform various preprocessing steps on a set of feature vectors using numpy, such as L1/L2 normalization, [0,1] scaling, and Gaussian scaling. You can download the [notebook] and the [HTML] versions for this task.
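The four preprocessing steps named above can be sketched with numpy as follows. This is an illustration rather than the notebook's code: the function names are made up for this example, rows are assumed to be instances and columns features, and the sketch assumes no all-zero rows and no constant columns (which would cause division by zero).

```python
import numpy as np


def l1_normalize(X):
    # Scale each row (instance) so that its absolute values sum to 1.
    return X / np.abs(X).sum(axis=1, keepdims=True)


def l2_normalize(X):
    # Scale each row (instance) to unit Euclidean length.
    return X / np.linalg.norm(X, axis=1, keepdims=True)


def minmax_scale(X):
    # Map each column (feature) linearly onto the range [0, 1].
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return (X - lo) / (hi - lo)


def gaussian_scale(X):
    # Give each column (feature) zero mean and unit standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0)


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]])
print(l2_normalize(X))
print(minmax_scale(X))
```

Note that the two normalizations act on each instance, whereas the two scalings act on each feature across the dataset.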
Let us implement a k-NN classifier to perform binary classification and evaluate its accuracy. You can download the [notebook] and the [HTML] versions for this task.
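A minimal sketch of a k-NN classifier for binary labels is given below. It is not the notebook's implementation: the function names, the Euclidean distance, the label encoding {0, 1}, and the toy data are assumptions made for this example, and k is assumed to be odd so that votes cannot tie.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k=3):
    """Predict binary labels (0 or 1) by majority vote among k nearest neighbours."""
    predictions = []
    for x in X_test:
        # Euclidean distance from x to every training instance.
        dists = np.linalg.norm(X_train - x, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = y_train[nearest]
        # Predict 1 if more than half of the k neighbours have label 1.
        predictions.append(int(votes.sum() * 2 > k))
    return np.array(predictions)


def accuracy(y_true, y_pred):
    # Fraction of test instances whose predicted label matches the true label.
    return float(np.mean(y_true == y_pred))


# Hypothetical toy data: two well-separated clusters.
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5, 0.5], [5.5, 5.5]])
y_test = np.array([0, 1])
print(accuracy(y_test, knn_predict(X_train, y_train, X_test, k=3)))
```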
Let us implement a naive Bayes classifier to classify the data used in CA1. You can download a sample program here [Naive Bayes]. Uncompress the zip archive and run the Python program to obtain the classification accuracy on the test data. Modify the code and see what happens if we do not use Laplace smoothing.
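To make the role of Laplace smoothing concrete, here is a small sketch of a multinomial naive Bayes classifier. It is not the sample program from the archive: the function names, the assumption that features are non-negative counts, the smoothing parameter alpha, and the toy data are all made up for this illustration.

```python
import numpy as np


def train_nb(X, y, alpha=1.0):
    """Fit a multinomial naive Bayes model; alpha=1.0 is Laplace smoothing."""
    classes = np.unique(y)
    # Log prior: fraction of training instances in each class.
    log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
    # Total count of each feature within each class, shape (classes, features).
    counts = np.array([X[y == c].sum(axis=0) for c in classes])
    # Add alpha to every count so that no feature has zero probability.
    smoothed = counts + alpha
    log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood


def predict_nb(model, X):
    classes, log_prior, log_likelihood = model
    # Class score = log prior + sum over features of count * log likelihood.
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]


# Hypothetical toy data: rows are instances, columns are feature counts.
X_train = np.array([[2, 0, 1], [3, 0, 0], [0, 2, 1], [0, 3, 0]], dtype=float)
y_train = np.array([0, 0, 1, 1])
model = train_nb(X_train, y_train, alpha=1.0)
print(predict_nb(model, np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])))
```

With alpha set to 0, any feature that never occurs with a class gets probability zero for that class, and the logarithm of that probability is negative infinity, which is the failure mode that smoothing avoids.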
We will count the co-occurrences of words in a given collection of documents, and compute pointwise mutual information, the chi-squared measure, and the log-likelihood ratio. See the lecture notes on text mining for the definition of those word association measures. The sample program that implements the pointwise mutual information can be downloaded from here. When you uncompress the archive you will find two files: corpus.txt (contains 2000 sentences, one per line; we will assume each line to be a co-occurrence window) and cooc.py (computes the co-occurrences between words from corpus.txt and prints the top-ranked word pairs in terms of their pointwise mutual information values). You are required to implement the chi-squared measure and the log-likelihood ratio. A larger 100,000-sentence corpus (large_corpus.txt) is also provided if you wish to experiment with larger text datasets.
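As a rough illustration of the pointwise mutual information step, the sketch below treats each sentence as one co-occurrence window, as described above. It is not cooc.py: the function name pmi_scores, the whitespace tokenisation, the base-2 logarithm, and the choice to count each word at most once per window are assumptions made for this example. Probabilities are estimated as the fraction of windows containing a word or a word pair.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(sentences):
    """PMI for every word pair that co-occurs in at least one sentence."""
    n = len(sentences)
    word_count = Counter()  # number of windows containing each word
    pair_count = Counter()  # number of windows containing each word pair
    for sentence in sentences:
        # Sorting makes each pair appear in one canonical order.
        words = sorted(set(sentence.lower().split()))
        word_count.update(words)
        pair_count.update(combinations(words, 2))
    scores = {}
    for (u, v), c in pair_count.items():
        # PMI = log [ p(u, v) / (p(u) p(v)) ] with p estimated as count / n.
        scores[(u, v)] = math.log2(c * n / (word_count[u] * word_count[v]))
    return scores


# Hypothetical toy corpus standing in for corpus.txt.
corpus = ["a b", "a b", "c d", "a c"]
for pair, score in sorted(pmi_scores(corpus).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```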
Let us implement PCA and project some data points from high dimensional space to low dimensional space. You can download the [notebook] and the [HTML] versions for this task.
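A compact sketch of PCA via the eigendecomposition of the covariance matrix is shown below. It is not the notebook's code: the function name pca_project and the random data are assumptions made for this example.

```python
import numpy as np


def pca_project(X, n_components):
    """Project the rows of X onto the top principal components."""
    # Centre the data so that each feature has zero mean.
    X_centered = X - X.mean(axis=0)
    # Covariance matrix of the features (columns are variables).
    cov = np.cov(X_centered, rowvar=False)
    # eigh is appropriate because the covariance matrix is symmetric.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return X_centered @ components


# Hypothetical data: 50 points in 5 dimensions projected down to 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Z = pca_project(X, 2)
print(Z.shape)
```

The projected coordinates are uncorrelated with each other, and the first coordinate carries at least as much variance as the second.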
Let us build an inverted index and process queries containing two keywords. Download and unzip this file and run the search.py script. It will index the 20 newsgroups dataset. We will assign integer ids to file names and sort the posting lists. Next, issue a query at the prompt containing two words separated by a space. We will then find the corresponding posting lists and compute their intersection. Once you have carefully studied the behaviour of the matching algorithm (uncomment the print statements inside the get_results function if needed), extend search.py to (a) handle queries containing an arbitrary number of keywords and (b) use skip pointers to speed up the lookup.
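The two-keyword case described above can be sketched as follows. This is not search.py: the function names build_index and intersect, the whitespace tokenisation, and the toy documents are assumptions made for this example, and the extensions in (a) and (b) are deliberately left out.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to a sorted posting list of integer document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def intersect(p1, p2):
    """Merge-style intersection of two sorted posting lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # The document contains both keywords.
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical toy collection standing in for the 20 newsgroups files.
docs = ["data mining course", "text mining", "data structures", "mining data streams"]
index = build_index(docs)
print(intersect(index["data"], index["mining"]))
```

Because both posting lists are sorted, the intersection needs only one pass over each list, which is why the lists are sorted at indexing time.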
Download the source code and data and study NN.py. This script implements a multilayer neural network with a softmax output and a single hidden layer. Hyperbolic tangent (tanh) activation is used in the hidden layer. The NN classifies the Iris dataset that you used in CA1.
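The forward pass of the architecture described above (one tanh hidden layer followed by a softmax output) can be sketched as below. It is not NN.py: the function names, the parameter names W1, b1, W2, b2, the hidden layer size of 8, and the random initialisation are assumptions made for this example. The input and output sizes follow the Iris dataset, which has 4 features and 3 classes.

```python
import numpy as np


def softmax(z):
    # Subtract the row-wise maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def forward(X, W1, b1, W2, b2):
    """Return class probabilities for each row of X."""
    hidden = np.tanh(X @ W1 + b1)     # hidden layer with tanh activation
    return softmax(hidden @ W2 + b2)  # softmax output layer


# Hypothetical sizes: 4 input features, 8 hidden units, 3 classes.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(4, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 3))
b2 = np.zeros(3)
X = rng.normal(size=(5, 4))
print(forward(X, W1, b1, W2, b2).sum(axis=1))
```

Training is not shown here; it would adjust the four parameter arrays so that the predicted probabilities match the class labels.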
The following problem sets are for evaluating your understanding of the various topics that we have covered in the lectures. Try these by yourselves first. We will discuss the solutions during the lectures and lab sessions later. You are not required to submit your solutions, and they will not be marked or count towards your final mark for the module. The problem sets are for self-assessment only.
Problem Set 0 is out.
Problem Set 1 is out.
Problem Set 2 is out.
Problem Set 3 is out.
The following summer MSc projects are available to CS students at UoL. If you are interested please contact me.
Deep learning has received much attention lately due to its impressive performance in various real-world applications. Deep learning systems have already outperformed humans in various tasks. For example, in the Large Scale Visual Recognition Challenge (ILSVRC), deeper and more complex neural network models introduced each year have outperformed humans in recognising objects in images. Machine translation systems such as Google’s machine translation and Microsoft’s real-time voice-to-voice translation system are powered by bi-directional sequence-to-sequence models and neural language models. In this project, we will explore the frontiers of deep learning. Specifically, we will explore the power of a recently proposed deep learning architecture called Generative Adversarial Networks, or GANs. A GAN consists of two components: a discriminative model and a generative model. The generative model (counterfeit money producer) would like to generate data that can fool the discriminator (police), and the discriminator would like to separate actual data from the noise. By optimising both the discriminator and the generator jointly, we will be able to learn a highly accurate discriminative model and, at the same time, generate new training data using the generative model. An example use of a GAN to generate images can be seen here. In this MSc project, we will implement two variants of GANs, f-GAN and WGAN, and compare their performance in an NLP task.
The world is full of languages and, unfortunately, the availability of information is unequal across different languages. Some information might be available only in a particular language, which might not be understood by non-speakers of that language. Machine translation (MT) has emerged as an attractive solution to this language barrier to information access. Machine translation systems that can accurately and efficiently translate documents across a wide range of languages have been developed, such as the Neural Machine Translation (NMT) system used in Google MT. Unfortunately, MT systems are not yet perfect, and humans still need to be in the translation loop. In this project, we will consider the problem of automatically evaluating the quality of a human-translated text using bilingual word embeddings. The project is related to an ongoing industrial collaboration here at UoL, and if successful, you will have the opportunity to contribute to a system that will be used by millions of users across the world!
There is no specific official textbook for this course. The following is a recommended list of textbooks, papers, and web sites for the various topics covered in this course.
Pattern Recognition and Machine Learning, by Chris Bishop. For machine learning related topics.
A Course in Machine Learning, by Hal Daume III. Excellent introductory material on various topics on machine learning.
Data Mining: Practical Machine Learning Tools and Techniques, by Ian Witten. For decision tree learners, association rule mining, and data pre-processing related topics.
Foundations of Statistical Natural Language Processing, by Christopher Manning. For text processing/mining related topics.
Introduction to Linear Algebra, by Gilbert Strang, is a good reference to brush up on linear algebra related topics. MIT video lectures based on the book are also available.
An excellent reference of maths required for data mining and machine learning by Hal Daume III.
numpy (Python numeric processing)
scipy (MATLAB-like functions for Python)
LIBSVM (SVM library written in C++, with bindings for numerous languages including Python)
scikit-learn (Machine Learning in Python)