Data Mining is a multidisciplinary field touching important topics across machine learning, natural language processing, information retrieval, and optimisation. This module is delivered in the second semester at the Department of Computer Science, The University of Liverpool, as a master's-level module. It requires undergraduate-level knowledge of linear algebra, calculus, and probability theory as mathematical preliminaries. All assignments must be submitted in Python 3, and the lab sessions require prior knowledge of the Python programming language.
Mondays 11:00-12:00 BROD-108
Wednesdays 12:00-13:00 BROD-108
Thursdays 15:00-16:00 ELEC-202
Release date: January 28th
Submission deadline: March 5th, 15:00
Release date: March 6th
Submission deadline: April 1st, 15:00 HRS
Implementing the k-means Clustering Algorithm Download CA2data.txt
Exam date: 28th May 10:00-12:30 Exam room 31 Engineering Walker LT
Release date: July 6th
Submission deadline: July 31st 15:00 HRS (via Email to danushka)
Release date: July 6th
Implementing Hierarchical Clustering Algorithm Download CA2data.txt
Submission deadline: July 31st 15:00 HRS (via Email to danushka)
Exam date: TBD
See below for past resit questions and answers. Please use these in your revision for the resit.
The concepts that we will be learning in the lectures will be further developed through a series of programming tutorials. We will implement some of the algorithms taught in the course in Python, and also use some of the freely available machine learning and data mining tools. The two lab sessions are identical and you only need to attend one session per week. If your student number is even, attend the Monday lab session; otherwise attend the Thursday lab session. Attendance is marked for the lab sessions.
We will be using Python 3 as the preferred programming language for learning Data Mining algorithms, and all coursework must be submitted in Python as well. The only external libraries we will use are NumPy and Matplotlib. All sample code related to the lab tasks is hosted on Google Colab; you will need a Google account to copy the sample code to your Google Drive and run it.
Task: Write a Python program to measure the similarity between two given sentences. First, compute the set of words in each sentence, and then use the Jaccard coefficient to measure the similarity between the two sets. You can see the solution here.
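A minimal sketch of this task (the linked solution is the reference; the example sentences below are invented):

```python
# Jaccard similarity between two sentences, treating each as a set of words.
def jaccard(sentence_a, sentence_b):
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    # |A intersect B| / |A union B|; defined as 0 when both sets are empty.
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)

print(jaccard("data mining is fun", "text mining is fun too"))  # -> 0.5
```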
NumPy is a Python library that provides data structures useful for data mining, such as arrays, along with various functions on those data structures. Follow the official NumPy tutorial and familiarise yourself with NumPy.
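For orientation, here are a couple of the basics the tutorial covers (array creation, elementwise arithmetic, and the inner product):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(x + y)         # elementwise addition -> [5. 7. 9.]
print(x * 2)         # broadcasting a scalar -> [2. 4. 6.]
print(np.dot(x, y))  # inner product -> 32.0
```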
Task: Using NumPy, measure the cosine similarity between two given sentences. You can see the solution here [HTML].
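One possible sketch, assuming each sentence is turned into a bag-of-words count vector over the combined vocabulary (the linked solution may build the vectors differently):

```python
import numpy as np

def cosine_similarity(sentence_a, sentence_b):
    # Build bag-of-words count vectors over the shared vocabulary.
    words_a = sentence_a.lower().split()
    words_b = sentence_b.lower().split()
    vocab = sorted(set(words_a) | set(words_b))
    vec_a = np.array([words_a.count(w) for w in vocab], dtype=float)
    vec_b = np.array([words_b.count(w) for w in vocab], dtype=float)
    # cos(a, b) = a.b / (||a|| ||b||)
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

print(cosine_similarity("data mining is fun", "text mining is fun too"))
```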
This lab task will show you how to perform various vector and matrix operations/arithmetic using Python and NumPy. Walk through the various functions and familiarise yourself with the code. If unsure, check the official documentation on the NumPy website for the different functions and the arguments/values they can take. Here is the [notebook].
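A few representative operations of the kind the notebook covers (the notebook itself may use different examples):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, -1.0])

print(A @ v)             # matrix-vector product -> [-1. -1.]
print(A.T)               # transpose
print(np.linalg.inv(A))  # matrix inverse (A must be non-singular)
print(np.linalg.norm(v)) # Euclidean (L2) norm -> 1.4142...
```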
Let's perform various preprocessing steps on a set of feature vectors, such as L1/L2 normalisation, [0,1] scaling, and Gaussian scaling, using NumPy. You can download the [notebook] for this task.
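A sketch of these preprocessing steps on a toy matrix whose rows are feature vectors (the notebook is the reference; the data below is made up):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # rows are feature vectors

# L2 normalisation: each row is scaled to unit Euclidean length.
X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

# L1 normalisation: the absolute values in each row sum to 1.
X_l1 = X / np.sum(np.abs(X), axis=1, keepdims=True)

# [0,1] scaling per feature (column): (x - min) / (max - min).
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Gaussian (z-score) scaling per feature: zero mean, unit standard deviation.
X_gauss = (X - X.mean(axis=0)) / X.std(axis=0)
```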
Let's implement a k-NN classifier to perform binary classification and evaluate its accuracy. You can download the [notebook].
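A minimal k-NN sketch using Euclidean distance and a majority vote (the knn_predict helper and the toy data below are invented for illustration; the notebook is the reference):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point to every training instance.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of the k nearest neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2D binary-labelled data.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9])))  # -> 1
```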
Let's implement a naive Bayes classifier for text classification. You can download a sample program and data here [Naive Bayes]. Uncompress the zip archive and run the Python program to obtain the classification accuracy on the test data. The train and test files contain positively and negatively labelled product reviews. The reviews are preprocessed and expressed as lists of unigrams and bigrams (indicated by underscores) following the bag-of-words model. Each review is represented as a single line in the file, and the features (unigrams and bigrams) extracted from that review are listed space-delimited. Modify the naive Bayes code and see what happens if we do not use Laplace smoothing.
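The provided program is the reference implementation; the sketch below only illustrates where Laplace (add-one) smoothing enters the class-conditional probability estimates, with invented toy reviews:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    # docs: list of feature lists (unigrams/bigrams); labels: list of class labels.
    vocab = {f for doc in docs for f in doc}
    priors, cond = {}, {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = len(class_docs) / len(docs)
        counts = Counter(f for d in class_docs for f in d)
        total = sum(counts.values())
        # Laplace smoothing: add alpha to every count so unseen features do not
        # get zero probability (set alpha=0.0 to see the effect without smoothing).
        cond[c] = {f: (counts[f] + alpha) / (total + alpha * len(vocab)) for f in vocab}
    return priors, cond

def predict_nb(doc, priors, cond):
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for f in doc:
            if f in cond[c] and cond[c][f] > 0:
                score += math.log(cond[c][f])
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["good", "great_movie"], ["bad", "boring_plot"], ["good", "fun"]]
labels = ["pos", "neg", "pos"]
priors, cond = train_nb(docs, labels)
print(predict_nb(["good"], priors, cond))  # -> 'pos'
```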
We will count the co-occurrences of words in a given collection of documents, and compute the pointwise mutual information, the chi-squared measure, and the log-likelihood ratio. See the lecture notes on text mining for the definitions of these word association measures. The sample program that implements pointwise mutual information can be downloaded from here. When you uncompress the archive you will find two files: corpus.txt (contains 2000 sentences, one per line; we treat each line as a co-occurrence window) and cooc.py (computes the co-occurrences between words in corpus.txt and prints the top-ranked word pairs in terms of their pointwise mutual information values). You are required to implement the chi-squared measure and the log-likelihood ratio. A larger 100,000-sentence corpus (large_corpus.txt) is also provided if you wish to experiment with larger text datasets.
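A rough sketch of how a PMI computation over per-line co-occurrence windows could look (the helper below is an assumption for illustration, not the actual cooc.py code):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    # Each sentence (line) is treated as one co-occurrence window.
    word_counts = Counter()
    pair_counts = Counter()
    n_windows = len(sentences)
    for sent in sentences:
        words = set(sent.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    scores = {}
    for pair, c_xy in pair_counts.items():
        x, y = tuple(pair)
        # PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), probabilities estimated over windows.
        p_xy = c_xy / n_windows
        p_x = word_counts[x] / n_windows
        p_y = word_counts[y] / n_windows
        scores[pair] = math.log(p_xy / (p_x * p_y))
    return scores
```

The chi-squared measure and the log-likelihood ratio can be computed from the same window, word, and pair counts.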
Let us implement PCA and project some data points from high dimensional space to low dimensional space. You can download the [notebook].
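A minimal PCA sketch via the eigendecomposition of the covariance matrix (the notebook is the reference and may use a different formulation, e.g. SVD; the data below is random toy data):

```python
import numpy as np

def pca_project(X, n_components=2):
    # Centre the data, then project onto the top principal components.
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)       # features x features covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return X_centred @ top

# Project 5 random points from 4 dimensions down to 2.
X = np.random.randn(5, 4)
print(pca_project(X, n_components=2).shape)  # (5, 2)
```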
Let us build an inverted index and process queries containing two keywords. Download and unzip this file and run the search.py script. It will index the 20 newsgroups dataset. We will assign integer ids to file names and sort the posting lists. Next, issue a query at the prompt containing two words separated by a space. We will then find the corresponding posting lists and compute their intersection. Once you have carefully studied the behaviour of the matching algorithm (uncomment the print statements inside the get_results function if needed), extend search.py to (a) handle queries containing an arbitrary number of keywords and (b) use skip pointers to speed up the lookup.
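The search.py script is the reference; the sketch below only shows the standard merge-style intersection of two sorted posting lists, with a made-up toy index:

```python
def intersect(postings_a, postings_b):
    # Merge two sorted posting lists of document ids.
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# Toy index mapping terms to sorted lists of document ids.
index = {"data": [1, 3, 5, 8], "mining": [2, 3, 5, 9]}
print(intersect(index["data"], index["mining"]))  # [3, 5]
```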
Download the source code and data and study NN.py. This script implements a multilayer neural network with a softmax output and a single hidden layer. Hyperbolic tangent (tanh) activation is used in the hidden layer. The NN classifies the Iris dataset that you used in CA1.
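NN.py is the reference implementation; the following sketch only shows the forward pass of such a network (tanh hidden layer, softmax output) with assumed toy shapes and random weights:

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    # Single hidden layer with tanh activation, followed by a softmax output layer.
    hidden = np.tanh(X @ W1 + b1)
    return softmax(hidden @ W2 + b2)

# Toy shapes: 4 input features (as in Iris), 5 hidden units, 3 classes.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 5)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 3)), np.zeros(3)
probs = forward(rng.standard_normal((2, 4)), W1, b1, W2, b2)
print(probs.sum(axis=1))  # each row of class probabilities sums to 1
```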
The following problem sets are for evaluating your understanding of the various topics that we have covered in the lectures. Try these by yourselves first; we will discuss the solutions during the lectures and lab sessions later. You are not required to submit your solutions, and they will not be marked or count towards your final mark for the module. The problem sets are for self-assessment only.
Here is a list of useful references.
Mathematics for Machine Learning. Read Part I (in particular chapters 2, 3, 4, 5 and 6).
Pattern Recognition and Machine Learning, by Chris Bishop. For machine learning related topics.
A Course in Machine Learning, by Hal Daume III. Excellent introductory material on various topics in machine learning.
Data Mining: Practical Machine Learning Tools and Techniques, by Ian Witten. For decision tree learners, association rule mining, and data pre-processing related topics.
Foundations of Statistical Natural Language Processing, by Christopher Manning. For text processing/mining related topics.
Introduction to Linear Algebra, by Gilbert Strang, is a good reference to brush up on linear algebra related topics. MIT video lectures based on the book are also available.
An excellent reference on the maths required for data mining and machine learning, by Hal Daume III.
numpy (Python numeric processing)
scipy (Python MATLAB-like functions)