Data Mining is a multidisciplinary field touching important topics across machine learning, natural language processing, information retrieval, and optimisation. This module is delivered in the second semester at the Department of Computer Science, The University of Liverpool, as a master's-level module. It requires undergraduate-level knowledge of linear algebra, calculus, and probability theory as mathematical preliminaries. All assignments must be submitted in Python 3, and the lab sessions require prior knowledge of the Python programming language.
Mondays 11:00-12:00 BROD-108
Wednesdays 12:00-13:00 BROD-108
Thursdays 15:00-16:00 ELEC-202
Release date: January 28th
Submission deadline: March 5th, 15:00
Release date: March 6th
Submission deadline: April 1st, 15:00 HRS
Implementing the k-means Clustering Algorithm Download CA2data.txt
Exam date: 28th May 10:00-12:30 Exam room 31 Engineering Walker LT
Release date: July 6th
Submission deadline: July 31st 15:00 HRS (via Email to danushka)
Release date: July 6th
Implementing Hierarchical Clustering Algorithm Download CA2data.txt
Submission deadline: July 31st 15:00 HRS (via Email to danushka)
Exam date: TBD
See below for past resit questions and answers. Please use these in your revision for the resit.
The concepts that we will be learning in the lectures will be further developed through a series of programming tutorials. We will implement some of the algorithms taught in the course in Python, and also use some of the freely available machine learning and data mining tools. The two lab sessions are identical and you only need to attend one session per week. If your student number is even, attend the Monday lab session; otherwise attend the Thursday lab session. Attendance is marked for the lab sessions.
We will be using Python 3 as the preferred programming language for learning Data Mining algorithms, and all coursework must be submitted in Python as well. The only external libraries we will use are NumPy and Matplotlib. All sample code related to the lab tasks is hosted on Google Colab; you will need a Google account to copy the sample code to your Google Drive and run it.
Task: Write a Python program to measure the similarity between two given sentences. First, compute the set of words in each sentence, and then use the Jaccard coefficient to measure the similarity between the two sets. You can see the solution here.
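A minimal sketch of this task (the linked solution is the reference; the example sentences below are invented):

```python
# Jaccard similarity between two sentences, treating each as a set of words.
def jaccard(sentence_a, sentence_b):
    words_a = set(sentence_a.lower().split())
    words_b = set(sentence_b.lower().split())
    # |A intersect B| / |A union B|; defined as 0 when both sets are empty.
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)

print(jaccard("data mining is fun", "text mining is fun too"))  # -> 0.5
```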
NumPy is a Python library that provides data structures useful for data mining, such as arrays, along with various functions on those data structures. Follow the official NumPy tutorial and familiarise yourself with NumPy.
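For orientation, here are a couple of the basics the tutorial covers (array creation, elementwise arithmetic, and the inner product):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(x + y)         # elementwise addition -> [5. 7. 9.]
print(x * 2)         # broadcasting a scalar -> [2. 4. 6.]
print(np.dot(x, y))  # inner product -> 32.0
```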
Task: Using NumPy, measure the cosine similarity between two given sentences. You can see the solution here [HTML].
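One possible sketch, assuming each sentence is turned into a bag-of-words count vector over the combined vocabulary (the linked solution may build the vectors differently):

```python
import numpy as np

def cosine_similarity(sentence_a, sentence_b):
    # Build bag-of-words count vectors over the shared vocabulary.
    words_a = sentence_a.lower().split()
    words_b = sentence_b.lower().split()
    vocab = sorted(set(words_a) | set(words_b))
    vec_a = np.array([words_a.count(w) for w in vocab], dtype=float)
    vec_b = np.array([words_b.count(w) for w in vocab], dtype=float)
    # cos(a, b) = a.b / (||a|| ||b||)
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

print(cosine_similarity("data mining is fun", "text mining is fun too"))
```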
This lab task will show you how to perform various vector and matrix operations/arithmetic using Python and NumPy. Walk through the various functions and familiarise yourself with the code. If unsure, check the official documentation on the NumPy website for the different functions and the arguments/values they can take. Here is the [notebook].
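A few representative operations of the kind the notebook covers (the notebook itself may use different examples):

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
v = np.array([1.0, -1.0])

print(A @ v)             # matrix-vector product -> [-1. -1.]
print(A.T)               # transpose
print(np.linalg.inv(A))  # matrix inverse (A must be non-singular)
print(np.linalg.norm(v)) # Euclidean (L2) norm -> 1.4142...
```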
Let's perform various preprocessing steps on a set of feature vectors, such as L1/L2 normalisation, [0,1] scaling, and Gaussian scaling, using NumPy. You can download the [notebook] for this task.
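A sketch of these preprocessing steps on a toy matrix whose rows are feature vectors (the notebook is the reference; the data below is made up):

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])  # rows are feature vectors

# L2 normalisation: each row is scaled to unit Euclidean length.
X_l2 = X / np.linalg.norm(X, axis=1, keepdims=True)

# L1 normalisation: the absolute values in each row sum to 1.
X_l1 = X / np.sum(np.abs(X), axis=1, keepdims=True)

# [0,1] scaling per feature (column): (x - min) / (max - min).
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Gaussian (z-score) scaling per feature: zero mean, unit standard deviation.
X_gauss = (X - X.mean(axis=0)) / X.std(axis=0)
```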
Let's implement a k-NN classifier to perform binary classification and evaluate its accuracy. You can download the [notebook].
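A minimal k-NN sketch using Euclidean distance and a majority vote (the knn_predict helper and the toy data below are invented for illustration; the notebook is the reference):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # Euclidean distance from the query point to every training instance.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of the k nearest neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy 2D binary-labelled data.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9])))  # -> 1
```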
Let's implement a naive Bayes classifier for text classification. You can download a sample program and data here [Naive Bayes]. Uncompress the zip archive and run the Python program to obtain the classification accuracy on the test data. The train and test files contain positively and negatively labelled product reviews. The reviews are preprocessed and expressed as lists of unigrams and bigrams (indicated by underscores) following the bag-of-words model. Each review is represented as a single line in the file, and the features (unigrams and bigrams) extracted from that review are listed space-delimited. Modify the naive Bayes code and see what happens if we do not use Laplace smoothing.
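The provided program is the reference implementation; the sketch below only illustrates where Laplace (add-one) smoothing enters the class-conditional probability estimates, with invented toy reviews:

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    # docs: list of feature lists (unigrams/bigrams); labels: list of class labels.
    vocab = {f for doc in docs for f in doc}
    priors, cond = {}, {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        priors[c] = len(class_docs) / len(docs)
        counts = Counter(f for d in class_docs for f in d)
        total = sum(counts.values())
        # Laplace smoothing: add alpha to every count so unseen features do not
        # get zero probability (set alpha=0.0 to see the effect without smoothing).
        cond[c] = {f: (counts[f] + alpha) / (total + alpha * len(vocab)) for f in vocab}
    return priors, cond

def predict_nb(doc, priors, cond):
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for f in doc:
            if f in cond[c] and cond[c][f] > 0:
                score += math.log(cond[c][f])
        scores[c] = score
    return max(scores, key=scores.get)

docs = [["good", "great_movie"], ["bad", "boring_plot"], ["good", "fun"]]
labels = ["pos", "neg", "pos"]
priors, cond = train_nb(docs, labels)
print(predict_nb(["good"], priors, cond))  # -> 'pos'
```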
We will count the co-occurrences of words in a given collection of documents, and compute the pointwise mutual information, the chi-squared measure, and the log-likelihood ratio. See the lecture notes on text mining for the definitions of these word association measures. The sample program that implements pointwise mutual information can be downloaded from here. When you uncompress the archive you will find two files: corpus.txt (contains 2000 sentences, one per line; we treat each line as a co-occurrence window) and cooc.py (computes the co-occurrences between words in corpus.txt and prints the top-ranked word pairs in terms of their pointwise mutual information values). You are required to implement the chi-squared measure and the log-likelihood ratio. A larger 100,000-sentence corpus (large_corpus.txt) is also provided if you wish to experiment with larger text datasets.
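A rough sketch of how a PMI computation over per-line co-occurrence windows could look (the helper below is an assumption for illustration, not the actual cooc.py code):

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    # Each sentence (line) is treated as one co-occurrence window.
    word_counts = Counter()
    pair_counts = Counter()
    n_windows = len(sentences)
    for sent in sentences:
        words = set(sent.lower().split())
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))
    scores = {}
    for pair, c_xy in pair_counts.items():
        x, y = tuple(pair)
        # PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), probabilities estimated over windows.
        p_xy = c_xy / n_windows
        p_x = word_counts[x] / n_windows
        p_y = word_counts[y] / n_windows
        scores[pair] = math.log(p_xy / (p_x * p_y))
    return scores
```

The chi-squared measure and the log-likelihood ratio can be computed from the same window, word, and pair counts.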
Let us implement PCA and project some data points from high dimensional space to low dimensional space. You can download the [notebook].
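A minimal PCA sketch via the eigendecomposition of the covariance matrix (the notebook is the reference and may use a different formulation, e.g. SVD; the data below is random toy data):

```python
import numpy as np

def pca_project(X, n_components=2):
    # Centre the data, then project onto the top principal components.
    X_centred = X - X.mean(axis=0)
    cov = np.cov(X_centred, rowvar=False)       # features x features covariance
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return X_centred @ top

# Project 5 random points from 4 dimensions down to 2.
X = np.random.randn(5, 4)
print(pca_project(X, n_components=2).shape)  # (5, 2)
```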
Let us build an inverted index and process queries containing two keywords. Download and unzip this file and run the search.py script. It will index the 20 newsgroups dataset. We will assign integer ids to file names and sort the posting lists. Next, issue a query at the prompt containing two words separated by a space. We will then find the corresponding posting lists and compute their intersection. Once you have carefully studied the behaviour of the matching algorithm (uncomment the print statements inside the get_results function if needed), extend search.py to (a) handle queries containing an arbitrary number of keywords and (b) use skip pointers to speed up the lookup.
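The search.py script is the reference; the sketch below only shows the standard merge-style intersection of two sorted posting lists, with a made-up toy index:

```python
def intersect(postings_a, postings_b):
    # Merge two sorted posting lists of document ids.
    result, i, j = [], 0, 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# Toy index mapping terms to sorted lists of document ids.
index = {"data": [1, 3, 5, 8], "mining": [2, 3, 5, 9]}
print(intersect(index["data"], index["mining"]))  # [3, 5]
```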
Download the source code and data and study NN.py. This script implements a multilayer neural network with a softmax output and a single hidden layer. Hyperbolic tangent (tanh) activation is used in the hidden layer. The NN classifies the Iris dataset that you used in CA1.
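NN.py is the reference implementation; the following sketch only shows the forward pass of such a network (tanh hidden layer, softmax output) with assumed toy shapes and random weights:

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    # Single hidden layer with tanh activation, followed by a softmax output layer.
    hidden = np.tanh(X @ W1 + b1)
    return softmax(hidden @ W2 + b2)

# Toy shapes: 4 input features (as in Iris), 5 hidden units, 3 classes.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 5)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 3)), np.zeros(3)
probs = forward(rng.standard_normal((2, 4)), W1, b1, W2, b2)
print(probs.sum(axis=1))  # each row of class probabilities sums to 1
```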
The following problem sets are for evaluating your understanding of the various topics that we have covered in the lectures. Try these by yourselves first; we will discuss the solutions during the lectures and lab sessions later. You are not required to submit your solutions, and they will not be marked or count towards your final mark for the module. The problem sets are for self-assessment only.
Here is a list of useful references.
Mathematics for Machine Learning. Read Part I (in particular chapters 2, 3, 4, 5 and 6).
Pattern Recognition and Machine Learning, by Chris Bishop. For machine learning related topics.
A Course in Machine Learning, by Hal Daume III. Excellent introductory material on various topics in machine learning.
Data Mining: Practical Machine Learning Tools and Techniques, by Ian Witten. For decision tree learners, association rule mining, and data pre-processing related topics.
Foundations of Statistical Natural Language Processing, by Christopher Manning. For text processing/mining related topics.
Introduction to Linear Algebra, by Gilbert Strang, is a good reference to brush up on linear algebra related topics. MIT video lectures based on the book are also available.
An excellent reference on the maths required for data mining and machine learning, by Hal Daume III.
numpy (Python numeric processing)
scipy (Python MATLAB-like functions)