In [70]:

```
import numpy
```

In [71]:

```
N = 5
```

In [72]:

```
pos = numpy.random.multivariate_normal([0,0], [[1,0],[0,1]], 2 * N)
neg = numpy.random.multivariate_normal([2,2], [[1,0],[0,1]], 2 * N)
```

In [73]:

```
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
```

In [74]:

```
plt.scatter(neg[:,0], neg[:,1], c='r')
plt.scatter(pos[:,0], pos[:,1], c='b')
```

Out[74]:

They do look decently separated. We might be able to learn a binary classifier using k-NNs.

Lets split our dataset into equal train and test portions.

In [76]:

```
train_pos = pos[:N,:]
test_pos = pos[N:,:]
train_neg = neg[:N,:]
test_neg = neg[:N,:]
```

In [77]:

```
print len(train_pos), len(train_neg)
```

In [78]:

```
print len(test_pos), len(test_neg)
```

Lets assign pos (+1) and neg (-1) labels to our train and test instances.

In [83]:

```
train_data = [(1, x) for x in train_pos]
train_data.extend([(-1, x) for x in train_neg])
```

In [84]:

```
test_data = [(1, x) for x in test_pos]
test_data.extend([(-1, x) for x in test_neg])
```

To see that we have properly made tuples of (label, instance) lets print train and test data.

In [85]:

```
print train_data
```

In [86]:

```
print test_data
```

In [87]:

```
def sim(p, q):
score = numpy.dot(p,q) / (numpy.linalg.norm(p) * numpy.linalg.norm(q))
return score
```

Now lets implement a function that predicts the label of a test instance using the k-NN algorithm.

In [ ]:

```
def predict(x, k):
L = [(y, sim(x, z)) for (y,z) in train_data]
L.sort(lambda a,b: -1 if a[1] > b[1] else 1)
#print L[:k]
score = sum([e[0] for e in L[:k]])
if score > 0:
return 1
else:
return -1
```

Take a moment to study the predict function. k-NN happens here. We are given k and the instance to be classified, x. The first thing we do is computing the similarity scores between x and each instance z in our train dataset. We must also store the labels so that we can later find the majority label.

Next, we need to find the neighbours. For that we sort this list of tuples by the value of the second item in tuples, which is similarity. lambda expressions are convenient ways to write in-place functions. Here, we take two elements from our list, compare their similarity scores and return -1 or +1. The sort function will then use this to sort the list. In this case, it will sort in the descending order of similarity scores.

If you would like to confirm that it is indeed the descending order you can print the list after sorting (uncomment that line).

Next, we must find the majority label. Since we are doing binary classification and our labels are -1 and +1, when we add the labels for the nearest neigbours if we get a positive value then there must be more +1s than -1s, vice versa. You might have to do more complicated stuff for finding the majority label if there were more than 2 classes. But it is easy for the binary case as shown here.

In [98]:

```
print predict(test_data[0][1], 5)
```

Lets compute the accuracy of our k-NN classifier.

In [101]:

```
corrects = 0
k = 5
for (y,x) in test_data:
if y == predict(x, k):
corrects += 1
accuracy = float(corrects) / float(len(test_data))
print "Accuracy =", accuracy
```

So we have a decent? classifier here. Try the following things:

- change the value of k
- increase the number of instances N
- separate or bring together the two classes by adjusting the means of the two Gaussians.

How does the accuracy vary in each case?

In [ ]:

```
```