In [2]:

```
A = "I love data mining"
B = "I hate data mining"
```

The first thing we need to do is import numpy.

In [3]:

```
import numpy
```

In [4]:

```
wordsA = A.lower().split()
wordsB = B.lower().split()
```

In [5]:

```
print(wordsA)
print(wordsB)
```

In [6]:

```
vocab = set(wordsA)
vocab = vocab.union(set(wordsB))
```

Let's print all the features in the vocabulary `vocab`.

In [7]:

```
print(vocab)
```

In [8]:

```
vocab = list(vocab)
```

You can see the list of unique features as follows:

In [9]:

```
print(vocab)
```

In [10]:

```
vA = numpy.zeros(len(vocab), dtype=float)
vB = numpy.zeros(len(vocab), dtype=float)
```

In [11]:

```
for w in wordsA:
    i = vocab.index(w)
    vA[i] += 1
```

Let's print this vector.

In [12]:

```
print(vA)
```

We can do the same procedure to populate the vector for the second sentence as follows:

In [13]:

```
for w in wordsB:
    i = vocab.index(w)
    vB[i] += 1
```

In [14]:

```
print(vB)
```

Again, check that the vector is correctly populated for the second sentence.
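As an aside, the two counting loops can also be written with `collections.Counter` from the standard library, which avoids the repeated `vocab.index` lookups. A minimal sketch (the sentence and vocabulary below are hard-coded copies of the ones above, so the snippet is self-contained):

```python
import numpy
from collections import Counter

wordsA = "i love data mining".split()
vocab = ["i", "love", "hate", "data", "mining"]  # order is arbitrary, as with list(set(...))

countsA = Counter(wordsA)  # maps each word to its frequency
vA = numpy.array([countsA[w] for w in vocab], dtype=float)
print(vA)  # counts aligned with the vocab order; absent words get 0
```

`Counter` returns 0 for missing keys, so words in the vocabulary that do not occur in the sentence naturally map to zero.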

Next, we will compute the cosine similarity between the two vectors. The cosine similarity between two vectors x and y is their dot product divided by the product of their L2 norms:

```
cos(x, y) = numpy.dot(x, y) / (numpy.sqrt(numpy.dot(x, x)) * numpy.sqrt(numpy.dot(y, y)))
```
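To sanity-check the formula before applying it to our sentence vectors, here is a minimal sketch on two tiny hand-made vectors (the values are illustrative, not taken from the sentences above):

```python
import numpy

x = numpy.array([1.0, 1.0, 0.0])
y = numpy.array([1.0, 0.0, 1.0])

# dot(x, y) = 1, |x| = sqrt(2), |y| = sqrt(2), so cos = 1 / 2
cos_xy = numpy.dot(x, y) / (numpy.sqrt(numpy.dot(x, x)) * numpy.sqrt(numpy.dot(y, y)))
print(cos_xy)  # 0.5
```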

In [16]:

```
cos = numpy.dot(vA, vB) / (numpy.sqrt(numpy.dot(vA,vA)) * numpy.sqrt(numpy.dot(vB,vB)))
```

In [17]:

```
print(cos)
```

In [18]:

```
print(numpy.dot(vA, vB) / (numpy.linalg.norm(vA) * numpy.linalg.norm(vB)))
```

As you can see, you get the same result, but the code is cleaner.
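The whole pipeline can be wrapped into one reusable function. A minimal sketch (the function name `cosine_sim` is my own choice, not from the text above):

```python
import numpy

def cosine_sim(sentA, sentB):
    """Bag-of-words cosine similarity between two sentences."""
    wordsA = sentA.lower().split()
    wordsB = sentB.lower().split()
    vocab = list(set(wordsA) | set(wordsB))  # unique features from both sentences
    vA = numpy.array([wordsA.count(w) for w in vocab], dtype=float)
    vB = numpy.array([wordsB.count(w) for w in vocab], dtype=float)
    return numpy.dot(vA, vB) / (numpy.linalg.norm(vA) * numpy.linalg.norm(vB))

print(cosine_sim("I love data mining", "I hate data mining"))  # 0.75
```

The two sentences share three of their four words ("i", "data", "mining"), so the dot product is 3 and each norm is 2, giving 3 / 4 = 0.75.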