The Red Penguin

Tokenising


Monday 9 November 2020

I am using the NLTK module for this. To install it, I had to type:
pip install nltk
python -m nltk.downloader all
I also had to download sqlite3.dll and add this to anaconda3/DLLs to make the module work.

Here's some simple code:
from nltk.tokenize import word_tokenize
text = "God is Great! I won a lottery."
print(word_tokenize(text))
This outputs ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
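
NLTK can also tokenise by sentence rather than by word, using sent_tokenize from the same nltk.tokenize module. A quick sketch using the same text:

from nltk.tokenize import sent_tokenize
text = "God is Great! I won a lottery."
print(sent_tokenize(text))
# outputs ['God is Great!', 'I won a lottery.']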

Project: Write a program that reads a text file into memory, then tokenises it on the space character.

I have saved a file called textfile.txt which has the following text in it:

With gyms and pools shut as England's second lockdown begins, many of us will be digging out the running kit again. But for some, the thought of running in the dark is a scary one - so how can we exercise safely?

Hannah Baptiste keeps fit by playing in a women's football team and running.

"I felt safe going to the park and playing football in the spring because I could see people around me but now the thought of finishing work and going for a run - it's really not ideal," Hannah tells Radio 1 Newsbeat.


My code:
from nltk.tokenize import word_tokenize

# open the file, read the data into a string called data and then close the file
f = open("textfile.txt", "r")
data = f.read()
f.close()

# before I tokenize this, I need to remove some characters which will appear as separate tokens
remove = [',', '?', "'", '.', '``', '"', '-']

for i in remove:
    data = data.replace(i, '')

# create a list which tokenizes the string
tokenized = word_tokenize(data)

# print number of words in file
print(len(tokenized),"words in file")
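
As an aside, the open/read/close lines could also be written with Python's with statement, which closes the file automatically even if something goes wrong while reading. A sketch of the same step:

# with closes the file for us when the block ends
with open("textfile.txt", "r") as f:
    data = f.read()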
Next, I added to the code so I could count the instances of each word and then sort them by frequency. The new code is:
from nltk.tokenize import word_tokenize

# open the file, read the data into a string called data and then close the file
f = open("textfile.txt", "r")
data = f.read()
f.close()

# before I tokenize this, I need to remove some characters which will appear as separate tokens
remove = [',', '?', "'", '.', '``', '"', '-']

for i in remove:
    data = data.replace(i, '')

# create a list which tokenizes the string
tokenized = word_tokenize(data)

# print number of words in file
print(len(tokenized),"words in file\n")

# create an object called counter
counter = {}

# count words in tokenized
for i in tokenized:
    if i in counter:
        counter[i] += 1
    else:
        counter[i] = 1

# sort the dictionary by number of words
sorted_word_count = sorted(counter.items(), key=lambda x: x[1], reverse=True)

# I want to print all words appearing more than once now
print("All words that appear more than once, in order of frequency:")
for i in sorted_word_count:
    if i[1] > 1:
        print(i[0], i[1])
This outputs:

95 words in file

All words that appear more than once, in order of frequency:
the 6
and 4
of 3
running 3
in 3
a 3
for 2
thought 2
Hannah 2
playing 2
football 2
I 2
going 2
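
For comparison, the standard library's collections.Counter does the counting and sorting in far fewer lines. A sketch, assuming the tokenized list from above:

from collections import Counter

counter = Counter(tokenized)

# most_common() returns (word, count) pairs sorted by frequency
for word, count in counter.most_common():
    if count > 1:
        print(word, count)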

File handling


Monday 9 November 2020

This code opens a file stored in the same directory as your python file, reads it and then closes the file:
f = open("textfile.txt", "r")
data = f.read()
f.close()
# opens the file, reads the data into a string called data and then closes it
We can then do things with the data:
words = data.split(" ")
# creates a list with all of the words in it

print(words)
# outputs ['The', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

print(len(words))
# outputs 9
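
One thing to note: calling split() with no argument splits on any run of whitespace (spaces, tabs, newlines), which is usually safer for real files. A quick sketch:

words = data.split()
# splits on any whitespace, so double spaces and newlines
# don't produce empty strings in the list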
Let's add some code to see if we can count the instances in the "words" list.
counter = {}
# create an object called counter

for i in words:
    if i in counter:
        counter[i] += 1
    else:
        counter[i] = 1

print(counter)
# outputs {'The': 1, 'quick': 1, 'brown': 1, 'fox': 1, 'jumped': 1, 'over': 1, 'the': 1, 'lazy': 1, 'dog': 1}
I think what is happening here is that we are creating a dictionary, then looking at each member of the words list: if the word is already a key in the dictionary, we add one to its count, and if not, we add it as a new key with a count of 1.
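
The if/else inside the loop can also be written with dict.get, which returns a default value when the key is missing. A sketch of the same loop:

for i in words:
    # get returns 0 if i isn't in counter yet
    counter[i] = counter.get(i, 0) + 1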
