Tokenising
Monday 9 November 2020
I am using the NLTK module for this. To install it, I had to type:
pip install nltk
python -m nltk.downloader all

I also had to download sqlite3.dll and add this to anaconda3/DLLs to make the module work.
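Downloading everything worked, but it pulls in a lot of data. As far as I can tell, word_tokenize only needs the punkt tokenizer models, so this lighter download should also be enough:

python -m nltk.downloader punkt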
So some simple code:
from nltk.tokenize import word_tokenize

text = "God is Great! I won a lottery."
print(word_tokenize(text))

This outputs ['God', 'is', 'Great', '!', 'I', 'won', 'a', 'lottery', '.']
Project: Write a program that reads a text file into memory, then tokenises it on the space character.
I have saved a file called textfile.txt which has the following text in it:
With gyms and pools shut as England's second lockdown begins, many of us will be digging out the running kit again. But for some, the thought of running in the dark is a scary one - so how can we exercise safely?
Hannah Baptiste keeps fit by playing in a women's football team and running.
"I felt safe going to the park and playing football in the spring because I could see people around me but now the thought of finishing work and going for a run - it's really not ideal," Hannah tells Radio 1 Newsbeat.
My code:

from nltk.tokenize import word_tokenize

# open the file, read the data into a string called data and then close the file
f = open("textfile.txt", "r")
data = f.read()
f.close()
# before I tokenize this, I need to remove some characters which will appear as separate tokens
remove = [',', '?', "'", '.', '``', '"', '-']
for i in remove:
    data = data.replace(i, '')
# create a list which tokenizes the string
tokenized = word_tokenize(data)
# print number of words in file
print(len(tokenized), "words in file")
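As an aside, the character-stripping loop could also be done in a single pass with str.translate; a sketch assuming the same data string (translate works per character, so one backtick in the table covers the '``' case):

# build a table mapping each unwanted character to None, then apply it
table = str.maketrans("", "", ",?'.`\"-")
data = data.translate(table)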
Next, I added to the code so I could count the instances of each word, and then sort them by frequency. So the new code now is:

from nltk.tokenize import word_tokenize

# open the file, read the data into a string called data and then close the file
f = open("textfile.txt", "r")
data = f.read()
f.close()
# before I tokenize this, I need to remove some characters which will appear as separate tokens
remove = [',', '?', "'", '.', '``', '"', '-']
for i in remove:
    data = data.replace(i, '')
# create a list which tokenizes the string
tokenized = word_tokenize(data)
# print number of words in file
print(len(tokenized), "words in file\n")
# create a dictionary called counter
counter = {}
# count words in tokenized
for i in tokenized:
    if i in counter:
        counter[i] += 1
    else:
        counter[i] = 1
# sort the entries by count, highest first
sorted_word_count = sorted(counter.items(), key=lambda x: x[1], reverse=True)
# I want to print all words appearing more than once now
print("All words that appear more than once, in order of frequency:")
for i in sorted_word_count:
    if i[1] > 1:
        print(i[0], i[1])

This outputs:
95 words in file
All words that appear more than once, in order of frequency:
the 6
and 4
of 3
running 3
in 3
a 3
for 2
thought 2
Hannah 2
playing 2
football 2
I 2
going 2
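For what it's worth, the counting and sorting can be collapsed into a couple of lines with collections.Counter from the standard library; a sketch reusing the tokenized list from above:

from collections import Counter

# Counter tallies the tokens; most_common() returns (word, count) pairs sorted by count
counts = Counter(tokenized)
for word, count in counts.most_common():
    if count > 1:
        print(word, count)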