We will write a program that generates the input to a word-cloud generator program. A word cloud is a representation of a collection of words in which the most frequently used words appear in a larger font. This example shows where students are physically located this term: see image.

We will not be doing the fun part: generating the words in pretty colours and making it look cool. Instead, we'll be writing the code that reads a text file containing some text from a speech or a blog (or whatever) and, from this, generates an output file consisting of the words in the text and their frequencies. For instance:

Kingston 20
Toronto 45
Ottawa 36

This file could then be read by a word-cloud program that would generate the picture above.

For this assignment, I am not giving you complete skeleton code, but instead an outline of the functions that are required, their parameters and what they should return. Your program should follow this outline and not deviate. Deviations from the outline (that is, changing the parameters or returns, adding functions, leaving out functions etc.) will result in a lower mark. Please follow the outline.

You will also need to write appropriate docstrings (comments at the start of each function). For the proper format you can refer to the skeleton code provided in Assignment 8. Each function should be defined with a complete docstring to describe the functionality, the parameters and the return values.

Here are the functions that you need to write:

1. readFile() - this function takes no parameters and it returns a list where each element is a word in the file. The function will open the file called "cisc101WordCloudFile.txt" (provided as an attachment to this assignment), read the contents into a string and convert the string to a list of words. Note that some words may have \n characters on the end or contain punctuation. The \n characters should be removed. Punctuation can stay. You might find the .split() method useful to split a string into a list of words. (Try this: "the quick brown fox".split() and see what you end up with). The file should be found in the same location as your program, so no absolute path should be indicated in the open() function. Be sure to check that the file is found and opened properly using exceptions. If not, inform your user and end the program. DO NOT CHANGE THE NAME OF THE FILE.
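A minimal sketch of how readFile() could look, assuming that "end the program" means calling sys.exit() and that catching OSError (which covers a missing file) is acceptable. The variable names and the wording of the error message are illustrative, not prescribed by the outline.

```python
import sys


def readFile():
    """Read cisc101WordCloudFile.txt and return its words.

    Parameters: none
    Returns: a list of strings, one per whitespace-separated word
    """
    try:
        # Relative name only: the file is expected beside the program
        with open("cisc101WordCloudFile.txt", "r") as inFile:
            contents = inFile.read()
    except OSError:
        # Assumption: a short message followed by sys.exit() is enough
        print("Could not open cisc101WordCloudFile.txt")
        sys.exit(1)
    # split() with no arguments splits on any whitespace, so the \n
    # characters disappear; punctuation stays attached to the words
    return contents.split()
```

Because split() treats newlines as whitespace, no separate step is needed to remove the \n characters.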

2. isValid(word) - this function takes one parameter, a string (a single word). It returns True if the word should be kept in the list, False otherwise. Words are considered valid if they are 4 or more characters in length, they do not start or end with a digit and they do not contain any punctuation marks. You will find the isdigit() string method useful for this function. You will find this site useful when it comes to trying to figure out whether or not a word contains punctuation marks: https://www.geeksforgeeks.org/string-punctuation-in-python/
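One possible shape for isValid(), sketched under the assumption that "punctuation marks" means the characters in Python's string.punctuation constant (the approach described on the linked page). The order of the checks is a choice made here, not something the outline requires.

```python
import string


def isValid(word):
    """Decide whether a word belongs in the word cloud.

    Parameters: word - a string holding a single word
    Returns: True if the word should be kept, False otherwise
    """
    # Too short: fewer than 4 characters
    if len(word) < 4:
        return False
    # Must not start or end with a digit
    if word[0].isdigit() or word[-1].isdigit():
        return False
    # Must not contain any punctuation character
    for ch in word:
        if ch in string.punctuation:
            return False
    return True
```

Checking the length first also means word[0] and word[-1] are never evaluated on an empty string.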

3. cleanseWords(listOfWords) - this function will remove any words that we don't want to keep in our word-cloud. It takes as a parameter a list of words and modifies the list to remove some words. Nothing is returned, but the list may be modified in the function. Here is how this function should be structured. Note that you want to traverse from the end of the list to the beginning so that we can safely remove words from the end of the list. (If you go the other way and remove things from the start of the list while looping through the values, you run into problems with indexing.) How do you go from the end to the beginning? for i in range(len(listOfWords) - 1, -1, -1) will do this for you.

for each word, starting at the end of the list and moving to the beginning:
    make the word all lower case
    check to see if the word is valid (by calling isValid())
    if not a valid word, remove it from the list.
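The steps above can be sketched as follows. Two assumptions: the lower-cased word is stored back into the list, and the compact isValid() included here is only a stand-in so the example runs on its own (it applies the rules from item 2).

```python
import string


def isValid(word):
    """Compact stand-in for the isValid() described in item 2."""
    return (len(word) >= 4
            and not word[0].isdigit()
            and not word[-1].isdigit()
            and not any(ch in string.punctuation for ch in word))


def cleanseWords(listOfWords):
    """Lower-case every word and remove the invalid ones in place.

    Parameters: listOfWords - a list of strings (modified by this function)
    Returns: nothing
    """
    # Walk from the last index down to 0 so that deleting an element
    # never shifts the position of an element we have not visited yet
    for i in range(len(listOfWords) - 1, -1, -1):
        listOfWords[i] = listOfWords[i].lower()
        if not isValid(listOfWords[i]):
            del listOfWords[i]
```

For example, calling cleanseWords() on ["Hello", "to", "World.", "7days", "Kingston"] leaves the list as ["hello", "kingston"].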

4. countFrequencies(listOfWords) - this function takes the list of words and creates a dictionary consisting of word: frequency elements. So, for instance, if our list of words was ["to", "be", "or", "not", "to", "be"], we would return a dictionary consisting of the following: {"to": 2, "be": 2, "or": 1, "not": 1}. Approach this in the following way:

for each word in the list:
    if it appears in the dictionary:
        increment dictionary[word] by 1.
    else:
        add the word to the dictionary with a count of 1.
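The approach above translates almost line for line into Python. This is a sketch; the local name frequencies is an illustrative choice.

```python
def countFrequencies(listOfWords):
    """Count how often each word occurs.

    Parameters: listOfWords - a list of strings
    Returns: a dictionary mapping each word to its number of occurrences
    """
    frequencies = {}
    for word in listOfWords:
        if word in frequencies:
            # Seen before: bump the existing count
            frequencies[word] += 1
        else:
            # First occurrence: start the count at 1
            frequencies[word] = 1
    return frequencies


print(countFrequencies(["to", "be", "or", "not", "to", "be"]))
```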

5. writeFile(dictionaryOfFrequencies) - this function takes as a parameter the dictionary consisting of the word: frequency elements and writes the contents to a file called "outputForWordCloud101.txt". There should be no path provided; the file should be written where the code is located. Each word: frequency pair should be on a separate line in the file. So for the example given above, the output file would look like this:

to 2
be 2
or 1
not 1

Be sure to check that all file I/O operations succeed by using an exception handler. If not, inform your user and end the program.
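A minimal sketch of writeFile(), assuming the word and its count are separated by a single space (as in the sample output) and that a failed write is reported with a message followed by sys.exit(). The message text is illustrative.

```python
import sys


def writeFile(dictionaryOfFrequencies):
    """Write each word and its frequency to outputForWordCloud101.txt.

    Parameters: dictionaryOfFrequencies - a dictionary of word: count pairs
    Returns: nothing
    """
    try:
        # Relative name only, so the file lands beside the program
        with open("outputForWordCloud101.txt", "w") as outFile:
            for word, count in dictionaryOfFrequencies.items():
                # One "word count" pair per line
                outFile.write(word + " " + str(count) + "\n")
    except OSError:
        print("Could not write outputForWordCloud101.txt")
        sys.exit(1)
```

Keeping both the open() and the write() calls inside the try block means a failure at either stage is caught.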

6. main() - main will call the functions in the following order (with appropriate parameters):

a. readFile()
b. cleanseWords()
c. countFrequencies()
d. writeFile()
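The sketch below shows how the data might flow through main(). The four other functions are simplified stand-ins (no file access, a reduced validity rule) included only so the example runs on its own; they are not the versions the assignment asks for. The point to notice is that cleanseWords() returns nothing, so its result is not assigned to a variable.

```python
def readFile():
    # Stand-in: returns a fixed list instead of reading the file
    return ["Kingston", "and", "Toronto", "Kingston"]


def cleanseWords(listOfWords):
    # Stand-in: lower-cases in place and drops words under 4 characters
    for i in range(len(listOfWords) - 1, -1, -1):
        listOfWords[i] = listOfWords[i].lower()
        if len(listOfWords[i]) < 4:
            del listOfWords[i]


def countFrequencies(listOfWords):
    frequencies = {}
    for word in listOfWords:
        if word in frequencies:
            frequencies[word] += 1
        else:
            frequencies[word] = 1
    return frequencies


def writeFile(dictionaryOfFrequencies):
    # Stand-in: prints the pairs instead of writing them to a file
    for word, count in dictionaryOfFrequencies.items():
        print(word, count)


def main():
    words = readFile()
    cleanseWords(words)                    # modifies words in place
    frequencies = countFrequencies(words)
    writeFile(frequencies)
```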

Suggestions

1) Write each function and test it individually. You can, for instance, start with the function isValid(). It takes a string and returns True or False depending on the conditions.

So, to test this function, you don't need to have read the file using your code - you can just make up inputs. For instance, I can put the following code in my program to test this function:

print(isValid("7abcd"), "This should produce a False result since it starts with a number")
print(isValid("and"), "This should produce a False result since it is too short")
etc ...

You could write the function writeFile() without any other functions. Simply pass it in a dictionary of fake data and check that the file is created properly.

You do not need to show this testing - it is just to give you an idea of how to build up your program. Build it one function at a time. Doing this will allow you to get credit for what you do get working even if you do not get the entire program to work together.

2) The file that I have given you is long! Test your code by creating a smaller test file where you know exactly what the input is. Make sure it works on this file first, then run it on the longer file.

Challenges

There are many extensions that you could add to this assignment. This section is optional (and not for marks). Hand in only what I have asked you to do, but if you want some challenges, you could do the following:

  • Remove the punctuation from words (right now the words "world." and "them." show up as unique words). Remove all punctuation before processing.
  • Write the file with the word frequencies sorted -- so the most frequently used word is at the top.
  • Remove any words that appear only once in the file.
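
For anyone attempting the challenges, here is one hedged sketch of the building blocks. Both helper names are hypothetical and are not part of the required outline, so they belong only in an optional version of the program, not in the one handed in.

```python
import string


def stripPunctuation(word):
    """Hypothetical helper: return word with all punctuation removed."""
    # str.maketrans with a third argument maps those characters to None
    return word.translate(str.maketrans("", "", string.punctuation))


def sortedByFrequency(frequencies):
    """Hypothetical helper: (word, count) pairs, most frequent first.

    Words that appear only once are left out.
    """
    repeated = [(word, count) for word, count in frequencies.items()
                if count > 1]
    return sorted(repeated, key=lambda pair: pair[1], reverse=True)
```

The list returned by sortedByFrequency() could then be written out one pair per line, in the same format as the required output file.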