The project will give you an opportunity to use Python to model, analyze, and score the similarity of text samples. In essence, you will develop a statistical tool that will allow you to identify a particular author or style of writing!
Statistical models of text are one way to quantify how similar one piece of text is to another. Such models were used as evidence that the book The Cuckoo's Calling was written by J. K. Rowling (using the name Robert Galbraith) in the summer of 2013. Details on that story and the analysis that was performed appear here.
The comparative analysis of Rowling's works used surprisingly simple techniques to model the author's style. Possible style signatures could include:
Other similar measures of style are also possible. Perhaps the most common application of this kind of feature-based text classification is spam detection and filtering (which you can read more about here, if you're interested).
To allow you to compare and classify texts, you will create a statistical model of a body of text using several different techniques. At a minimum, you should integrate five features: words, word lengths, word stems, sentence lengths, and one additional feature of your choice.
Computing word lengths from words should be fairly easy, but computing stems, sentence lengths, and your additional feature are an additional part of the challenge, and offer plenty of room for algorithmic creativity!
In general, the project description below will offer fewer guidelines than the problem sets did. If the guidelines do not explicitly mandate or forbid certain behavior, you should feel free to be creative!
For the first part of the project, you will create an initial version of a TextModel class, which will serve as a blueprint for objects that model a body of text (i.e., a collection of one or more text documents).
To get started, open a new file in IDLE and name it finalproject.py. In this file, declare a class named TextModel. Add appropriate comments to the top of the file.
Write a constructor __init__(self, model_name) that constructs a new TextModel object by accepting a string model_name as a parameter and initializing the following three attributes:
name - a string that serves as a label for this text model, given by model_name
words - a dictionary that records the number of times each word appears in the text
word_lengths - a dictionary that records the number of times each word length appears in the text
Each of the dictionaries should be initialized to the empty dictionary ({}). In Part III, you will add dictionaries for the other three features as well.
Write a method __str__(self) that returns a string that includes the name of the model as well as the sizes of the dictionaries for each feature of the text.
For example, if a model for J. K. Rowling has been set up, the return value of this method may look like:
text model name: J. K. Rowling
number of words: 2103
number of word lengths: 17
Here again, information about the other feature dictionaries will eventually be included, but for now only the first two dictionaries are needed.
Notes:
def __str__(self):
    """Return a string representation of the TextModel."""
    s = 'text model name: ' + self.name + '\n'
    s += '  number of words: ' + str(len(self.words)) + '\n'
    return s
Write a helper function named clean_text(txt) that takes a string of text txt as a parameter and returns a list containing the words in txt after it has been cleaned. This function will be used when you need to process each word in a text individually, without having to worry about punctuation or special characters.
Notes:
>>> s = 'Mr. Smith programs.'
>>> s = s.replace('.', '')
>>> s
'Mr Smith programs'
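As a rough sketch, one possible approach (not the only one, and not necessarily the best) is to strip a fixed set of punctuation symbols, lowercase the text, and split on whitespace. The particular set of symbols handled below is a choice, not a requirement:

```python
# A possible sketch of clean_text. The punctuation symbols removed
# here are one choice among many; your version may handle more cases.
def clean_text(txt):
    """Return a list containing the words in the string txt,
       after removing common punctuation and lowercasing.
    """
    for symbol in '.,?!;:"\'':
        txt = txt.replace(symbol, '')   # Remove each punctuation mark.
    return txt.lower().split()          # Lowercase, then split on whitespace.
```

Lowercasing ensures that 'The' and 'the' are counted as the same word, which matches the sample dictionaries shown later in this handout.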
Write a method add_string(self, s) that adds a string of text s to the model by augmenting the feature dictionaries defined in the constructor.
Here is the beginning of the method to get you started. It includes code for updating the words dictionary:
def add_string(self, s):
    """Analyzes the string s and adds its pieces
       to all of the dictionaries in this text model.
    """
    # Add code to clean the text and split it into a list of words.
    # *Hint:* Call one of your other methods!
    words = ...

    # Code for updating the words dictionary.
    for w in words:
        if w not in self.words:
            self.words[w] = 1
        else:
            self.words[w] += 1

    # Add code to update the other feature dictionaries.
For now, you should add code to this method to update the word_lengths dictionary. Later, you will extend the method to update the other dictionaries as well.
Write a method add_file(self, filename) that adds all of the text in the file identified by filename to the model.
Hint: You have already written a method that adds a string of text to the model. To implement add_file, you only need to read the file into a string and make the appropriate method call to add the string of text to the model!
Important: When you open the file for reading, you should specify two additional arguments as follows:
f = open(filename, 'r', encoding='utf8', errors='ignore')
These encoding and errors arguments should allow Python to handle special characters (e.g., smart quotes) that may be present in your text files.
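Following that hint, a sketch of add_file can be quite short. This version assumes the add_string method described above:

```python
def add_file(self, filename):
    """Adds all of the text in the file identified by
       filename to the model.
    """
    f = open(filename, 'r', encoding='utf8', errors='ignore')
    text = f.read()        # Read the entire file into one string.
    f.close()
    self.add_string(text)  # Reuse the method you already wrote.
```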
At this point, you are ready to run some initial tests on your methods. Try entering the following commands from the Shell, but remember that the contents of your dictionaries may be printed in a different order.
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> print(model)
text model name: A. Poor Righter
number of words: 5
number of word lengths: 4
>>> model.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
>>> model.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}
You should continue testing your code frequently from this point forward to make sure everything is working correctly at each step. Otherwise, if you reach the end and realize there are errors, it can be very difficult to determine the causes of those errors in such a large program!
Creating a text model can require a lot of computational power and time. Therefore, once we have created a model, we want to be able to save it for later use. The easiest way to do this is to write each of the feature dictionaries to a different file so that we can read them back in at a later time. In this part of the project, you will add methods to your TextModel class that allow you to save and retrieve a text model in this way.
To get you started, here is a function that defines a small dictionary and saves it to a file:
def sample_file_write(filename):
    """A function that demonstrates how to write a
       Python dictionary to an easily-readable file.
    """
    d = {'test': 1, 'foo': 42}   # Create a sample dictionary.
    f = open(filename, 'w')      # Open file for writing.
    f.write(str(d))              # Write the dictionary to the file.
    f.close()                    # Close the file.
Notice that the file is opened for writing by using a second parameter of 'w' in the open function call. In addition, we write to the file by using the file handle's write method on a string representation of the dictionary.
Below is a function that reads in a string representing a dictionary from a file, and converts this string (which is a string that looks like a dictionary) to an actual dictionary object. The conversion is performed using a combination of two built-in functions: dict, the constructor for dictionary objects; and eval, which evaluates a string as if it were an expression.
def sample_file_read(filename):
    """A function that demonstrates how to read a
       Python dictionary from a file.
    """
    f = open(filename, 'r')    # Open for reading.
    d_str = f.read()           # Read in a string that represents a dict.
    f.close()

    d = dict(eval(d_str))      # Convert the string to a dictionary.
    print("Inside the newly-read dictionary, d, we have:")
    print(d)
Try saving these functions in a Python file (or even in finalproject.py), and then use the following calls from the Python Shell:
>>> filename = 'testfile.txt'
>>> sample_file_write(filename)
>>> sample_file_read(filename)
Inside the newly-read dictionary, d, we have:
{'test': 1, 'foo': 42}
There should also now be a file named testfile.txt in your current working directory that contains this dictionary.
Now that you know how to write dictionaries to files and read dictionaries from files, add the following two methods to the TextModel class:
Write a method save_model(self) that saves the TextModel object self by writing its various feature dictionaries to files. There will be one file written for each feature dictionary. For now, you just need to handle the words and word_lengths dictionaries.
In order to identify which model and dictionary is stored in a given file, you should use the name attribute concatenated with the name of the feature dictionary. For example, if name is 'JKR' (for J. K. Rowling), then we would suggest using the filenames:
JKR_words
JKR_word_lengths
In general, the filenames are self.name + '_' + name_of_dictionary. Taking this approach will ensure that you don't overwrite one model's dictionary files when you go to save another model.
It may help to use the code for sample_file_write as a starting point for this method, but don't forget that you should create a separate file for each dictionary.
Write a method read_model(self) that reads the stored dictionaries for the called TextModel object from their files and assigns them to the attributes of the called TextModel.
This is the complementary method to save_model, and you should assume that the necessary files have filenames that follow the naming scheme used in save_model.
It may help to use the code for sample_file_read as a starting point for this method, but note that you should read a separate file for each dictionary.
Remember that you can use the dict and eval functions to convert a string that represents a dictionary to an actual dictionary object.
Examples: You can test your save_model and read_model methods as follows:
# Create a model for a simple text, and save the resulting model.
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> model.save_model()
# Create a new TextModel object with the same name as the original one,
# and assign it to a new variable.
>>> model2 = TextModel('A. Poor Righter')
# Read the dictionaries that were saved for the original model,
# and use them as the dictionaries of `model2`.
>>> model2.read_model()
>>> print(model2)
text model name: A. Poor Righter
number of words: 5
number of word lengths: 4
>>> model2.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
>>> model2.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}
Update your __init__ method so that it initializes attributes for three additional dictionaries: stems (for word stems), sentence_lengths (for sentence lengths), and a dictionary for your chosen additional feature.
Write a helper function named stem(s) that accepts a string as a parameter. The function should then return the stem of s. The stem of a word is the root part of the word, which excludes any prefixes and suffixes. For example:
>>> stem('party')
'parti'
>>> stem('parties')
'parti'
>>> stem('love')
'lov'
>>> stem('loving')
'lov'
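To give a sense of a rule-based approach that reproduces the four examples above, here is a minimal sketch. It is deliberately incomplete (for instance, it leaves 'partiers' as 'partier' rather than 'parti'), and your version can and should handle more cases:

```python
# A rough, rule-based sketch of stem. The specific suffix rules and
# the length cutoff are illustrative choices, not requirements.
def stem(s):
    """Return the stem of the word s."""
    if len(s) <= 3:              # Leave very short words alone.
        return s
    if s.endswith('ies'):
        return s[:-3] + 'i'      # parties -> parti
    if s.endswith('ing'):
        return s[:-3]            # loving -> lov
    if s.endswith('es'):
        return s[:-2]
    if s.endswith('e'):
        return s[:-1]            # love -> lov
    if s.endswith('y'):
        return s[:-1] + 'i'      # party -> parti
    if s.endswith('s'):
        return s[:-1]
    return s
```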
Notes:
Extend your add_string method to update the feature dictionaries for word stems, sentence lengths, and your chosen additional feature.
Notes:
Update the following methods to incorporate the feature dictionaries for word stems, sentence lengths, and your chosen additional feature:
Test your new code by performing the following test. Here again, your dictionary contents may be printed in a different order. In addition, your dictionary for stems may be slightly different, depending on how you implemented the stem function:
>>> model = TextModel('A. Poor Righter')
>>> model.add_string("The partiers love the pizza party.")
>>> print(model)
text model name: A. Poor Righter
number of words: 5
number of word lengths: 4
number of stems: 4
number of sentence lengths: 1
# info for your additional feature goes here!
>>> model.words
{'party': 1, 'partiers': 1, 'pizza': 1, 'love': 1, 'the': 2}
>>> model.word_lengths
{8: 1, 3: 2, 4: 1, 5: 2}
>>> model.stems
{'parti': 2, 'the': 2, 'pizza': 1, 'lov': 1}
>>> model.sentence_lengths
{6: 1}
Optional: Add extra functionality to your TextModel object as you see fit. This may include improving your algorithms for cleaning the strings or finding stems, or it may entail adding even more feature dictionaries beyond the ones that we have required.
In this part of the project, you will first implement the core algorithm that will allow you to compare bodies of text. This algorithm will produce a numeric similarity score that measures how similar one body of text is to another, based on one type of feature (e.g., word lengths). You will then compute scores of this type for all five of the features, and use them to classify a piece of text as being more likely to come from one source than another.
The similarity score that we will compute is based on a statistical model known as a Naive Bayes probability model. Despite the 'naive' in its name, scores computed using this model have been very successful in distinguishing spam email from non-spam ('ham') email, among other classification problems.
In essence, the Naive Bayes scoring algorithm works in the following way: you give it feature counts (e.g., word counts) from one body of text and feature counts from a second body of text. The algorithm will then compute the likelihood that the second body of text is from the same source as the first body of text! The reason that the algorithm is called naive is that it makes the assumption that each item in a given feature set is independent. For example, it assumes that the appearance of the word 'spell' does not depend on the appearance of the word 'potter', and that this independence holds for all pairs of words. This assumption is certainly not true, but that turns out not to matter in many situations!
You can read more details about the use of Naive Bayes probabilities for classification on Wikipedia if you would like to know more, but all of the necessary information is summarized below.
To illustrate how the Bayesian scoring algorithm works, let's assume that the only features we care about are the individual words in the texts.
As you have already done in your TextModel class, we can use a Python dictionary to model all of the words in a text. The dictionary's keys are words, and the value for a given word is the number of times that it appears in the text.
For example, let's assume that we have two text documents:
shakespeare_dict = {'love': 50, 'spell': 8, 'thou': 42}
mystery_dict = {'love': 3, 'thou': 1, 'potter': 2, 'spam': 4}
The Bayesian similarity score between these two texts attempts to measure the likelihood that the ten words in the mystery text come from the same class of text as the 100 words in the source text. (A given class of text could be based on a particular author or publication, or on other characteristics of the texts in question.)
To calculate the score, we first take each word in the mystery text and compute a probability for it that is based on the number of times that it occurs in the source text. If a word in the mystery text doesn't occur at all in the source text (which would lead to a probability of 0), we instead compute a probability that is based on a default word count of 0.5. This will allow us to avoid multiplying by 0 when we compute the final score.
Here are the probabilities for the words in our mystery text:
love: 50/100 = .5
thou: 42/100 = .42
potter: 0.5/100 = .005 (default count, since 'potter' is not in the source text)
spam: 0.5/100 = .005 (default count, since 'spam' is not in the source text)
Important: These probabilities have denominators of 100 because the source text has 100 words in it. The denominators should not always be 100!
To compute the similarity score of the mystery text, we need to compute a product in which a given words probability is multiplied by itself n times, where n is the number of times that the word appears in the mystery text. In this case, we would do the following:
# 3 "love" 1 "thou" 2 "potter" 4 "spam"
sim_score = (.5*.5*.5) * (.42) * (.005*.005) * (.005*.005*.005*.005)
This similarity score is very small! In practice, these very small values are hard to work with, and they can become so small that Python's floating-point values cannot accurately represent them! Therefore, instead of using the probabilities themselves, we will use the logs of the probabilities. The log operation transforms multiplications into additions (and exponents into multiplications), so our log-based similarity score would be:
log_sim_score = 3*log(.5) + 1*log(.42) + 2*log(.005) + 4*log(.005)
This results in a more manageable value of around -34.737. (Note that Python's math.log function uses the natural log (base e) by default, which is fine for our purposes.)
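You can check this arithmetic directly using Python's math module:

```python
import math

# The log-based similarity score for the mystery text vs. Shakespeare:
# 3 occurrences of 'love' (prob .5), 1 of 'thou' (prob .42), and
# default-count probabilities (0.5/100 = .005) for 'potter' and 'spam'.
log_sim_score = (3 * math.log(0.5)
                 + 1 * math.log(0.42)
                 + 2 * math.log(0.005)
                 + 4 * math.log(0.005))

print(round(log_sim_score, 3))   # -34.737
```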
The resulting similarity score gives us a measure of how similar the mystery text is to the source text. To classify a new mystery text, we compute similarity scores between it and a collection of known texts in order to determine which of the known texts is most likely to be related to the mystery text.
For example, let's say that we also have the following model for texts by J.K. Rowling:
jkr = {'love': 25, 'spell': 275, 'potter': 700}
We can compute the similarity score in the same way as above, accounting for the three 'love' and two 'potter' appearances, and with the default value being used for the missing words:
sim_score = (.025*.025*.025) * (.001) * (.7*.7) * (.0005*.0005*.0005*.0005)
This value is also very small! Using logs:
log_sim_score = 3*log(.025) + 1*log(.001) + 2*log(.7) + 4*log(.0005)
Now the similarity score is approximately -49.091. This value is less than the value that we computed when comparing the mystery text to the Shakespeare text. Therefore, we can conclude that the mystery text is more likely to have come from Shakespeare than from J.K. Rowling.
Write a function (not a method, so it should be outside the class) named compare_dictionaries(d1, d2). It should take two feature dictionaries d1 and d2 as inputs, and it should compute and return their log similarity score. Here is some pseudocode for what you will need to do:
1. Initialize the score to 0.
2. Compute the total number of counts in d1 (the sum of its values).
3. For each item in d2: if the item also appears in d1, add d2[item] * log(d1[item]/total) to the score; otherwise, add d2[item] * log(0.5/total) to the score.
4. Return the score.
Write a method similarity_scores(self, other) that computes and returns a list of log similarity scores measuring the similarity of self and other: one score for each type of feature (words, word lengths, stems, sentence lengths, and your additional feature). You should make repeated calls to compare_dictionaries and put the resulting scores in a list that the method returns.
Important: In each call to compare_dictionaries, the dictionary belonging to self should be the second parameter of the call. For example:
word_score = compare_dictionaries(other.words, self.words)
Finally, write a method classify(self, source1, source2) that compares the called TextModel object (self) to two other source TextModel objects (source1 and source2) and determines which of these other TextModels is the more likely source of the called TextModel.
You should begin by calling similarity_scores twice:
scores1 = self.similarity_scores(source1)
scores2 = self.similarity_scores(source2)
Next, print the two lists of scores, preceding each list of scores with the name of the source TextModel. For example:
scores for Shakespeare: [-34.737, ...]
scores for J.K. Rowling: [-49.091, ...]
(Note: If you like, you can use the round function to round the printed scores to 2 or 3 places after the decimal, but doing so is not required.)
You should then use these two lists of scores to determine whether the called TextModel is more likely to have come from source1 or source2. One way to do this is to compare corresponding pairs of scores, and determine which of the source TextModels has the larger number of higher scores.
For example, imagine that the two sets of scores are the following:
scores1: [-34.737, -25.132, -55.312, -10.715, -47.125]
scores2: [-49.091, -21.071, -60.154, -16.502, -43.675]
scores1 has a higher score for three of the features (the ones in positions 0, 2, and 3 of the lists), while scores2 has a higher score for only two of the features (the ones in positions 1 and 4). Thus, we conclude that self is more likely to have come from source1.
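The pairwise comparison described above could be sketched as follows. The variable names here are illustrative, not required:

```python
# Count how many features each source "wins" by comparing the score
# lists position by position (higher, i.e. less negative, is better).
scores1 = [-34.737, -25.132, -55.312, -10.715, -47.125]
scores2 = [-49.091, -21.071, -60.154, -16.502, -43.675]

wins1 = 0
wins2 = 0
for i in range(len(scores1)):
    if scores1[i] > scores2[i]:
        wins1 += 1
    else:
        wins2 += 1

print(wins1, wins2)   # 3 2, so source1 is the better match
```

Note that this simple version awards ties to scores2; you may want to decide explicitly how ties should be handled.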
You should also feel free to take an alternative approach to using the two lists of scores. For example, you could compute a weighted sum of the scores in each list by doing something like this:
weighted_sum1 = 10*scores1[0] + 5*scores1[1] + 7*scores1[2] + ...
weighted_sum2 = 10*scores2[0] + 5*scores2[1] + 7*scores2[2] + ...
You could then base your classification on which source's weighted sum is larger. One advantage of this approach is that it allows you to adjust the relative importance of the different features, giving certain features a larger impact on the classification.
Your classify method should report its conclusions, using the names of the relevant TextModel objects. For example:
mystery is more likely to have come from Shakespeare
The method does not need to return anything.
Here is one function that you can use to test your TextModel implementation:
# Copy and paste the following function into finalproject.py
# at the bottom of the file, *outside* of the TextModel class.
def test():
    """ your docstring goes here """
    source1 = TextModel('source1')
    source1.add_string('It is interesting that she is interested.')

    source2 = TextModel('source2')
    source2.add_string('I am very, very excited about this!')

    mystery = TextModel('mystery')
    mystery.add_string('Is he interested? No, but I am.')
    mystery.classify(source1, source2)
Here is what we get when we run this function using our TextModel implementation:
>>> test()
scores for source1: [-16.394, -9.92, -15.701, -1.386, -1.386]
scores for source2: [-17.087, -15.008, -17.087, -1.386, -3.466]
mystery is more likely to have come from source1
Some of your scores will be different than ours (e.g., the third scores, which depend on how you stem the words, and the fifth scores, which depend on which additional feature you include). Our conclusion is based on a pairwise comparison of the scores: because source1 has a larger number of higher scores, it is chosen as the more likely source. If you use the lists of scores in another way, you may come to a different conclusion, which is fine!