In this homework, you have been given a selection of random biography pages from Wikipedia. This information can be found in the file "206_hw5_wiki_bios.txt". Your assignment is to use regular expressions to extract information from these biographies. To be clear, each function (except for read_file) must pull the appropriate pieces from the text using regex.
To do so, you will complete the following functions in HW5.py:
find_bio_names(string_list) returns a dictionary whose keys are the numbers 1 through 10 and whose values are the names of the biography subjects. Use a regular expression to find the pattern that introduces each biography, then add each name to the dictionary, keyed by that name's position in the list of biographies.
The expected output should be in the format:
{1: "Mike Kearney", 2: "Margit Symo", ... }
find_possessives(string_list) finds all possessives used in the text file and returns them in a list. A word counts as a possessive if it has letters both before and after an apostrophe. Valid possessives might include: Holden's, Julia's, etc. Note that many apostrophes in the text file don't meet these criteria.
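The rule above (letters on both sides of an apostrophe) can be sketched as a pattern. This is a minimal sketch, not the required solution: the function name sketch_find_possessives and the sample lines are hypothetical, and it assumes the file uses the straight ASCII apostrophe. If the text contains curly apostrophes, the pattern would need to allow for them as well.

```python
import re

# One or more letters, an apostrophe, then one or more letters.
# Assumption: only the straight apostrophe (') appears in the text.
POSSESSIVE = re.compile(r"\b[A-Za-z]+'[A-Za-z]+\b")

def sketch_find_possessives(string_list):
    """Return every word that has letters on both sides of an apostrophe."""
    found = []
    for line in string_list:
        found.extend(POSSESSIVE.findall(line))
    return found

# Hypothetical sample lines, not taken from the assignment file.
sample = [
    "Holden's first album sold well.\n",
    "The '90s were the players' best years.\n",
]
print(sketch_find_possessives(sample))  # prints ["Holden's"]
```

Neither '90s nor players' is matched, because each lacks letters on one side of the apostrophe. Note that a pattern this simple would also match contractions such as don't, which is consistent with the rule as stated.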
find_section_headings(string_list) finds and returns all the section headings which match the following conditions:
These are examples of valid section headings:
==Albums==
===Producer Compilation Albums===
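Based only on the two examples above, a heading check might be sketched as follows. The rule used here is an assumption, not the assignment's stated conditions: a heading fills a whole line and has the same run of two or more '=' characters on each side of its title. The function name sketch_find_headings and the sample lines are hypothetical.

```python
import re

# Assumed rule: an identical run of two or more '=' on both sides of the title.
# \1 forces the closing run to be the same length as the opening run.
HEADING = re.compile(r"^(={2,})([^=\n]+)\1\s*$")

def sketch_find_headings(string_list):
    """Return each line that looks like a section heading, without the newline."""
    headings = []
    for line in string_list:
        match = HEADING.match(line)
        if match:
            headings.append(match.group(0).strip())
    return headings

# Hypothetical sample lines, not taken from the assignment file.
sample = [
    "==Albums==\n",
    "===Producer Compilation Albums===\n",
    "x == y is a comparison\n",
    "==Broken===\n",
]
print(sketch_find_headings(sample))
# prints ['==Albums==', '===Producer Compilation Albums===']
```

The backreference is the design choice worth noting: it rejects lines whose opening and closing runs differ in length. Adjust the pattern to whatever conditions the assignment actually specifies.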
find_birth_years(string_list) returns a dictionary where the keys are the names of the biography subjects and the values are integers representing each subject's year of birth. Where the year of birth is unknown, you should save the string 'unknown' instead of a year.
Example:
{'Mike Kearney': 1953, 'Alexander Champion': 'unknown', ... }
Extra credit: Write a function count_mid(string_list, middle) that returns a count of the number of times a specified string appears in the file. It should match the string only when it is in the middle of a word (not at the beginning or the end). For example, if called with "be" it should match "number" but not "vibe". Make sure to account for punctuation (e.g., ',' or '?') in your regular expression. You MUST use a regular expression to earn credit for this part. (We will not be checking if you make tests for the extra credit, but feel free to write your own tests if it will help you complete this problem!)
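One way to express "in the middle of a word" is with a lookbehind and a lookahead that each require a letter. This is a minimal sketch under stated assumptions: matching is case-sensitive, every occurrence is counted (not every word), and the function name sketch_count_mid and the sample lines are hypothetical.

```python
import re

def sketch_count_mid(string_list, middle):
    """Count occurrences of `middle` that have a letter on both sides."""
    # re.escape keeps any regex metacharacters in `middle` literal.
    # The lookbehind and lookahead each demand a letter, so punctuation,
    # whitespace, or the edge of a line next to the match rules it out.
    pattern = re.compile(r"(?<=[A-Za-z])" + re.escape(middle) + r"(?=[A-Za-z])")
    return sum(len(pattern.findall(line)) for line in string_list)

# Hypothetical sample lines, not taken from the assignment file.
sample = [
    "The number of members grew.\n",
    "What a vibe, what a vibe?\n",
    "Be better.\n",
]
print(sketch_count_mid(sample, "be"))  # prints 2
```

Only "number" and "members" are counted. "vibe," and "vibe?" end the word at the match, "better" starts with it, and "Be" differs in case.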
# Your name:
# Your student id:
# Your email:
# List who you have worked with on this homework:
import re, os, unittest

def read_file(filename):
    """ Return a list of the lines in the file with the passed filename """
    # Open the file and get the file object
    source_dir = os.path.dirname(__file__) #<-- directory name
    full_path = os.path.join(source_dir, filename)
    infile = open(full_path,'r', encoding='utf-8')
    # Read the lines from the file object into a list
    lines = infile.readlines()
    # Close the file object
    infile.close()
    # return the list of lines
    return lines

def find_bio_names(string_list):
    """
    This function returns a dictionary with the keys being numbers, (1 - 10)
    and the values being the names of each biography subject
    """
    pass

def find_possessives(string_list):
    """
    This function finds all (real, English language) words with an apostrophe in them
    """
    pass

def find_section_headings(string_list):
    """
    This function returns a list of section headings in the list of strings
    """
    pass

def find_birth_years(string_list):
    """
    This function returns a dictionary where the keys are names and the values are corresponding birth years
    If the birth year is unknown, use the string 'unknown' in place of a birth year
    Hint: you could call your find_bio_names function here to help
    """
    pass
## Extra credit
def count_mid(string_list, middle):
    """
    This function returns a count of the number of times a specified string appears
    in the text. The matched string should be in the middle of a word, not at
    the start or end
    """
    pass
#Implement your own tests
class TestAllMethod(unittest.TestCase):
    def test_find_bio_names(self):
        pass

    def test_find_possessives(self):
        pass

    def test_find_section_headings(self):
        pass

    def test_find_birth_years(self):
        pass

    #Uncomment if working on Extra Credit
    #def test_count_mid(self):
    #    pass

def main():
    #Feel free to run your functions here as well!
    pass

if __name__ == '__main__':
    main()
    print()
    unittest.main(verbosity=2)