Message Frequency Count
Write a program that reads through the mailbox data and, when it finds a line that starts with "From", extracts the address information from that line. Count the number of messages from each person by using a dictionary. Note that you might need to match on more than just "From", because each address appears more than once per message (hint: "From " vs. "From:"). Otherwise, these duplicate lines and embedded email thread histories may cause your count to be incorrect.
After all of the data has been read, display (i.e., print) the address of the person with the highest number of messages, along with the number of messages from that person. To do this, create a list of tuples (count, email) from the dictionary, sort the list in reverse order, and print out the person who has the highest number of messages.
Note: To succeed with this assignment, know your data! When your program counts the messages, how do you know if you are counting the messages correctly? Could you be counting the same message more than once? If your program were to operate correctly, how would you know it? Is there a smaller file that you can use to test your program? (Hint: There's a file named mbox-short.txt.)
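The counting and sorting steps described above can be sketched as follows. This is a minimal illustration and not the required solution: the function names count_senders and top_sender and the sample addresses are made up for the example, and it assumes that each message begins with exactly one line starting with "From " (with a trailing space), as in the mbox files used in class.

```python
def count_senders(lines):
    """Count messages per sender from an iterable of mbox lines."""
    counts = {}
    for line in lines:
        # Assumption: "From " (with a trailing space) marks one line per
        # message, while "From:" repeats the same address and is skipped.
        if not line.startswith("From "):
            continue
        words = line.split()
        if len(words) < 2:
            continue  # malformed line with no address
        address = words[1]
        counts[address] = counts.get(address, 0) + 1
    return counts


def top_sender(counts):
    """Return (count, email) for the most frequent sender, or None."""
    pairs = [(count, email) for email, count in counts.items()]
    pairs.sort(reverse=True)  # highest count first
    return pairs[0] if pairs else None


# Tiny hand-made sample (hypothetical addresses) that is easy to verify by eye.
sample = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "From: alice@example.com",
    "From bob@example.com Sat Jan  5 09:15:00 2008",
    "From: bob@example.com",
    "From alice@example.com Sat Jan  5 10:00:00 2008",
    "From: alice@example.com",
]
count, email = top_sender(count_senders(sample))
print(email, count)
```

Because the sample is small enough to count by hand, it also serves as a check that each message is counted once rather than twice.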
URL Reader
Rename the socket1.py program from our textbook to URL_reader.py.
Modify the URL_reader.py program to use urllib instead of a socket. This code should still process the incoming data in chunks, and the size of each chunk should be 512 characters. The idea is to allow for the processing of very large files, which may be too large to fit into working memory. You cannot use a chunk (buffer) that grows beyond the 512-character limit! Do not use the value 512 as a magic number. Rather, it should be established as a named constant.
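One possible shape for the chunked read is sketched below. The names CHUNK_SIZE, read_in_chunks, and fetch_chunks are illustrative choices, not names from the textbook. Note one assumption: the object returned by urllib.request.urlopen yields bytes, so this sketch measures chunks in bytes; whether you decode them to characters is up to your design.

```python
import io
import urllib.request

CHUNK_SIZE = 512  # named constant instead of a magic number


def read_in_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield successive chunks of at most chunk_size items from stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # an empty result signals the end of the data
        yield chunk


def fetch_chunks(url):
    """Yield the document at url one chunk at a time (never all at once)."""
    with urllib.request.urlopen(url) as response:
        yield from read_in_chunks(response)


# Offline demonstration with an in-memory stream standing in for a response.
demo = io.BytesIO(b"x" * 1300)
sizes = [len(chunk) for chunk in read_in_chunks(demo)]
print(sizes)  # → [512, 512, 276]
```

Only one chunk is held at a time, so the buffer never grows beyond CHUNK_SIZE regardless of the size of the document.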
Add code that prompts the user for the URL so it can read any web page.
Add error checking using try and except to handle the condition where the user enters an improperly formatted or non-existent URL.
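As a rough sketch of the error checking, the helper below wraps the call that opens the URL. The name open_url is hypothetical. It relies on two behaviors of urllib: a URL with no scheme raises ValueError, and an unknown scheme or unreachable host raises urllib.error.URLError (of which HTTPError is a subclass). Other exceptions may be possible for unusual input, so treat this as a starting point rather than a complete list.

```python
import urllib.error
import urllib.request


def open_url(url):
    """Return an open response for url, or None if it cannot be opened."""
    try:
        return urllib.request.urlopen(url)
    except ValueError as err:
        # Raised for input with no scheme, e.g. "unknown url type"
        print("Improperly formatted URL:", err)
    except urllib.error.URLError as err:
        # Raised for unknown schemes, unreachable hosts, and HTTP errors
        print("Could not retrieve URL:", err)
    return None


# Neither of these calls needs a network connection.
print(open_url("not a url"))
print(open_url("nosuchscheme://example"))
```

In the full program, the argument would come from the user, for example url = input("Enter a URL: "), and the program would stop or re-prompt when open_url returns None.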
Count the number of characters received, and stop displaying any text after it has shown exactly 3000 characters. Space characters, tab characters, and newline characters are characters, and should therefore be included in your count. Since your chunk size will not divide evenly into 3000, you will need to add some logic to ensure that exactly 3000 characters (no more, and no less) are displayed. When you print your blocks, it is okay for them to be separated by a newline character. Do not use magic numbers! The values 3000 and 512 should be established as named constants. Any other values that you might use should be derived from these two named constants. The idea is to allow the values 3000 and 512 to be adjusted (to 5000 and 256, for example) while still ensuring that your code operates correctly.
Continue to retrieve the entire document, count the total number of characters, and display (i.e., print) the total number of characters.
Note: Because you are printing characters one chunk at a time, and you need to stop printing when you reach 3000 characters, there's a point where you will need to print only the portion of the chunk that enables you to reach the 3000-character print limit. Printing this calculated portion of the last printable chunk is arguably the most challenging aspect of this assignment.
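The partial-chunk logic in the note can be sketched as shown below. The names DISPLAY_LIMIT and show_and_count are illustrative, and the demonstration reads from an in-memory text stream rather than a real web page. With a urllib response the chunks would be bytes and would need decoding before printing, which this sketch does not attempt.

```python
import io

CHUNK_SIZE = 512      # characters read per chunk
DISPLAY_LIMIT = 3000  # characters to display before going silent


def show_and_count(stream, chunk_size=CHUNK_SIZE, display_limit=DISPLAY_LIMIT):
    """Print the first display_limit characters; return (shown, total)."""
    shown = 0
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)  # keep counting even after display stops
        remaining = display_limit - shown
        if remaining > 0:
            # Slicing trims the last printable chunk to what is still allowed;
            # for earlier chunks the slice is simply the whole chunk.
            portion = chunk[:remaining]
            print(portion)
            shown += len(portion)
    return shown, total


demo = io.StringIO("a" * 4000)
shown, total = show_and_count(demo)
print("Characters shown:", shown)
print("Total characters:", total)
```

Because the amount still allowed is derived from the two constants on every pass, changing them (to 5000 and 256, for example) requires no other edits.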