In multi-threads programming, threads can perform the same or different roles. In some multithreading scenarios like producer-consumer, producer and consumer threads have different functionality. In other multithreading scenarios like many parallel algorithms, threads have almost the same functionality. In this programming assignment, we want to work on both sides in problem "word count statistics", which is the multithreaded version of MapReduce application in programming assignment 2.
An example problem is shown in Fig. 1 below. A text file contains several lines of English characters, and this application will count the histogram of the starting letter for each word.
Fig. 1, an example of "word count statistics" see image.
One idea to solve this problem is to cut the file into smaller pieces and hand it over to many workers. Then in this programming assignment, we will use multithreading to create a producer to read the file and multiple consumers to process the smaller piece data. We will have two forms of synchronization: a shared queue synchronized between producer and consumers, and a global result histogram synchronized by consumers.
Topics covered: POSIX-threading, synchronization, producer-consumer model, file IO and string processing.
In multithreading "word count statistics" application, the program contains: a master thread, one producer thread and many consumer threads (shown in Fig. 2 below). The producer and consumers will share a queue for data transfering. For program execution flow, the entire program contains 4 parts: 1) the master thread initializes the shared queue, result histogram, producer thread and consumer threads; 2) the producer thread will read the input file, cut the data into smaller pieces and feed into the shared queue; 3) the consumers will read from the shared queue, compute the word count statistics for its data pieces and synchronize the result with a global histogram; 4) after producer and all consumers complete their work, the master thread writes the final result into the output file.
Fig. 2, 4 parts of multithreading "word count statistics" see image.
In the next section, we will go through the implementation details of this project. The implementation requirements will be provided.
Before we go into each part, the application comparison between PA2 (Programming Assignment 2) and PA3 (Programming Assignment 3) needs to be clarified. Even though we use the same application "word count statistics" in PA2, there are several key differences in the input file and processing flows: 1) PA2 works on different files located in different directories, but PA3 takes only one file as your input file. 2) In PA2, each file contains multiple lines and each line contains only one word. But PA3's input file is more natural , which contains multiple words in each line and the file contains multiple lines. 3) PA2 uses multi-processing, which means the parent forks several children and seperate the workload. But in PA3, you only have one process but multiple threads . PA3 is mainly focused on the usage of threads and thread-safe data structures.
Shared Queue
The core of this multithreading "word count statistic" application is a thread-safe shared queue. This shared queue should be implemented as a linked-list unbounded buffer . The producer inserts the data in the tail of the linked-list and the consumer extracts the data from the head. Also, it should be implemented in a non-busy-waiting way (use mutex lock + conditional variable or semaphore).
Master Initialization
In this stage, the program needs to check the input arguments and print error messages if argument mismatch (see Section 6 execution for arguments). Then it should perform the initialization on data structure and launch the producer/consumers thread. Then the master will wait for all threads to join back (4.4 Master Finalize).
Producer functionality
The main functionality of the producer thread is to read the input file and pass the data into the shared queue (by a package). The file is required to be read by line granularity (one line at a time), thus each consumer will work on one line at a time. Since there are multiple consumers, if the EOF is reached, the producer should send notifications to consumers specifying there will be no more data. The producer terminates after sending those notifications. Note that the package is the information transferred between producer/consumers via shared queue. It should contain the data for consumers and other information if needed.
Consumer functionality
The consumer will repeatedly check the queue for a new package, work on it for word count statistics , and then go back to get the next package. This will continue until receiving the EOF notification, then it will terminate. Besides package handling, it's the consumers responsibility to synchronize the result with the global final histogram. Then after all consumers finish their work, the master thread could summarize it and generate the output.
The word count statistics will check the beginning character of a word. The definition of a word is a continuous a-zA-Z character (shown in Fig. 3 below). Note that, all other characters are treated as space. The characters like "-", _ are not connecting words. Same as PA2, we don't differentiate between uppercase and lowercase letters.
Fig. 3, the word count statistics function see image.
Master Finalize
After the producer and consumers have joined back, the master thread will write the final result (global histogram) into the output file named "result.txt" . The output format should be: "%c: %d\n" , and it will loop through the global histogram from 'a' to z (shown in Fig. 4 below).
Fig. 4, the output file format see image.
Log Printout
The program will also print a log file if the argument option is specified . The log file should be named as "log.txt" . The producer and consumers should print their execution information into the log:
Producer:
1. Print "producer\n" when the producer is launched
2. Print "producer: %d\n" for the current line number (starts from 0)
Consumer:
1. Print "consumer %d\n" when launched, with the consumer id (0 to number of consumers minus 1)
2. Print "consumer %d: %d\n" for the consumer id and the line number it currently works on.
There are some other notes when writing the log:
The extra credits will be granted if a bounded buffer shared queue is implemented. The application could choose the unbounded/bounded buffer by the commandline argument option. Also the bounded buffer should have a configurable buffer size specified by commandline argument option.