Usually when you are sorting a list of objects, those objects are held in memory and can be easily manipulated. However this is not always the case. It is possible that you may want to sort a list that is larger than the amount of memory you have available. These situations are fairly common in larger databases. To sort in in this situation, you have to sort the information in stages, making sure that you never exceed your available memory at any point in time.
In part 1, you will be given a single large file to sort in ascending order (lexicographically). For the purposes of this project the list will contain a list of strings to sort alphabetically, but the algorithm itself can be easily changed to work with any data type you wish. This file is a text file with each entry in one line. Sorting in this project will be done in several stages.
1. Read in the large file in chunks according to a specified number of lines
2. Sort the chunk using an nlogn sorting algorithm (e.g., quicksort, mergesort, heapsort) and output it as a temporary file (note that quicksort is not an nlogn sorting algorithm but acts like one in practice).
3. Open all temporary files and perform a kwise merge (defined later in this document) to an output file
Example:
For this example, let's imagine that our large file has 12 entries
abbc
aca
aab
aaa
aaac
acb
bac
acccccccc
aba
abb
bab
bbbaa
Assume we are told to sort in chunks of 3. We read in the file 3 lines at a time, sort the 3 items, then write the sorted list to a file. This is repeated until the file is empty.
aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa
Finally, we perform a 4way merge. The method is very similar to the merging technique you have seen in class. The only difference is that instead of writing the smallest item in two lists to the sorted larger list, you pick the smallest of k items to write to the new list. The pseudocode for this is below:
kMerge(int k, followed by k lists):
Create k pointers pointing to the start of the k lists
Open an output file
While not done:
Write the smallest of the items pointed to to the output file
increment the pointer that was just used
Input/output format:
Your program will be called with the four additional command line arguments
Note: You may assume the output directory for final sorted file and chunks will exist. Let's run through that with our given example. Pointers will be represented as underlined strings: Stage 0:
Temporary Files
aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa
Output
Stage 1: Select the first element and write to output. Increment that pointer
Temporary Files
aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa
Output
aaa
Stage 2: Repeat a second time.
Temporary Files
aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa
Output
aaa aab
Continue n times, until all pointers are exhausted:
Stage n: The sort is complete
Temporary Files
aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa
Final Output
aaa
aaac
aab
aba
abb
abbc
aca
acb
acccccccc
bab
bac
bbbaa
In the next part, you will implement the ability to search for items in the large sorted list. Searching in a large file is a bit trickier than searching in memory. Use binary search, but instead of searching in memory, you will search in a file. To accomplish this, you will need to move the file pointer around until you find the desired item. The command to do this with a file pointer in C++ is fseek(). You can read its documentation at http://www.cplusplus.com/reference/cstdio/fseek/ .
A second tricky part of this problem is that the file pointer does not move according to lines. It instead jumps to a character offset in the file. To get around this problem, when you jump to a position, you will need to look a few characters ahead or behind to find the next newline character. Take special care at the beginning of the file, the end of the file, and the endpoints of the search segment. You may assume the last line ends with a new line('n').
Input/Output Format (Part 2) : Your program will be called with 2 as the first argument and the path to the file as the second argument and additional command line arguments for each search term.
Note: You can get the length of input argument with argc F
or example to search in the file produced in part 1, ./program 2 outputtest1.txt aaa aba car
Output of your program will be n lines containing either found or missing
Sample Output:
found
found
missing
Note: there should be 'n' after each search answer (even the last one no special case for it). (in addition to the correctly stored out1.txt. Please see the provided testcases on Piazza for this file)
Things to note: