it2051229 External Sorting and Searching

Overview:

Usually when you are sorting a list of objects, those objects are held in memory and can be easily manipulated. However this is not always the case. It is possible that you may want to sort a list that is larger than the amount of memory you have available. These situations are fairly common in larger databases. To sort in in this situation, you have to sort the information in stages, making sure that you never exceed your available memory at any point in time.

Part 1: External Sorting

In part 1, you will be given a single large file to sort in ascending order (lexicographically). For the purposes of this project the list will contain a list of strings to sort alphabetically, but the algorithm itself can be easily changed to work with any data type you wish. This file is a text file with each entry in one line. Sorting in this project will be done in several stages.

1. Read in the large file in chunks according to a specified number of lines

2. Sort the chunk using an nlogn sorting algorithm (e.g., quicksort, mergesort, heapsort) and output it as a temporary file (note that quicksort is not an nlogn sorting algorithm but acts like one in practice).

3. Open all temporary files and perform a kwise merge (defined later in this document) to an output file

Example:

For this example, let's imagine that our large file has 12 entries

abbc
aca
aab
aaa
aaac
acb
bac
acccccccc
aba
abb
bab
bbbaa

Assume we are told to sort in chunks of 3. We read in the file 3 lines at a time, sort the 3 items, then write the sorted list to a file. This is repeated until the file is empty.

aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa

Finally, we perform a 4way merge. The method is very similar to the merging technique you have seen in class. The only difference is that instead of writing the smallest item in two lists to the sorted larger list, you pick the smallest of k items to write to the new list. The pseudocode for this is below:

kMerge(int k, followed by k lists):
Create k pointers pointing to the start of the k lists
Open an output file
While not done:
Write the smallest of the items pointed to to the output file
increment the pointer that was just used

Input/output format:

Your program will be called with the four additional command line arguments

The number 1 indicating you should run the code for part 1.
The name of the file to sort.
The name of the file where the sorted list should be stored.
The number of lines per chunk.
The directory to save the chunks in Note: You can use argv[2] argv[3] to access the second and third argument respectively. The argv[4] is the number of lines per chunk. Output of your program will be a single file which name is given as third argument. For example, ./program 1 testdata/inputtest1.txt outputtest1.txt 3 testdata/chunks/ Chunks should be named in the following format (exactly five digits): chunks/chunk< chunk number > READ NEXT PAGE for example of naming chunk The four files created in this example would be
chunks/chunk00001
chunks/chunk00002
chunks/chunk00003
chunks/chunk00004

Note: You may assume the output directory for final sorted file and chunks will exist. Let's run through that with our given example. Pointers will be represented as underlined strings: Stage 0:

Temporary Files
aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa

Output

Stage 1: Select the first element and write to output. Increment that pointer

Temporary Files
aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa

Output

aaa

Stage 2: Repeat a second time.

Temporary Files
aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa

Output

aaa aab

Continue n times, until all pointers are exhausted:

Stage n: The sort is complete

Temporary Files
aab abbc aca
aaa aaac acb
aba acccccccc bac
abb bab bbbaa

Final Output

aaa
aaac
aab
aba
abb
abbc
aca
acb
acccccccc
bab
bac
bbbaa

Part 2: Searching in a large file

In the next part, you will implement the ability to search for items in the large sorted list. Searching in a large file is a bit trickier than searching in memory. Use binary search, but instead of searching in memory, you will search in a file. To accomplish this, you will need to move the file pointer around until you find the desired item. The command to do this with a file pointer in C++ is fseek(). You can read its documentation at http://www.cplusplus.com/reference/cstdio/fseek/ .

A second tricky part of this problem is that the file pointer does not move according to lines. It instead jumps to a character offset in the file. To get around this problem, when you jump to a position, you will need to look a few characters ahead or behind to find the next newline character. Take special care at the beginning of the file, the end of the file, and the endpoints of the search segment. You may assume the last line ends with a new line('n').

Input/Output Format (Part 2) : Your program will be called with 2 as the first argument and the path to the file as the second argument and additional command line arguments for each search term.

Note: You can get the length of input argument with argc F

or example to search in the file produced in part 1, ./program 2 outputtest1.txt aaa aba car

Output of your program will be n lines containing either found or missing

Sample Output:

found
found
missing

Note: there should be 'n' after each search answer (even the last one no special case for it). (in addition to the correctly stored out1.txt. Please see the provided testcases on Piazza for this file)

Things to note:

There is no guarantee that the input file is an even multiple of the chunk size. Your last chunk may be a bit smaller.
We will use test cases where you will be unable to load the entire list into memory at the same time. However you will have plenty of memory to handle one chunk at a time and deal with the kwise merge.
Your search algorithm should run in log(n) time.
Your memory and time will be limited in the testcases. Ensure that your program meets efficiency requirements.
Be sure to free or delete any memory you are finished using.

Academic Honesty!

It is not our intention to break the school's academic policy. Posted solutions are meant to be used as a reference and should not be submitted as is. We are not held liable for any misuse of the solutions. Please see the frequently asked questions page for further questions and inquiries.

Kindly complete the form. Please provide a valid email address and we will get back to you within 24 hours. Payment is through PayPal, Buy me a Coffee or Cryptocurrency. We are a nonprofit organization however we need funds to keep this organization operating and to be able to complete our research and development projects.