ummary: Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed, user defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 hours.
This lab is intended to study variations of hashing. Since the interest is primarily algorithmic efficiency, we will ignore physical device considerations. You will have 2 hashing schemes: mine and yours. Mine has three variations. You will have three collision handling strategies: linear probing, quadratic probing, and chaining. You will have two variations on the table structure: bucket of size one or buckets of size three. The table below shows all the combinations that you need to handle.
Hashing Scheme | Bucket Size | Collision Handling | Print Data Items Accross | |
1 | Division modulo 120 | bucket size = 1 | Linear Probing | 5 |
2 | Division modulo 120 | bucket size = 1 | Quadratric Probing | 5 |
3 | Division modulo 120 | bucket size = 1 | Chaining | 5 |
4 | Division modulo 113 | bucket size = 1 | Linear Probing | 5 |
5 | Division modulo 113 | bucket size = 1 | Quadratic Probing | 5 |
6 | Division modulo 115 | bucket size = 1 | Chaining | 5 |
7 | Division modulo 41 | bucket size = 3 | Linear Probing | 3 |
8 | Division modulo 41 | bucket size = 3 | Quadratic Probing | 3 |
9 | Your hashing scheme | bucket size = 1 | Linear Probing | 5 |
10 | Your hashing scheme | bucket size = 1 | Quadratic Probing | 5 |
11 | Your hashing scheme | bucket size = 1 | Chaining | 5 |
ANALYSIS: The written analysis is extremely important. Discuss the ramifications of the different hashing and collision resolution techniques. Compare the schemes and figure out what is good and bad about each one. What would you do to address some of the problems you encountered? What did you learn about load factors? What are the ramifications of deleting items from your hash table? Why is it useful and relevant to Bioinformatics?