To aid with a safe return to an in-class setting as the Covid-19 pandemic begins to subside, a university is planning to do extensive testing of people to see whether they have the virus, along with contact tracing, when new cases are identified. To aid in contact tracing, they are considering monitoring which students and employees are in which university locations when. The results will be stored in a database . When there is reason to believe that someone has 1 potentially become contagious (e.g. because they had a positive test or because they developed symptoms), this will be noted in the database and others who were near them will be identified so they can also be tested. You have been hired to help develop a prototype for the database and some queries and to do some performance analysis.
Rather than tracking exact locations and exact times when people are in locations, the designers have decided to "discretize" the problem, by dividing the campus into a set of locations, each with a unique locID and dividing time into a set of time slices, each with a unique tsID. For example a locID might represent a particular classroom or areas might be divided on a finer granularity with a locID representing part of a classroom. We'll assume that each spot on campus is part of only one location. Among other attributes, each person has a 2 unique pID and a status. The status attribute could indicate for example, that the person is 'OK, exposed, tested positive, recovered, etc.
An ER diagram for some aspects of the data model is shown on the next page.
Figure: see image.
It has entity sets representing People, Locations, and TimeSlots and a relationship set, WasIn, that represents who was where when. In addition, WasIn has an attribute that indicates what activity the person was doing in that location at that time (such as eating, listening to a lecture, engaging in a discussion class, etc). Note that WasIn is a ternary relationship set. Here is (an incomplete) SQL table definition for the corresponding table:
CREATE TABLE WasIn (
pID
INT,
locID INT,
tsID
INT,
ActivityCode INT
}
1. Add the PRIMARY KEY and FOREIGN KEY CONSTRAINTS to the definition of the WasIn Table. (Write your answer to the CREATE TABLE statement shown on the previous page.)
2. Choose one of the entity sets in the ER diagram and write SQL CREATE TABLE for it. If the data types aren't already determined by those shown above, choose reasonable data types for the attributes. Include any needed PRIMARY KEY and/or FOREIGN KEY CONSTRAINTS. Suggestion: You might want to look ahead to Problems 3 and 4, as youll need other table(s) for them.
There are several deficiencies of the current design. Among them:
Show how to modify the ER diagram to either
State which of these options you're choosing and clearly show all changes to the ER diagram. (If youd prefer, you may identify some other deficiency, describe it clearly, and add to the ER diagram to rectify it; it should be a change that involves adding a relationship set.)
Write your answer onto the ER diagram on the page 2 or add another page and write the whole modified ER diagram.
When someone tests positive for the virus, in addition to updating their status, we want to identify all people who were in the same location as them during the same time slice any time in the last 3 days. Write an SQL query to find the name and contactInfo of each person who was 3 in the same location at the same time as the person with pID = x during (roughly) the last 3 days. Use the last four digits of your N-number as x.
Hint: The query could (a) join two copies of WasIn OR (b) involve a copy of WasIn in the main query and another one in a subquery. Be careful about the join condition (if you're using approach (a)) or the attributes in the IN clause (if youre using approach (b)).
Think of some question to ask about the data, involving grouping and aggregation. Describe the question clearly in lay-person's terms and write an SQL query to answer the questions. Briefly describe how your query addresses your question. Heres an example, but do NOT use this one: find the number of people who have status "recovered" in each location during some specific time slot. (If youd like, you may define and use additional tables, such as those corresponding to your additions to the ER diagram in Problem 2).
Assume the WasIn table occupies 32 million (32,000,000) disk blocks and its records are stored sorted in lexicographic order by (tsID, locID).
Assume that there is B+ tree index on the pairs (tsID, locID). In other words, the search keys are pairs consisting of a tsID and a locID. Assume this tree has height 4 (the root, three other levels of internal nodes, and the leafs).
Assume that each block access (seek plus rotational latency, which we've been denoting tS ) takes about 10 ms and that each block transfer (which weve been denoting tT ) takes about 0.1 ms.
Consider the following query:
SELECT pID FROM WasIn WHERE locID = 1234 AND tsID = 9678
1. Write a sentence that a layperson could understand, explaining what the query finds.
2. Estimate the time to execute the query using each of the following algorithms. (Your answers should show formulas in terms of tS, tT, and the relevant actual numbers of blocks that need seeks and transfers AND also should show the time in milliseconds, under the assumptions given):
a. Do a sequential scan of the entire WasIn file, checking each record to see whether it has the given llocID and tsID.
Number of seeks:
Number of block transfers:
Formula:
Time in milliseconds:
b. Use the B+ tree index to search for the given locID andtsID. Once it's found, scan some additional contiguous blocks of the file. Let bFound denote the number of blocks that have data with locID 1234 and tsID 9678 and assume this number is less than 3.
Number of seeks of blocks of the B+ tree:
Number of block transfers of blocks of the B+ tree:
Number of seeks of blocks of the WasIn data file:
Number of block transfers of blocks of the WasIn data file:
Formula:
Time in milliseconds:
3. Briefly, but clearly explain why the algorithm in 2b works: Why is it sufficient to do only one search with the tree and then scan a few contiguous blocks in order to find all the records that satisfy the condition in the WHERE clause?