Distributed Query Processing

In Distributed Query Processing, the process of querying data across multiple networked sites (or nodes) is optimized to manage the additional challenges of data distribution. Here are some core points about how it differs from centralized systems:

Cost Concerns:
- In centralized systems, query processing costs are mostly about disk access time—how quickly the system can retrieve data from storage.
- In a distributed system, we also need to account for data transmission costs over the network. This is because data might need to move between sites, which introduces additional delays and bandwidth costs.
Network Data Transmission:
- Each time data is transmitted between sites, there is a potential bottleneck since network speeds can vary.
- The system must carefully decide where and how much data to transmit to avoid overloading the network, especially if multiple queries are processed simultaneously.
Query Speedup Potential:
- Distributed systems can potentially speed up query processing by leveraging parallel processing. Since multiple nodes can process different parts of a query simultaneously, the overall query response time can be reduced.
- However, this speedup is only effective if network transmission costs are managed carefully. If the network delays are too high, it can offset the gains from parallelism.

Query Transformation

In Query Transformation within distributed databases, the main task is to rewrite a query so it efficiently retrieves data spread across multiple sites. When relations are fragmented (split across different locations based on certain criteria), the database must reconstruct these pieces or decide which fragments to access to answer a query.

Example of Query Transformation with Horizontal Fragmentation

Let’s say we have a table called account, but it’s horizontally fragmented across two sites based on branch location:

account1: contains records where branch_name = "Hillside".
account2: contains records where branch_name = "Valleyview".

Now, suppose we want to execute the following query: SELECT * FROM account WHERE branch_name = "Hillside"; This query needs only the records from the Hillside branch. In a centralized database, the query would directly access account and filter by branch_name. But in a distributed setting, the database will:

Rewrite the Query: The original query, SELECT * FROM account WHERE branch_name = "Hillside", is transformed to consider all fragments: SELECT * FROM (account1 UNION account2) WHERE branch_name = "Hillside";
Optimize the Query: Instead of retrieving data from both fragments, the query is optimized to: SELECT * FROM account1 WHERE branch_name = "Hillside";

Simple Join Processing

When performing joins in a distributed database where tables are stored at different sites, Simple Join Processing involves strategies to efficiently move and process data across these sites. Here’s a breakdown of how the process works and how semijoins can optimize it.

Example Scenario

Consider three tables:
- account is stored at site S1.
- depositor is stored at site S2.
- branch is stored at site S3. Suppose we need to join these three tables for a query initiated at site S1 and output the final result at site S1.

Possible Strategies for Processing

Strategy 1: Ship all data to Site S1.
- Move all copies of account, depositor, and branch tables to site S1.
- Perform the join locally at S1.
- Pros: Straightforward, as all data is available in one place.
- Cons: Potentially expensive in terms of data transfer if the tables are large, as it involves moving all data across the network.
Strategy 2: Process joins step-by-step across sites.
- Step 1: Ship a copy of account from S1 to S2.
- Step 2: Compute temp1 = account ⨝ depositor at S2.
- Step 3: Ship temp1 to S3 and compute temp2 = temp1 ⨝ branch at S3.
- Step 4: Ship temp2 from S3 back to S1 for the final result.
- Pros: Reduces data movement by joining only relevant data at each step.
- Cons: Still may involve shipping large intermediate results, and processing speed at each site may affect performance.
Semijoin Strategy:
- In a semijoin, only the necessary rows from each relation are shipped between sites, reducing the amount of data transferred.
  - Step 1: Project the columns in account that are needed to join with depositor, creating temp1 at S1 (select only keys that match across tables).
  - temp1 = π_{common_columns}(account) at S1.
  - Step 2: Ship temp1 to S2.
  - Step 3: At S2, join depositor with temp1 to filter only the rows in depositor that match with account, creating temp2.
  - temp2 = depositor ⨝ temp1 at S2.
  - Step 4: Ship temp2 back to S1.
  - Step 5: At S1, join account with temp2 to produce the final result.
Formal Definition of a Semijoin
- The semijoin of r1 with r2 (written as r1 ⋉ r2) selects the tuples of r1 that would contribute to the final join result with r2.
- In formal terms, the semijoin r1 ⋉ r2 is:
  - π_{R1}(r1 ⨝ r2)
  - where π_{R1} selects only the relevant columns from r1.