Query Evaluation Steps

  1. SQL query
    • parse query into an internal format, usually a tree
    • perform lots of checks using catalog, etc.
      • syntax
      • object (database, schema, table) names
      • constraints
      • permissions
      • etc
  2. Parse and rewrite queries
    • view rewrite
    • subquery flattening
    • heuristic
    • etc
  3. Choose a logical plan: extended relational algebra tree
    • include more operators that are closer to physical implementation than pure relational algebra operators
    • duplicate elimination operator
    • grouping and aggregation operator
    • sorting
    • etc
  4. Choose a physical plan: logical plan with detailed annotations at each node
    • access method to use for each relation
    • implementations to use for each operator
    • scheduling decisions for operators
  5. Execute the plan
    • synchronize operators and pass data between operators
    • usually use iterator interface plus either
      • pipelined execution or
      • intermediate results materialization

Access Methods

  • File scan: scan tuples one at a time
  • Index scan: clustered or unclusterd
    • Hash index: efficient selection on equality predicates
    • Tree index: efficient selection on equality or range predicates

Cost Model

  • Usually I/O, CPU, and network bandwidth

Access Path Selectivity

It can be defined in multiple ways. You can consider it as the number of pages retrieved if we use this access path. Most selective retrieves fewest pages. Or as the factor used to reduce tuples from results, the smaller the factor, the more pages it will disgard.

  • Selectivity for equality search: reduction factor $\frac{1}{number\ of\ distict\ values}$
  • Selectivity for range search: reduction factor $\frac{selected\ range}{total\ range}$
  • Selectivity with multiple conditions: either by multiplication, or by addition

Join Operator Implementation

  • Logical operator: Join
  • Physical operator: the implementations of Join
One Pass Index Based Two Pass
Hash Join   Partitioned/External Hash Join
Nested Loop Join Index Nested Loop Join  
Sort-Merge Join   External Sort-Merge Join

When operations fit in memory, we can do one pass algorithms as shown in the table. Otherwise, we have to do join in separate passes with the help of either partitioned/external hashing or external sorting.

Hash Join

  • Scan R, build hash table in main memory
  • Then scan S, serach hash table and join

Nested Loop Join

Do nested loop on tuple level.

for tuple r in R do 
	for tuple s in S do 
		if r and s can join
		then output (r, s)

Sort-Merge Join

  • Scan R and sort in memory
  • Scan S and sort in memory
  • Merge R and S

Index Nested Loop Join

  • Assume S has an index on join attribute
  • Iterate over R, for each tuple fetch corresponding tuple from S through index
  • Cost may vary. Depends on index clustered or unclusterd

External/Partitioned Hash Join

External hashing is used in a variety of places not requiring order, including GROUP BY, DISTINCT and some operators like join.

  • Hash R using $h_1$ into buckets, and write to disk.
  • Hash S using $h_1$ into the same number of buckets, and write to disk
  • Load in a partition of R, build hash table using fine-grained $h_2$
  • Load in the corresponding partition of S, probe hash table and join.

External Sort-Merge Join

External sort is used in a variety of places, including ORDER BY, bulk-loading into B+ Tree, and some operators like join.

Sorting:

  • In first pass, load pages of data, sort using some sorting algorithm, and produce sorted runs
  • In later passes, merge sorted runs

Buffering

To reduce I/O wait cost, use double buffering for sorting/merging.

Join:

  • Sort R and S on the join attributes
  • Read both relations in sorted order, and join