Cost estimation provides the foundation for query optimization by estimating the resource requirements of different query execution plans. Query optimization uses these cost estimates to select the most efficient plan, leading to faster query processing and better overall system performance.
Understanding Cost Estimation
Cost estimation refers to the process of determining the resources required to execute a database query. The DBMS uses cost estimation to analyze multiple execution plans for a query and selects the one that minimizes resource usage.
- At this stage, no commitment to a particular physical plan
- Estimate the cost of each operator in terms of the size relation(s) on which it operates
- Choose a logical query plan that minimises the size of the intermediate relations (= minimises the cost of the plan)
- Assumption: system catalogue stores statistics about each relation
The Cost Estimation Process
The process of cost estimation typically involves the following steps:
-
Parsing: The query is parsed, and a syntax tree is generated.
-
Query Rewriting: The syntax tree is rewritten to optimize logical operations and to make it suitable for cost evaluation.
-
Plan Generation: The optimizer generates potential execution plans, which are different strategies that the database can use to execute the query.
-
Cost Evaluation: Each execution plan is evaluated for cost based on statistical information about the database contents.
-
Plan Selection: The plan with the lowest estimated cost is chosen for execution.
Factors Affecting Cost Estimation
-
Data Distribution: The distribution of data, including the number of rows in a table and the distinct values of columns, can significantly affect the cost.
-
Indices: The presence or absence of indices on the columns involved in the query can drastically change the cost estimation.
-
Join Operations: Different join methods (e.g., nested loops, hash joins, merge joins) have varying costs depending on the size of the tables and the existence of indices.
-
Query Complexity: The number of operations in a query, such as subqueries, aggregations, and sorting, can influence the cost.
The Core of Cost Estimation: Statistical Information of Relation
- In Cost Estimation for databases,
T(R)
andV(R, A)
are two commonly used notations to denote statistical information about a relationR
.T(R)
: Number of tuples in relationR
(cardinality ofR
)V(R, A)
: Number of distinct values for attributeA
in relationR
- Note: for any relation
R
,V(R, A) ≤ T(T)
for all attributesA
onR
Transformation Rules
- Scan:
- Operation of reading all tuples of a relation
T(scan(R)) = T(R)
- For all
A
inR
,V(scan(R),A) = V(R,A)
- Product:
T(RxS)=T(R)T(S)
- For all
A
inR
,V(RxS,A) = V(R,A)
- For all
B
inS
,V(RxS,B) = V(S,B)
- For all
- Projection
T(πA(R)) = T(R)
V(πA(R)),A) = V(R,A)
- Assumption: projection does not remove duplicate tuples (value counts don’t change)
- Selection (attr=val):
T(σ A=c (R)) = T(R) / V(R, A)
V(σ A=c (R),A) = 1
- Assumption: all values of
A
appear with equal frequency
- Assumption: all values of
- Conjunction:
T(σ A=c1⋀B=c2 (R)) = T(R) / V(R, A) V(R, B)
- Disjunction:
T(σ A=c1∨B=c2 (R)) = T(R) / V(R, A) + T(R) / V(R, B)
- This overestimates the number of tuples. Alternatively,
T(σ A=c1∨B=c2 (R)) = T(R) (1 - 1 / V(R, A)) (1 - 1 / V(R, B))
- This overestimates the number of tuples. Alternatively,
- Selection (attr=attr):
T(σ A=B (R)) = T(R) / max(V(R, A), V(R, B))
V(σ A=B (R), A) = V(σ A=B (R),B) = min(V(R, A),V(R, B))
- Assumption: all values of
A
appear with equal frequency. And all values ofB
appear with equal frequency
- Assumption: all values of
- Join: