Ø using client-server
architecture
Ø user creates query
Ø client parses and
sends to server(s) (SQL?)
Ø servers return
appropriate Tables
Ø client combines
into one Table
Ø Issue of data
transfer cost over a network
o
optimise the query to transfer the least amount
Query
Processing Components
Ø Query language
that is used
o SQL:
“intergalactic dataspeak”
Ø Query execution
methodology
o The steps that
one goes through in executing high-level (declarative) user queries.
Ø Query
optimization
o
How do we determine the “best” execution plan?
Query
Optimization Objectives
Ø Minimize a cost
function
§ I/O cost + CPU
cost + communication cost
Ø These might have
different weights in different distributed environments
Ø Wide area
networks
o communication
cost will dominate
§ low bandwidth
§ low speed
§ high protocol
overhead
o most algorithms
ignore all other cost components
Ø Local area
networks
o communication
cost not that dominant
o total cost
function should be considered
Ø Can also maximize
throughput
Query
Optimization Issues – Types of Optimizers
Ø Exhaustive search
o cost-based
o optimal
o combinatorial
complexity in the number of relations
Ø Heuristics
o not optimal
o regroup common
sub-expressions
o perform
selection, projection first
o replace a join by
a series of semijoins
o reorder
operations to reduce intermediate relation size
o optimize
individual operations
Optimization
Granularity
Ø Single query at a
time
o cannot use common
intermediate results
Ø Multiple queries
at a time
o efficient if many
similar queries
o decision space is
much larger
Optimization
Timing
Ø Static
o compilation Þ
optimize prior to the execution
o difficult to
estimate the size of the intermediate results
Þ error propagation
o can amortize over
many executions
o R*
Ø Dynamic
o run time
optimization
o exact information
on the intermediate relation sizes
o have to
reoptimize for multiple executions
o Distributed
INGRES
Ø Hybrid
o compile using a
static algorithm
o if the error in
estimate sizes > threshold, reoptimize at run time
o MERMAID
Statistics
Ø Relation
o cardinality
o size of a tuple
o fraction of
tuples participating in a join with another relation
Ø Attribute
o cardinality of
domain
o actual number of
distinct values
Ø Common
assumptions
o independence
between different attribute values
o uniform
distribution of attribute values within their domain
Decision Sites
Ø Centralized
o single site
determines the “best” schedule
o simple
o need knowledge
about the entire distributed database
Ø Distributed
o cooperation among
sites to determine the schedule
o need only local
information
o cost of
cooperation
Ø Hybrid
o one site
determines the global schedule
Ø each site
optimizes the local subqueries
Network
Topology
Ø Wide area
networks (WAN) – point-to-point
o characteristics
§ low bandwidth
§ low speed
§ high protocol
overhead
o communication
cost will dominate; ignore all other cost factors
o global schedule
to minimize communication cost
o local schedules
according to centralized query optimization
Ø Local area
networks (LAN)
o communication
cost not that dominant
o total cost
function should be considered
o broadcasting can
be exploited (joins)
o special
algorithms exist for star networks
Step 1 – Query
Decomposition
§ Input : Calculus query on global relations
Ø Normalization
o manipulate query
quantifiers and qualification
Ø Analysis
o detect and reject
“incorrect” queries
o possible for only
a subset of relational calculus
Ø Simplification
o eliminate
redundant predicates
Ø Restructuring
o calculus query Þ
algebraic query
o more than one
translation is possible
o use
transformation rules
Step 2 – Data
Localization
Ø Input: Algebraic query on distributed relations
Ø Determine which
fragments are involved
Ø Localization
program
o substitute for
each global query its materialization program
o optimize
Step 3
– Global Query Optimization
Ø Input: Fragment query
Ø Find the best
(not necessarily optimal) global schedule
o Minimize a cost
function
o Distributed join
processing
§ Bushy vs. linear
trees
§ Which relation to
ship where?
§ Ship-whole vs
ship-as-needed
o Decide on the use
of semijoins
§ Semijoin saves on
communication at the expense of more local processing.
o Join methods
§ nested loop vs
ordered joins (merge join or hash join)
Centralized
Query Optimization
Ø INGRES
o dynamic
o interpretive
Ø System R
o static
o exhaustive search
12 Rules of DDBMS (Date, 1987)
1. Local autonomy
2. No reliance on a
central site
3. Continuous
operation
4. Location
independence
5. Fragmentation
independence
6. Replication
independence
7. Distributed Query
processing
8. Distributed
transaction processing
9. Hardware independence
10.
Operating System independence
11.
Network independence
12.
Database independence