shareengineer: 1.4 QUERY PROCESSING

Ø using client-server architecture

Ø user creates query

Ø client parses and sends to server(s) (SQL?)

Ø servers return appropriate Tables

Ø client combines into one Table

Ø Issue of data transfer cost over a network

o optimise the query to transfer the least amount

Query Processing Components

Ø Query language that is used

o SQL: “intergalactic dataspeak”

Ø Query execution methodology

o The steps that one goes through in executing high-level (declarative) user queries.

Ø Query optimization

o How do we determine the “best” execution plan?

Query Optimization Objectives

Ø Minimize a cost function

§ I/O cost + CPU cost + communication cost

Ø These might have different weights in different distributed environments

Ø Wide area networks

o communication cost will dominate

§ low bandwidth

§ low speed

§ high protocol overhead

o most algorithms ignore all other cost components

Ø Local area networks

o communication cost not that dominant

o total cost function should be considered

Ø Can also maximize throughput

Query Optimization Issues – Types of Optimizers

Ø Exhaustive search

o cost-based

o optimal

o combinatorial complexity in the number of relations

Ø Heuristics

o not optimal

o regroup common sub-expressions

o perform selection, projection first

o replace a join by a series of semijoins

o reorder operations to reduce intermediate relation size

o optimize individual operations

Optimization Granularity

Ø Single query at a time

o cannot use common intermediate results

Ø Multiple queries at a time

o efficient if many similar queries

o decision space is much larger

Optimization Timing

Ø Static

o compilation Þ optimize prior to the execution

o difficult to estimate the size of the intermediate results Þ error propagation

o can amortize over many executions

o R*

Ø Dynamic

o run time optimization

o exact information on the intermediate relation sizes

o have to reoptimize for multiple executions

o Distributed INGRES

Ø Hybrid

o compile using a static algorithm

o if the error in estimate sizes > threshold, reoptimize at run time

o MERMAID

Statistics

Ø Relation

o cardinality

o size of a tuple

o fraction of tuples participating in a join with another relation

Ø Attribute

o cardinality of domain

o actual number of distinct values

Ø Common assumptions

o independence between different attribute values

o uniform distribution of attribute values within their domain

Decision Sites

Ø Centralized

o single site determines the “best” schedule

o simple

o need knowledge about the entire distributed database

Ø Distributed

o cooperation among sites to determine the schedule

o need only local information

o cost of cooperation

Ø Hybrid

o one site determines the global schedule

Ø each site optimizes the local subqueries

Network Topology

Ø Wide area networks (WAN) – point-to-point

o characteristics

§ low bandwidth

§ low speed

§ high protocol overhead

o communication cost will dominate; ignore all other cost factors

o global schedule to minimize communication cost

o local schedules according to centralized query optimization

Ø Local area networks (LAN)

o communication cost not that dominant

o total cost function should be considered

o broadcasting can be exploited (joins)

o special algorithms exist for star networks

Step 1 – Query Decomposition

§ Input : Calculus query on global relations

Ø Normalization

o manipulate query quantifiers and qualification

Ø Analysis

o detect and reject “incorrect” queries

o possible for only a subset of relational calculus

Ø Simplification

o eliminate redundant predicates

Ø Restructuring

o calculus query Þ algebraic query

o more than one translation is possible

o use transformation rules

Step 2 – Data Localization

Ø Input: Algebraic query on distributed relations

Ø Determine which fragments are involved

Ø Localization program

o substitute for each global query its materialization program

o optimize

Step 3 – Global Query Optimization

Ø Input: Fragment query

Ø Find the best (not necessarily optimal) global schedule

o Minimize a cost function

o Distributed join processing

§ Bushy vs. linear trees

§ Which relation to ship where?

§ Ship-whole vs ship-as-needed

o Decide on the use of semijoins

§ Semijoin saves on communication at the expense of more local processing.

o Join methods

§ nested loop vs ordered joins (merge join or hash join)

Centralized Query Optimization

Ø INGRES

o dynamic

o interpretive

Ø System R

o static

o exhaustive search

12 Rules of DDBMS (Date, 1987)

1. Local autonomy

2. No reliance on a central site

3. Continuous operation

4. Location independence

5. Fragmentation independence

6. Replication independence

7. Distributed Query processing

8. Distributed transaction processing

9. Hardware independence

10. Operating System independence

11. Network independence

12. Database independence

shareengineer

Pages

Translate

Sunday, January 6, 2013

1.4 QUERY PROCESSING

No comments:

Post a Comment