Translate

Sunday, January 6, 2013

1.4 QUERY PROCESSING




Ø using client-server architecture
Ø user creates query
Ø client parses and sends to server(s) (SQL?)
Ø servers return appropriate Tables
Ø client combines into one Table
Ø Issue of data transfer cost over a network
o   optimise the query to transfer the least amount




Query Processing Components

Ø Query language that is used
o   SQL: “intergalactic dataspeak”
Ø Query execution methodology
o   The steps that one goes through in executing high-level (declarative) user queries.
Ø Query optimization
o   How do we determine the “best” execution plan?
  

Query Optimization Objectives
Ø Minimize a cost function
§  I/O cost + CPU cost + communication cost
Ø These might have different weights in different distributed environments
Ø Wide area networks
o   communication cost will dominate
§  low bandwidth
§  low speed
§  high protocol overhead
o   most algorithms ignore all other cost components
Ø Local area networks
o   communication cost not that dominant
o   total cost function should be considered
Ø Can also maximize throughput


Query Optimization Issues – Types of Optimizers
Ø Exhaustive search
o   cost-based
o   optimal
o   combinatorial complexity in the number of relations


Ø Heuristics
o   not optimal
o   regroup common sub-expressions
o   perform selection, projection first
o   replace a join by a series of semijoins
o   reorder operations to reduce intermediate relation size
o   optimize individual operations

Optimization Granularity
Ø Single query at a time
o   cannot use common intermediate results
Ø Multiple queries at a time
o   efficient if many similar queries
o   decision space is much larger
Optimization Timing
Ø Static
o   compilation Þ optimize prior to the execution
o   difficult to estimate the size of the intermediate results  Þ error propagation
o   can amortize over many executions
o   R*
Ø Dynamic
o   run time optimization
o   exact information on the intermediate relation sizes
o   have to reoptimize for multiple executions
o   Distributed INGRES
Ø Hybrid
o   compile using a static algorithm
o   if the error in estimate sizes > threshold, reoptimize at run time
o   MERMAID
Statistics
Ø Relation
o   cardinality
o   size of a tuple
o   fraction of tuples participating in a join with another relation
Ø Attribute
o   cardinality of domain
o   actual number of distinct values
Ø Common assumptions
o   independence between different attribute values
o   uniform distribution of attribute values within their domain
Decision Sites
Ø Centralized
o   single site determines the “best” schedule
o   simple
o   need knowledge about the entire distributed database
Ø Distributed
o   cooperation among sites to determine the schedule
o   need only local information
o   cost of cooperation
Ø Hybrid
o   one site determines the global schedule
Ø each site optimizes the local subqueries
Network Topology
Ø Wide area networks (WAN) – point-to-point
o   characteristics
§  low bandwidth
§  low speed
§  high protocol overhead
o   communication cost will dominate; ignore all other cost factors
o   global schedule to minimize communication cost
o   local schedules according to centralized query optimization
Ø Local area networks (LAN)
o   communication cost not that dominant
o   total cost function should be considered
o   broadcasting can be exploited (joins)
o   special algorithms exist for star networks

Step 1 – Query Decomposition
§  Input :  Calculus query on global relations
Ø Normalization
o   manipulate query quantifiers and qualification
Ø Analysis
o   detect and reject “incorrect” queries
o   possible for only a subset of relational calculus
Ø Simplification
o   eliminate redundant predicates
Ø Restructuring
o   calculus query Þ algebraic query
o   more than one translation is possible
o   use transformation rules
Step 2 – Data Localization
Ø Input:  Algebraic query on distributed relations
Ø Determine which fragments are involved
Ø Localization program
o   substitute for each global query its materialization program
o   optimize

Step 3 – Global Query Optimization
Ø Input:  Fragment query
Ø Find the best (not necessarily optimal) global schedule
o   Minimize a cost function
o   Distributed join processing
§  Bushy vs. linear trees
§  Which relation to ship where?
§  Ship-whole vs ship-as-needed
o   Decide on the use of semijoins
§  Semijoin saves on communication at the expense of more local processing.
o   Join methods
§  nested loop vs ordered joins (merge join or hash join)

Centralized Query Optimization
Ø INGRES
o   dynamic
o   interpretive
Ø System R
o   static
o   exhaustive search



12 Rules of DDBMS (Date, 1987)
1.     Local autonomy
2.     No reliance on a central site
3.     Continuous operation
4.     Location independence
5.     Fragmentation independence
6.     Replication independence
7.     Distributed Query processing
8.     Distributed transaction processing
9.     Hardware independence
10.                        Operating System independence
11.                        Network independence
12.                        Database independence

1.1 DISTRIBUTED DATABASES VS CONVENTIONAL DATABASES




Ø mimics organisational structure with data
Ø local access and autonomy without exclusion
Ø cheaper to create and easier to expand
Ø improved availability/reliability/performance by removing reliance on a central site
Ø Reduced communication overhead
§  Most data access is local, less expensive and performs better
Ø Improved processing power
§  Many machines handling the database rather than a  single server

Ø more complex to implement
Ø   more costly to maintain
Ø security and integrity control
Ø standards and experience are lacking
Ø Design issues are more complex


1.2 DISTRIBUTED   DATABASES ARCHITECTURE 

Ø Defines the structure of the system
o   components identified
o   functions of each component defined
o   interrelationships and interactions between components defined

What is a distributed database?

l“A logically interrelated collection of shared data (and a description of this   data), physically distributed over a computer network”
Implicit Assumptions

Ø Data stored at a number of sites ï each site logically consists of a single processor.
Ø Processors at different sites are interconnected by a computer network ï no multiprocessors
o   parallel database systems
Ø Distributed database is a database, not a collection of files ï data logically related as exhibited in the users’ access patterns
o   relational data model
Ø D-DBMS is a full-fledged DBMS
o   not remote file system, not a TP system

Advantages of distributed databases
Ø Capacity and incremental growth
Ø Increase reliability and availability
Ø Modularity
Ø Reduced communication overhead
Ø Protection of valuable data
Ø Efficiency and Flexibility

Disadvantages of distributed databases
Ø DDB design more complex, fragmentation & replication; extra work must be done by the DBAs to ensure that the distributed nature of the system is transparent.
Ø Economics,
Ø Concurrency control,
Ø Inexperience,
Ø Security,
Ø Difficult to maintain integrity

Applications
Ø Manufacturing - especially multi-plant manufacturing
Ø Military command and control

Ø Electronic fund transfers and electronic trading
Ø Corporate MIS
Ø Airline restrictions
Ø Hotel chains
Ø Any organization which has a decentralized organization structure