
Monday, November 17, 2014

Improving Database Performance in a Multithreading Environment-1

Introduction:

     Today many database systems run in multithreaded environments. In a multithreaded environment the most important and critical issue is to obtain and maintain good performance while still allowing successful concurrent access to resources. In this article we'll talk about two points related to database performance: the "Concept of Denormalization" and "Join Methods and Techniques". In future articles we'll be talking about the other aspects that affect database performance.

Concept of Denormalization:

     Normalization is the process of reorganizing tables and fields to minimize redundancy and tight dependency. This optimizes update cost, since each value is stored in only one place and you only need to update that place. However, retrieving many different but related values usually requires going to many different places, which slows down retrieval. Conceptual diagrams, either entity-relationship or object oriented, are a precursor to designing relational table structures. Many database systems are designed such that the tables are at least in third normal form (3NF) based on conceptual models, but the more you normalize, the more levels of joins you will need to reach the data you want. As cheaper client-server platforms replace powerful mainframes, performance can suffer from this normalization. One critical fact is the need to continue to provide a satisfactory level of system performance, usually reflected by the system's response time, for online transactional systems. A fully normalized database schema can fail to provide adequate system response time due to excessive table join operations.

How to Perform Denormalization:

     When denormalization takes place you must consider both good system response time and avoiding the various anomalies and problems associated with denormalized table structures. Before denormalizing you have to analyze the critical transactions in detail. This analysis should include the specification of primary and secondary access paths for the tables that comprise the end-user view of the database. Furthermore, denormalization should only be considered under conditions that allow designers to collect detailed performance measurements for comparison with system performance requirements. Relational database theory provides guidelines for achieving an idealized representation of data and data relationships. In contrast, client-server systems require a physical database design that optimizes performance for specific target systems under less than ideal conditions. The final database schema should be adjusted for characteristics of the environment such as hardware, software and other constraints.

The general steps of denormalization could be the following:

1.) Analyze the queries that require a lot of join operations to get the related data. As a rule of thumb, any query that requires more than three joins should be considered a candidate for denormalization. You also need to measure system performance by simulating the production environment to prove the need for denormalization.

2.) You should try to reduce the number of foreign keys in order to reduce the index maintenance performed during insert, update and delete operations. Reducing foreign keys usually means reducing the number of relational tables.

3.) Data integrity, which is a main purpose of normalization, should also be preserved by the denormalized schema. Therefore, the denormalized approach shouldn't require excessive programming effort, such as triggers, to maintain data integrity and consistency.

A Denormalization Example:

     We can examine denormalization from 3NF to 2NF with the following example. The relationship between the Customer and Order entities is one-to-many (a single customer can place many orders, but an order is normally associated with one and only one customer). The 3NF solution for the Customer and Order tables is given below, along with a denormalized 2NF solution.

3NF:
CUSTOMER (CustomerId, CustomerName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, CustomerId)

Denormalized 2NF:
CUSTOMER (CustomerId, CustomerName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, CustomerId, CustomerName)

Here we see that CustomerId in the Order table is a foreign key relating the Order and Customer tables. Denormalizing the table structures by duplicating CustomerName in the Order table results in a solution that is only 2NF, because the non-key CustomerName field is determined by the non-key CustomerId field. If we analyze the denormalized situation we may discover that most end-user requests require the customer name, not the customer identification number; yet the 3NF solution requires joining the Customer and Order tables to obtain it.
When we compare the 3NF solution with the denormalized 2NF solution, we see that the customer name can easily be recorded in the denormalized Order table at the time the order transaction takes place. In the denormalized form we may question the need to maintain consistency of the data between the Order and Customer tables. In this situation the requirement to support customer name changes is very small, occurring, for example, only when a customer changes her name due to marriage. Furthermore, whether to update the Order table in such a situation is a decision for management to make, and such data maintenance may even be ignored, since the important issue for customers usually revolves around whether or not they are charged correctly, and the denormalized 2NF solution supports that as well as the production of standard end-user views of the database.
It's crucial to remember that denormalization is implemented for performance reasons in the first place. If the environment changes, the denormalization decision should be reconsidered. It's also possible that, given a changing hardware and software environment, denormalized tables end up causing performance degradation instead of performance gains.
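The two access paths above can be sketched with SQLite in Python. This is a minimal illustration, not production code: the table is named OrderTbl because ORDER is a reserved word in SQL, and the sample data is invented.

```python
import sqlite3

# Sketch of the example schemas above using SQLite. "OrderTbl" stands
# in for ORDER (a reserved word in SQL); the sample data is invented.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, "
            "CustomerName TEXT)")
# Denormalized 2NF order table: CustomerName is duplicated here.
cur.execute("CREATE TABLE OrderTbl (OrderId INTEGER PRIMARY KEY, "
            "OrderDate TEXT, DeliveryDate TEXT, Amount REAL, "
            "CustomerId INTEGER, CustomerName TEXT)")
cur.execute("INSERT INTO Customer VALUES (1, 'Alice')")
cur.execute("INSERT INTO OrderTbl VALUES "
            "(100, '2014-11-01', '2014-11-05', 49.90, 1, 'Alice')")

# 3NF access path: a join is needed to get the customer name.
normalized = cur.execute(
    "SELECT o.OrderId, c.CustomerName FROM OrderTbl o "
    "JOIN Customer c ON o.CustomerId = c.CustomerId").fetchone()

# Denormalized access path: the name is read directly from the order row.
denormalized = cur.execute(
    "SELECT OrderId, CustomerName FROM OrderTbl").fetchone()

print(normalized, denormalized)  # both yield (100, 'Alice')
```

Both queries return the same row, but the denormalized one does so without touching the Customer table at all.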

Join Methods and Techniques:

     In a query-based application, the need to join two or more tables is one of the most performance-degrading operations. Generally, when joining tables the Database Management System first chooses one of the tables, called the "outer table", and brings it into memory for quick access. This outer table is then prepared for the join operation and finally combined with the second chosen table, called the "inner table". If there are more than two tables to join, the same operation is repeated using the result of the first join and the chosen third table.
When designing and programming a database application, one should take the following techniques into account.

  • Choose the smaller table as the outer table: Since the outer table is brought into memory, a smaller table can be loaded at lower cost. The bigger table, the inner table, will also be accessed less, since it is accessed only for the qualifying rows of the outer table.
  • Choose as the outer table the one that can use the most selective predicate in the WHERE clause: That way, the inner table will be probed only for the rows of the outer table that satisfy the predicate.
  • Choose as the outer table the one with fewer duplicate rows.
  • Choose as the inner table the one that can be accessed using an index lookup: An index lookup means we don't need to go to the table itself; all the required information comes from the index. Today many Database Management Systems keep indexes in memory. As mentioned, the outer table is also brought into memory during the initial phase of the join. In that case both the outer and the inner table are in memory, which greatly improves performance.
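The effect of the first guideline can be illustrated with a toy cost model in Python. This is an invented illustration, not a real optimizer formula: it charges one unit to load each outer row into memory and one index probe into the inner table per qualifying outer row.

```python
def join_cost(outer, probe_cost, qualifies):
    """Toy cost model (invented for illustration): load the whole
    outer table into memory, then probe the inner table's index once
    per qualifying outer row."""
    load_cost = len(outer)
    probes = sum(1 for row in outer if qualifies(row))
    return load_cost + probes * probe_cost

small = list(range(10))       # 10-row table
big = list(range(10_000))     # 10,000-row table

# Small table as outer: cheap to load, few probes into the big table.
cost_small_outer = join_cost(small, 1, lambda r: r < 5)   # 10 + 5 = 15
# Big table as outer: expensive to load before probing even starts.
cost_big_outer = join_cost(big, 1, lambda r: r < 5)       # 10000 + 5 = 10005
```

Under this crude model, putting the small, selective table on the outside is orders of magnitude cheaper, which matches the guidelines above.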
We can now look at the join methods. Today's database systems usually use three main join methods.
      

1.) Nested Loop Join (NLJ):

     With the NLJ, a qualifying row is identified in the outer table, and then the inner table is scanned searching for a match. A qualifying row is one in which the predicates for the columns of the table match. When the scan of the inner table is complete, another qualifying row in the outer table is identified. The inner table is scanned for a match again, and so on. The repeated scanning of the inner table is usually accomplished with an index to minimize I/O cost.
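A minimal nested loop join can be sketched in Python as follows; the table data and predicate are invented for illustration.

```python
def nested_loop_join(outer, inner, pred):
    """Nested loop join sketch: for each row of the outer table, scan
    the inner table looking for rows that satisfy the join predicate.
    (Real systems usually replace the inner scan with an index probe.)"""
    result = []
    for o in outer:            # one pass over the outer table
        for i in inner:        # repeated scans of the inner table
            if pred(o, i):
                result.append((o, i))
    return result

orders = [(100, 1), (101, 2)]           # (OrderId, CustomerId)
customers = [(1, "Alice"), (2, "Bob")]  # (CustomerId, CustomerName)
rows = nested_loop_join(orders, customers, lambda o, c: o[1] == c[0])
```

Note how the inner list is traversed once per outer row; this is exactly the repeated scan that an index is meant to cheapen.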

2.) Merge Join (MJ):

     With the MJ, the tables of the join operation need to be ordered by the join predicates. That means each table must be accessed in order by the columns that specify the join criteria. This ordering can be the result of either a sort or indexed access. After ensuring that both the outer and the inner table are properly sequenced, each table is read sequentially and the join columns are matched up. Neither table is read more than once during a merge join.
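The single-pass property can be seen in a Python sketch; to keep it short, this version assumes the join keys are unique on each side (the data is invented).

```python
def merge_join(left, right, key_left, key_right):
    """Merge join sketch: both inputs must already be sorted on the
    join column (via a sort or an index). With unique join keys on
    each side, neither input is read more than once."""
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key_left(left[i]), key_right(right[j])
        if kl == kr:
            result.append((left[i], right[j]))
            i += 1
            j += 1
        elif kl < kr:
            i += 1          # left key is behind: advance left
        else:
            j += 1          # right key is behind: advance right
    return result

# Both inputs sorted by CustomerId, as the merge join requires.
orders = [(100, 1), (101, 3)]                     # (OrderId, CustomerId)
customers = [(1, "Alice"), (2, "Bob"), (3, "Carol")]
matched = merge_join(orders, customers, lambda o: o[1], lambda c: c[0])
```

Each index only ever moves forward, which is why each table is read exactly once.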

3.) Hash Join (HJ):

     A hash join requires one or more predicates of the form table1.ColX = table2.ColY, where the column types are the same. The inner table is scanned and its rows are copied into memory buffers drawn from the sort heap allocation. The memory buffers are divided into partitions based on a "hash code" computed from the column(s) of the join predicate(s). If the size of the inner table exceeds the available sort heap space, buffers from selected partitions are written to temporary tables. After the inner table is processed, the outer table is scanned and its rows are matched to the inner table rows by comparing the hash codes. Hash joins can require a significant amount of memory. Therefore, for the hash join to produce real performance benefits, you may need to change the values of the database configuration parameters related to memory management.
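The build-and-probe structure can be sketched in Python; partitioning and spilling to temporary tables are omitted, and the data is invented.

```python
def hash_join(inner, outer, key_inner, key_outer):
    """Hash join sketch: build an in-memory hash table over the inner
    table, then probe it with each outer row. (The partitioning and
    spilling to temporary tables described above are omitted.)"""
    buckets = {}
    for row in inner:                       # build phase
        buckets.setdefault(key_inner(row), []).append(row)
    result = []
    for row in outer:                       # probe phase
        for match in buckets.get(key_outer(row), []):
            result.append((row, match))
    return result

customers = [(1, "Alice"), (2, "Bob")]      # smaller table -> inner
orders = [(100, 1), (101, 1), (102, 2)]     # (OrderId, CustomerId)
joined = hash_join(customers, orders, lambda c: c[0], lambda o: o[1])
```

The build phase is why the optimizer prefers the smaller table as the inner table: the whole hash table must fit in memory buffers.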

     How do we know which of these join methods should be used? In general, the nested loop join is preferred in terms of execution cost when a small number of rows qualify for the join. As the number of qualifying rows increases, the merge join becomes a better choice. Finally, in the case of a hash join, the inner table is kept in memory buffers. If there are too few memory buffers, the hash join may fail. The optimizer attempts to avoid this, and so picks the smaller of the two tables as the inner table and the larger one as the outer table.
     The results of these performance generalizations will depend on the exact number of qualifying rows as well as other factors, such as your database design, database organization, accuracy of statistics, type of hardware and the setup of your environment. Today's database systems choose the join technique automatically with the help of an optimizer. However, depending on the situation, one can change the join method by changing the query and table definitions to obtain better performance.

Conclusion:

      In this article we examined database application performance in a multithreaded environment in terms of two points. There are many aspects that affect the performance of such an application. Normalization is one of the most critical issues in database design. However, too much normalization causes many join operations and negatively affects performance, so we should try to denormalize tables without breaking data integrity. Even if we denormalize some tables, there may still be normalized tables that we have to join for some requirements. When we join tables, at least one of the tables is brought into memory, and choosing that table is crucial. There are rules to consider as we join tables, such as bringing the smaller table into memory first in an NLJ. If we rewrite our SQL considering these facts, the DBMS will get the results quickly by using the appropriate join method. In the next article we'll be talking about "Minimizing Contention" in a multithreaded environment.

References and Resources:

- John Goodson and Robert A. Steward, The Data Access Handbook: Achieving Optimal Database Application Performance and Scalability, Prentice Hall, Chapters 1-2 (2009)

- Douglas B. Bock and John F. Schrage (Department of Computer Management and Information Systems, Southern Illinois University Edwardsville), "Benefits of Denormalized Relational Database Tables", Proceedings of the Decision Sciences Institute, Orlando, Florida, November 1996

- Craig Mullins, Tuning DB2 SQL Access Paths, http://www.ibm.com/developerworks/data/library/techarticle/0301mullins/0301mullins.html
