
Showing posts with label database performance. Show all posts

Monday, November 17, 2014

Improving Database Performance in a Multithreading Environment-1

Introduction:

     Today many database systems run in multithreaded environments. In a multithreaded environment, the most important and critical issue is to obtain and maintain good performance while still allowing successful concurrent access to resources. In this post we'll talk about two points related to database performance: the concept of denormalization, and join methods and techniques. In future posts we'll talk about other aspects that affect database performance.

Concept of Denormalization:

     Normalization is the process of reorganizing tables and fields to minimize redundancy and tight dependency. This optimizes update cost, since each value is stored in only one place and you only need to update that place. However, retrieving many different but related values usually requires going to many different places, which slows retrieval. Conceptual diagrams, either entity-relationship or object-oriented, are a precursor to designing relational table structures. Many database systems are designed so that the tables are at least in third normal form (3NF) based on conceptual models, but the more you normalize, the more levels you need to traverse to reach the data you want. As cheap client-server platforms replace powerful mainframes, normalization takes a larger toll on performance. One critical fact is the need to continue to provide a satisfactory level of system performance, usually reflected by the system's response time, for online transactional systems. A fully normalized database schema can fail to provide adequate response time due to excessive table join operations.

How to Perform Denormalization:

     When denormalization takes place, you must consider both good system response time and avoiding the various anomalies and problems associated with denormalized table structures. Before denormalizing, you have to analyze critical transactions in detail. This analysis should include the specification of primary and secondary access paths for the tables that comprise the end-user view of the database. Furthermore, denormalization should only be considered under conditions that allow designers to collect detailed performance measurements for comparison with system performance requirements. Relational theory provides guidelines for achieving an idealized representation of data and data relationships. Client-server systems, on the contrary, require a physical database design that optimizes performance for specific target systems under less than ideal conditions. The final database schema should be adjusted for characteristics of the environment, such as hardware, software and other constraints.

The general steps of denormalization could be the following:

1.) Analyze the queries that require a lot of join operations to get related data. We may say that any query requiring more than three joins should be considered a candidate for denormalization. You also need to measure system performance by simulating the production environment to prove the need for denormalization.

2.) Try to reduce the number of foreign keys in order to reduce index maintenance during insert, update and delete operations. Reducing foreign keys means reducing the number of related tables.

3.) Data maintenance, which is the main purpose of normalization, should also be provided by the denormalized schema. Therefore, the denormalized approach shouldn't require excessive programming effort, such as triggers, to maintain data integrity and consistency.

A Denormalization Example:

     We can examine denormalization from 3NF to 2NF with the following example. The relationship between the Customer and Order entities is one-to-many (a single customer can place many orders, but an order is normally associated with one and only one customer). The 3NF solution for the Customer and Order tables is given below, along with a denormalized 2NF solution.

3NF:
CUSTOMER (CustomerId, CustomerName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, CustomerId)

Denormalized 2NF:
CUSTOMER (CustomerId, CustomerName,...)
ORDER (OrderId, OrderDate, DeliveryDate, Amount, CustomerId, CustomerName)

Here we see that CustomerId in the Order table is a foreign key relating the Order and Customer tables. Denormalizing the table structures by duplicating CustomerName in the Order table results in a solution that is only 2NF, because the non-key CustomerName field is determined by the non-key CustomerId field. If we analyze the denormalized situation, we may discover that most end-user requests ask for the customer name, not the customer identification number, yet the 3NF solution requires joining the Customer and Order tables to get it.
When we compare the 3NF solution with the denormalized 2NF solution, we see that the customer name can easily be recorded in the denormalized Order table at the time the order transaction takes place. In the denormalized form we may question the need to maintain consistency between the Order and Customer tables. In this situation, the requirement to support name changes for customers is very small, occurring only when a customer changes her name, for example due to marriage. Furthermore, whether to update the Order table in such a situation is a decision for management to make, and such data maintenance may be ignored, since the important issue for customers usually revolves around whether their orders are handled correctly, and the denormalized 2NF solution supports order processing as well as the production of standard end-user views of the database.
It's crucial to remember that denormalization is implemented for performance reasons in the first place. If the environment changes, the denormalization decision should be reconsidered. It's also possible that, given a changing hardware and software environment, denormalized tables end up causing performance degradation instead of performance gains.
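The lookup cost that the denormalized Order row avoids can be sketched as a toy Java program. This is purely illustrative: the record and map names are made up, and a HashMap stands in for the Customer table.

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of the lookup the denormalized row avoids. The record and
// map names are hypothetical; a HashMap stands in for the Customer table.
public class DenormalizationDemo {
    record Order3NF(int orderId, int customerId) {}
    record Order2NF(int orderId, int customerId, String customerName) {}

    public static void main(String[] args) {
        Map<Integer, String> customers = new HashMap<>();
        customers.put(7, "Alice");

        // 3NF: getting the name needs a second lookup (the "join")
        Order3NF normalized = new Order3NF(100, 7);
        String nameVia3NF = customers.get(normalized.customerId());

        // denormalized 2NF: the name travels with the order row
        Order2NF denormalized = new Order2NF(100, 7, "Alice");
        String nameVia2NF = denormalized.customerName();

        System.out.println(nameVia3NF.equals(nameVia2NF)); // prints true
    }
}
```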

Join Methods and Techniques:

     In a query-based application, joining two or more tables is one of the most performance-degrading operations. Generally, when joining tables, the database management system first chooses one of the tables, called the "outer table", and brings it into memory for quick access. This outer table is then prepared for the join operation and finally combined with the second chosen table, called the "inner table". If there are more than two tables to join, the same operation is repeated using the result of the first join and the chosen third table.
When designing and programming a database application, one should take the following techniques into account.

  • Choose the smaller table as the outer table: since the outer table is brought into memory, a smaller table costs less to load. The bigger table, the inner one, is also accessed less, since it is accessed only for the qualifying rows of the outer table.
  • Choose as the outer table the one that can use the selective predicate in the WHERE clause: that way, the inner table is accessed only for the rows of the outer table that satisfy the predicate.
  • Choose as the outer table the one with fewer duplicate rows.
  • Choose as the inner table the one that can be accessed with an index lookup: an index lookup means we don't need to go to the table at all and can get all the information from the index. Today many database management systems keep indexes in memory. As we mentioned, the outer table is also brought into memory during the initial phase of the join, so in that case both the outer and the inner table are in memory, which greatly improves performance.
We can now look at the join methods. Today's database systems usually use three main join methods.
      

1.) Nested Loop Join (NLJ):

     With the NLJ, a qualifying row is identified in the outer table, and then the inner table is scanned searching for a match. A qualifying row is one in which the predicates for the columns in the table match. When the scan of the inner table is complete, another qualifying row in the outer table is identified, the inner table is scanned for a match again, and so on. The repeated scanning of the inner table is usually accomplished with an index to minimize I/O cost.
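The scanning pattern can be sketched in plain Java. This is a toy illustration, assuming tables are simple lists of {key, value} rows, not any real DBMS structure:

```java
import java.util.ArrayList;
import java.util.List;

// Toy nested loop join: tables are lists of {key, value} int pairs.
public class NestedLoopJoin {
    static List<int[]> join(List<int[]> outer, List<int[]> inner) {
        List<int[]> result = new ArrayList<>();
        for (int[] o : outer) {          // one pass over the outer table
            for (int[] i : inner) {      // repeated scan of the inner table
                if (o[0] == i[0]) {      // join predicate: keys are equal
                    result.add(new int[] { o[0], o[1], i[1] });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> outer = List.of(new int[] { 1, 10 }, new int[] { 2, 20 });
        List<int[]> inner = List.of(new int[] { 2, 200 }, new int[] { 3, 300 });
        System.out.println(join(outer, inner).size()); // prints 1: only key 2 matches
    }
}
```

A real DBMS would replace the inner `for` loop with an index probe, which is exactly why an index on the inner table matters so much for this method.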

2.) Merge Join (MJ):

     With the MJ, the tables of the join operation need to be ordered by the join predicates. That means each table must be accessed in order by the columns that specify the join criteria. This ordering can be the result of either a sort or indexed access. After ensuring that both the outer and the inner table are properly sequenced, each table is read sequentially and the join columns are matched up. Neither table is read more than once during a merge join.
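The single-pass matching can be sketched with two cursors. This toy version assumes both inputs are already sorted by the join key and that keys are unique, so each table really is read exactly once:

```java
import java.util.ArrayList;
import java.util.List;

// Toy merge join: both inputs sorted by key, unique keys assumed.
public class MergeJoin {
    static List<int[]> join(List<int[]> left, List<int[]> right) {
        List<int[]> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int l = left.get(i)[0], r = right.get(j)[0];
            if (l == r) {                 // keys match: emit a joined row
                result.add(new int[] { l, left.get(i)[1], right.get(j)[1] });
                i++;
                j++;
            } else if (l < r) {
                i++;                      // advance the side with the smaller key
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> left = List.of(new int[] { 1, 10 }, new int[] { 2, 20 }, new int[] { 4, 40 });
        List<int[]> right = List.of(new int[] { 2, 200 }, new int[] { 4, 400 });
        System.out.println(join(left, right).size()); // prints 2: keys 2 and 4 match
    }
}
```

With duplicate join keys, a real merge join must re-read groups of matching rows, which this sketch deliberately leaves out.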

3.) Hash Join (HJ):

     A hash join requires one or more predicates of the form table1.ColX = table2.ColY, where the column types are the same. The inner table is scanned and its rows are copied into memory buffers drawn from the sort heap allocation. The memory buffers are divided into partitions based on a "hash code" computed from the column(s) of the join predicate(s). If the size of the first table exceeds the available sort heap space, buffers from selected partitions are written to temporary tables. After processing the inner table, the outer table is scanned and its rows are matched to the inner table rows by comparing the "hash code". Hash joins can require a significant amount of memory, so for a hash join to produce real performance benefits, you may need to change the value of the database configuration parameters related to memory management.
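The build-then-probe structure can be sketched as follows. This toy version keeps everything in one in-memory hash table; the partitioning and spilling to temporary tables described above are omitted:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy hash join: build a hash table on the inner table, probe with the outer.
// Partitioning and spilling to temporary tables are omitted.
public class HashJoin {
    static List<int[]> join(List<int[]> outer, List<int[]> inner) {
        // build phase: hash the inner table on the join column
        Map<Integer, List<int[]>> buckets = new HashMap<>();
        for (int[] row : inner) {
            buckets.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // probe phase: scan the outer table and look up matching buckets
        List<int[]> result = new ArrayList<>();
        for (int[] o : outer) {
            for (int[] i : buckets.getOrDefault(o[0], List.of())) {
                result.add(new int[] { o[0], o[1], i[1] });
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<int[]> outer = List.of(new int[] { 1, 10 }, new int[] { 2, 20 }, new int[] { 3, 30 });
        List<int[]> inner = List.of(new int[] { 2, 200 }, new int[] { 3, 300 });
        System.out.println(join(outer, inner).size()); // prints 2: keys 2 and 3 match
    }
}
```

The sketch makes the memory appetite visible: the whole inner table lives in the `buckets` map, which is why the optimizer prefers the smaller table as the inner one.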

     How do we know which of these join methods should be used? In general, the nested loop join is preferred in terms of execution cost when a small number of rows qualify for the join. As the number of qualifying rows increases, the merge join becomes a better choice. Finally, in the case of a hash join, the inner table is kept in memory buffers. If there are too few memory buffers, the hash join may fail. The optimizer attempts to avoid this by picking the smaller of the two tables as the inner table and the larger one as the outer table.
     Results of these performance generalizations will depend on the exact number of qualifying rows as well as other factors such as your database design, database organization, accuracy of statistics, type of hardware and the setup of your environment. Today's database systems choose the join technique automatically with the help of an optimizer. However, depending on the situation, one can change the join method by changing the query and table definitions to obtain better performance.

Conclusion:

      In this post, we examined database application performance in a multithreaded environment in terms of two points. There are many aspects that affect the performance of such an application. Normalization is one of the most critical issues in database design. However, too much normalization causes many join operations and negatively affects performance, so we have to try to denormalize the tables without breaking data maintenance. Even if we denormalize some tables, there may still be normalized tables that we have to join for some requirements. When we join tables, not only the expected results but also at least one of the tables is brought into memory, and choosing that table is crucial. As we join tables, there are rules to consider, like bringing the smaller table into memory first in an NLJ. If we rewrite our SQL with these facts in mind, the DBMS will get the results quickly by using the appropriate join method. In the next post we'll be talking about minimizing contention in a multithreaded environment.

References and Resources:

- John Goodson and Robert A. Steward, The Data Access Handbook: Achieving Optimal Database Application Performance and Scalability, Prentice Hall, Chapters 1-2 (2009)

- Douglas B. Bock and John F. Schrage (Department of Computer Management and Information Systems, Southern Illinois University Edwardsville), "Benefits of Denormalized Relational Database Tables", Proceedings of the 1996 Decision Sciences Institute, Orlando, Florida, November 1996

- Craig Mullins, "Tuning DB2 SQL Access Paths", http://www.ibm.com/developerworks/data/library/techarticle/0301mullins/0301mullins.html

Improving Database Performance in a Multithreading Environment-3

Introduction:

     We discussed "Minimizing Contention" as a way of improving database performance in the second part of this paper [2]. Now we continue with another concept, "Insert-Only Systems and Optimizing Insert Performance", that one may consider when trying to improve database performance. Update and delete operations are very likely to cause locks. If we design an insert-only system, don't choose the "Serializable" isolation level, and choose to read uncommitted data, we'll get no locks in our system and have the highest concurrency. However, insert-only systems may cause other problems. One of them is that, since every operation like update and delete is done using an insert, the amount of data in the database will increase rapidly. If we don't account for that, the huge amount of data can degrade performance. To solve that issue we have to archive our data properly and promptly. We can also use partitioning, which decreases the amount of data being operated on. Another problem with insert-only systems appears when an application has a very high insert rate. The reason is that almost all tables generally have a clustered index, in which the order of the data matches the table. When an insert request arrives, the DBMS tries to find the correct place for the new data to preserve the clustering order, which causes a search operation. If there is too much data to search, it will slow down our application.

Now we'll mention some methods to optimize insert performance.

Insert Performance Issues:

1.) Indexes on tables are usually created for performance reasons, to decrease search time when selecting data. However, every new index added to a table also has a negative impact, which appears when an insert, update or delete operation is done: with these DML operations every index must be maintained. Thus, one could try to decrease the number of indexes on the tables.
Also, most indexes in today's DBMS systems are B-tree indexes, and if the number of levels in the index tree or the number of pages at each level is high, we will probably see performance degradation. To address this, many DBMS systems can keep indexes in memory, so we should try to keep the affected indexes, or at least the non-leaf pages, which are about 1% of total index pages, in memory by changing the cache memory size.

2.) If inserts arrive in sequential order in our insert-only system, we can apply methods that insert at the end of the table without searching for a place that satisfies the clustering rule. This way, one can eliminate read I/O, since no search is done, and reduce write I/O, since all the data is written to the same place.

3.) Inserting rows into empty buffers, that is, into memory, and scheduling the disk write for a later time is a good option. This way, disk I/O is deferred and the DBMS can do the I/O when the system is not heavily loaded.
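The deferred-write idea can be sketched in a few lines of Java. This is a rough sketch only: the class and method names are made up, and a plain list stands in for the disk-resident table.

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of deferred writes: rows accumulate in a memory buffer and are
// flushed later; a plain list stands in for the disk-resident table.
public class WriteBehindBuffer {
    private final List<String> buffer = new ArrayList<>(); // in-memory rows
    private final List<String> table = new ArrayList<>();  // stands in for disk

    void insert(String row) {
        buffer.add(row);          // no disk I/O at insert time
    }

    void flush() {
        table.addAll(buffer);     // the deferred "disk write", run when load is low
        buffer.clear();
    }

    int tableSize() {
        return table.size();
    }

    public static void main(String[] args) {
        WriteBehindBuffer db = new WriteBehindBuffer();
        db.insert("row1");
        db.insert("row2");
        System.out.println(db.tableSize()); // prints 0: nothing written yet
        db.flush();
        System.out.println(db.tableSize()); // prints 2
    }
}
```

In a real system the flush would be triggered by a background thread or by the DBMS's own buffer manager, and durability would have to come from the log.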

4.) A DBMS always keeps logs for recovery reasons. We have to minimize those log records, since they add another I/O cost to our system. To do this, we may have to take different precautions. For example, in DB2 we can put frequently changed, variable-length columns at the end of the table, because a variable-length row is logged from the first changed byte to the end of the row when an update operation is done on it (DB2 release 9 solves this by automatically placing variable columns at the end of the table). We can also decrease log size by changing some parameters of the DBMS; for example, in DB2 the LOAD utility's LOG NO option means no log is created when data is loaded.

5.) Another technique is batch inserts, which means there is no need to go to the database for every DML operation (update, delete or insert) in a transaction. Instead, all the operations are finished in one round trip. As an example, this can be done with the Java prepared statement batch option; see the post "Using Batch Statements In Java" below.

6.) Increasing the commit frequency is usually the preferred way to minimize lock wait time.

Conclusion:

     In this post we examined database application performance in a multithreaded environment in terms of insert-only systems used to prevent lock issues. If we can design our system with a good archive mechanism, we may use an insert-only system, preventing locks and increasing concurrent access.

References and Resources:

- Guy Harrison, Oracle® Performance Survival Guide: A Systematic Approach to Database Optimization, Prentice Hall, Parts IV-V (2009)

Friday, October 10, 2014

Using Batch Statements In Java

Introduction:

     Using batch statements is closely related to prepared statement usage, and it also has a huge performance benefit. If you need to insert, update or delete on more than one table, or on one table more than once, within the same connection, you should use batching. By using batching, you go to the database only once. If you use more than one table in a batch statement it is called a heterogeneous batch; otherwise it is called a homogeneous batch. In a heterogeneous batch you have to use the addBatch(String sql) overloaded method of the Statement object and you cannot use a prepared statement. Below is an example of homogeneous batching.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

String CONN_STRING = "jdbc:mysql://localhost:3306/DB_NAME?user=USER&password=PWD";
String UPDATE_SQL = "UPDATE User SET age=? WHERE name=?";
Connection conn = null;
PreparedStatement pstmt = null;
try {
    Integer[] ages = new Integer[] { 5, 10, 15 };
    String[] names = new String[] { "xxx", "yyy", "zzz" };
    Class.forName("com.mysql.jdbc.Driver");
    conn = DriverManager.getConnection(CONN_STRING);
    conn.setAutoCommit(false);
    pstmt = conn.prepareStatement(UPDATE_SQL);
    for (int i = 0; i < ages.length; i++) {
        pstmt.setInt(1, ages[i]);
        pstmt.setString(2, names[i]);
        pstmt.addBatch();        // queue the parameter set instead of executing
    }
    pstmt.executeBatch();        // one round trip for all queued updates
    conn.commit();
} catch (Exception e) {
    try {
        if (conn != null && !conn.isClosed())
            conn.rollback();
        // TODO logging...
    } catch (SQLException e1) {
        // TODO logging...
    }
} finally {
    try {
        if (pstmt != null)
            pstmt.close();
        if (conn != null && !conn.isClosed())
            conn.close();
    } catch (SQLException e) {
        // TODO logging...
    }
}

Some important points about using batch statements:
  • You can get the number of rows affected by each SQL statement from the array that the executeBatch invocation returns. This number does not include rows that were affected by triggers or by referential integrity enforcement.
  • Even if autocommit mode is on, all the data is committed to the database only after executeBatch executes.
  • When you use batching, you go to the database only once. However, the database engine doesn't necessarily execute all the queries in one process. For example, in DB2 you have to enable the multi-row insert property to execute all batched queries in one process. In this case executeBatch will not return the affected row counts, but Statement.SUCCESS_NO_INFO (-2) instead. This gives better performance, since the database engine doesn't need to return the affected row counts and the whole batch is executed in one process.
  • If you have heavy insert or update loads, you can apply the inserts or updates in memory and write them to the database later with an asynchronous background thread.

Using a single SQL statement instead of a batch:

     You can use a single SQL statement for insert operations instead of a batch statement. This way you do a multi-row insert with one SQL statement, and, again using a prepared statement, you can pass your parameters. A single SQL statement executed about 10 times faster than batch statements on my machine, since the database engine executes it in a single process, like DB2 with multi-row insert enabled (mentioned above under batch statements).
For update or delete operations, you can use the IN keyword to handle multiple rows with a single SQL statement.
      
Oracle example:
INSERT ALL
  INTO test (id, name) VALUES (1, 'xxx')
  INTO test (id, name) VALUES (2, 'yyy')
  INTO test (id, name) VALUES (3, 'zzz')
SELECT * FROM dual;
DB2, MySQL, MSSQL, PostgreSQL example:
INSERT INTO test (id, name) VALUES (1, 'xxx'), (2, 'yyy'), (3, 'zzz');

Note: In MySQL you can use the rewriteBatchedStatements=true property in the connection string. It creates a single SQL statement from all the batch data and sends it to the database to execute in one process. That means it's faster than standard batching. However, it's slower than the single-SQL-statement approach because of the cost of building one SQL statement from all the batch data. You can also tune the max_allowed_packet property of MySQL to limit the number of rows that can be executed in a batch statement. Depending on the processor and database engine, the number of rows that can be executed in a batch may vary, and you may have to limit the maximum number of rows in a batch.

Conclusion

     Batch statements are crucial in performance-sensitive applications, since they reduce database calls and improve response times. You can also choose to run single SQL statements instead of batching to get faster results. Be careful, though: batch statements are not necessarily executed in a single process on the database side; that is, you may need to enable database-specific properties to get multi-row operations executed in one process.

Using Prepared Statement In Java


Introduction:

     Using prepared statements, we can get faster and more secure results when we send queries to database systems. In this post we'll discuss both the security benefits and the performance impact of using prepared statements.

Why more secure?

     Below is Java code written to execute an update statement against a MySQL database. Normally we should use connection pooling instead of connecting to the database directly, but here we use a direct connection just for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

String CONN_STRING = "jdbc:mysql://localhost:3306/DB_NAME?user=USER&password=PWD";
String UPDATE_SQL = "UPDATE User SET age=? WHERE name=?";
Connection conn = null;
PreparedStatement pstmt = null;
try {
    Class.forName("com.mysql.jdbc.Driver");
    conn = DriverManager.getConnection(CONN_STRING);
    conn.setAutoCommit(false);
    pstmt = conn.prepareStatement(UPDATE_SQL);
    pstmt.setInt(1, 30);        // bind the first ? placeholder
    pstmt.setString(2, "xxx");  // bind the second ? placeholder
    int rows = pstmt.executeUpdate();
    conn.commit();
    System.out.println(rows + " rows updated");
} catch (Exception e) {
    try {
        if (conn != null && !conn.isClosed())
            conn.rollback();
        // TODO logging...
    } catch (SQLException e1) {
        // TODO logging...
    }
} finally {
    try {
        if (pstmt != null)
            pstmt.close();
        if (conn != null && !conn.isClosed())
            conn.close();
    } catch (SQLException e) {
        // TODO logging...
    }
}

As we can see from the code above, the parameters the SQL sentence expects are supplied via question marks. Suppose the age and name values come from the external world, e.g. from a screen. In that case, if we concatenated the age and name values directly into our SQL sentence instead of using a prepared statement, we would allow the external system to modify our SQL sentence. With a prepared statement, on the other hand, a value that comes from the external world never becomes part of the SQL sentence; it remains only a parameter. This prevents SQL injection. An injected SQL sentence might be
UPDATE User SET age=5 WHERE name='xxx' or 1=1. Here the "name" value comes from the external world, includes the value 'xxx' or 1=1, and updates all rows of the table with age 5. To prevent this without prepared statements you would need extra checks in your Java code, but thanks to the prepared statement, the name parameter cannot end up embedded in the SQL sentence as 'xxx' or 1=1.
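A tiny sketch makes the contrast visible. The SQL here is only built as a string and never executed; the query text is illustrative only:

```java
// Tiny sketch of what naive concatenation does: the SQL is only built as a
// string here, never executed, and the query text is illustrative only.
public class InjectionDemo {
    static String buildNaive(String name) {
        // the external value becomes part of the SQL sentence itself
        return "UPDATE User SET age=5 WHERE name='" + name + "'";
    }

    public static void main(String[] args) {
        String external = "xxx' or '1'='1";  // value arriving from the outside world
        System.out.println(buildNaive(external));
        // the WHERE clause now matches every row; with a PreparedStatement the
        // whole string would have been bound as one parameter value instead
    }
}
```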

Why better performance?

     First, let's examine the steps a query passes through before executing. I'll try to be generic, but these steps may vary slightly depending on the database engine.

1.)   Parsing:

     In this step, the SQL sentence is checked for correct syntax, spelling and keywords. It is also checked for the existence of the referenced tables and columns. Depending on the database engine, there may also be some syntax optimizations, such as converting right outer joins to left outer joins and converting subqueries to equivalent joins.

2.)   Compiling:

     This step may be optional. If the SQL sentence includes database-specific functions or expressions, they need to be compiled down to standard DML.

3.)   Optimizing:

           The most important step is optimization, since it directly affects performance. In the optimizing step, the query processor of the database engine tries to find the most effective way to run the query. If the SQL sentence is executed for the first time, the processor looks at the available statistics and extracts an execution plan (access path) for the query. This execution plan is derived from the statistics of indexes, the relations between tables and the data load on tables. The plan may include any of the join techniques, including nested loop join, merge join and hash join. For details of the join techniques, see the first part of this series above.

4.)   Execution:
      
           After an execution plan is determined, the query is executed through communication between the query processor and the database engine. As we said, if a previously created plan exists, it is reused by the query processor; that means the first three steps are executed only once. How this plan is found by the processor may vary from database to database. A possible solution is storing a hash of the query at the first execution and later using this hash value to determine whether a plan exists. The most used plans may also be cached in memory for fast lookup.

Now let's look at the benefit of prepared statements at this point. Creating an execution plan for a query may take longer than executing the query itself. Since it's a heavyweight process, after the first execution the plan is saved and reused for later invocations. If you don't use a prepared statement and send the SQL parameters as part of the query text, each execution is treated as a different query by the query processor, because the changed parameters change the SQL sentence, and each time a new execution plan is created.
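The effect can be mimicked with a hypothetical sketch of a plan cache keyed by a hash of the SQL text. Real engines are far more involved; the class and "plan" strings here are made up purely to show why parameterized text hits the cache while embedded literals miss it:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of plan reuse keyed by a hash of the SQL text; real
// engines are far more involved, this only mimics the caching effect.
public class PlanCacheDemo {
    private final Map<Integer, String> planCache = new HashMap<>();
    int optimizeCalls = 0; // counts how often the "expensive" optimizer runs

    String getPlan(String sql) {
        return planCache.computeIfAbsent(sql.hashCode(), h -> {
            optimizeCalls++;                  // parse + compile + optimize
            return "plan-for:" + sql;
        });
    }

    public static void main(String[] args) {
        PlanCacheDemo engine = new PlanCacheDemo();
        engine.getPlan("UPDATE User SET age=? WHERE name=?"); // optimized once
        engine.getPlan("UPDATE User SET age=? WHERE name=?"); // cache hit
        engine.getPlan("UPDATE User SET age=30 WHERE name='xxx'"); // literals: new plan
        System.out.println(engine.optimizeCalls); // prints 2
    }
}
```

With embedded literals, every distinct parameter value produces distinct SQL text and therefore a fresh optimization; with `?` placeholders, the text, and so the cached plan, stays the same.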


Application side of using prepared statements:

           So far we have looked at the database side of using prepared statements. However, there are also application and application-server side effects. On the Java side, since a prepared statement is a heavyweight object, it can be cached. Most application servers have the ability to cache prepared statements. If caching is enabled, not every call to the prepareStatement method of the Connection object creates a new statement; instead, an existing one is fetched from the cache when available.

      Another important point is closing the prepared statement after use. Most database implementations have both a connection limit and a maximum number of statements per connection. That means that if you forget to close prepared statements and you use more than one statement per connection, you can get a database error indicating there is no prepared statement available to open.

Conclusion:

           Today we talked about prepared statement usage in Java. Using prepared statements is very important if you want more secure and faster applications. Prepared statements are used with fine-grained JDBC calls, and although the first execution is relatively slow because of the cost of creating the execution plan, later invocations are as fast as stored procedures or SQLJ calls. As a related topic, you may like the post about batch statements above.