Performance Test of esProc File Traversal Algorithm

Blog 1146 0

This article focuses on testing performance of esProc’s handling of file traversal, using commonly used group and aggregate operations as well as query and filter operations. The performance test will include evaluation of the influence of using multithreads, and comparison between esProc approach and Oracle counterpart for the same operation.

We use ordering information as the test data. The following is a selection of the Orders.b file in esProc binary format:

ORDERID CLIENT     SELLERID AMOUNT ORDERDATE NOTE

1       287  47     5825         2013-05-31       gafcaghafdgie f ci…

2       89     22     8681         2013-05-04       gafcaghafdgie f ci…

3       47     67     7702         2009-11-22       gafcaghafdgie f ci…

4       76     85     8717         2011-12-13       gafcaghafdgie f ci…

5       307  81     8003         2008-06-01       gafcaghafdgie f ci…

6       366  39     6948         2009-09-25       gafcaghafdgie f ci…

7       295  8       1419         2013-11-11       gafcaghafdgie f ci…

8       496  35     6018         2011-02-18       gafcaghafdgie f ci…

9       273  37     9255         2011-05-04       gafcaghafdgie f ci…

10     212  0       2155         2009-03-22       gafcaghafdgie f ci…

NOTE field is added for the purpose of increasing each record’s length but hasn’t practical meaning.

esProc script for grouping and aggregating data:

  A B
1 4  
2 fork to(A1) =file(“/files/orders.b”).cursor@bz(orderid,client,sellerid,amount;,A2:A1)
3   =B2.groups(client,sellerid;count(orderid):corderid,sum(amount):samount)
4   result B3
5 =A2.conj()  
6 =A5.groups(client,sellerid;sum(corderid):corderid,sum(samount):samount)
7 result A6  

Oracle approach to grouping and aggregating data using SQL:

select /*+ Parallel(8) */ client,sellerid,count(orderid),sum(amount) from orders group by client,sellerid .

esProc script for querying and filtering data:

  A B
1 4  
2 fork to(A1) =file(“/tools/data/orders.b”).cursor@bz(;,A2:A1)
3   =B2.select(sellerid==1 && client==50 && year(date(orderdate))>=2013).fetch()
4   result B3
5 =A2.conj()  
6 result A5  

Oracle approach to querying and filtering data using SQL:

select /*+ Parallel(4) */ * from orders where sellerid=1 and client=50 and to_char(orderdate,’yyyy’)>= 2013 .

Two types of hardware are used for the test:PC server (disk array) and regular PC (SSD)

Configuration for a PC server:

CPU: 2 Xeon E5620  ( 4 cores 8 threads, 16 threads in total)

Memory: 24GB

Hard drive: 800G HDD

Configurain for a regular PC:

CPU:1 Core(TM) i5-3450 (4 cores)

Memory: 16GB

SSD

Note: We will get the test result by computing the average value of running a script 3 times, during which the machine won’t be restarted. Both Oracle program and esProc program might store data temporarily in the disk.

Test results for the PC server:

Unit: Seconds

Test 1 for data grouping and aggregation (data volume is 25G):

Number of parallel tasks Oracle esProc
Single task 197 170
2 subtasks 204 112
4 subtasks 208 69
8 subtasks 210 65
16 subtasks 219 69

Test 2 for data grouping and aggregation (data volume is 10G):

Number of parallel tasks Oracle esProc
Single task 32 89
2 subtasks 16 51
4 subtasks 8 24
8 subtasks 5 19
16 subtasks 4 14

Test 3 for data grouping and aggregation (data volume is 3.5G):

Number of parallel tasks Oracle esProc
Single task 13 36
2 subtasks 6 20
4 subtasks 4 11
8 subtasks 2 8
16 subtasks 2 9

Test 1 for data query and filtering (data volume is 25G):

Number of parallel tasks Oracle esProc
Single task 83 107
2 subtasks 55 90
4 subtasks 79 65
8 subtasks 94 73
16 subtasks 96 93

Test 2 for data query and filtering (data volume is 10G):

Number of parallel tasks Oracle esProc
Single task 7 70
2 subtasks 3 29
4 subtasks 2 15
8 subtasks 1.3 11
16 subtasks 1.3 10

Test 3 for data query and filtering (data volume is 3.5G):

Number of parallel tasks Oracle esProc
Single task 2.7 24
2 subtasks 1.6 12
4 subtasks 0.9 6
8 subtasks 0.6 4
16 subtasks 0.7 4

Test result summary on PC server (disk array):

1. Maximum memory supported by Oracle database is 12G. When data volume is below this limit, like 10G and 3.5G, Oracle’s performance is better than esProc’s.

2. When data volume is greater than 12G, like 25G, the use of parallel computing has little effect of increasing performance. In grouping and aggregating data, esProc’s performance is significantly higher than Oracle; the two are nearly even in performing data query and filtering.

3. esProc makes quite a leap in performance in each addition of the number of parallel tasks from 1 to 4. But during the increase from 4 to 16, its performance improves not as much as Oracle’s.  

Test results for the regular PC:

Unit:Seconds

Test 1 for data grouping and aggregation (data volume is 25G):

Number of parallel tasks Oracle esProc
Single task 205 123
2 subtasks 206 107
4 subtasks 191 106

Test 2 for data grouping and aggregation (data volume is 10G):

Number of parallel tasks Oracle esProc
Single task 69 65
2 subtasks 68 37
4 subtasks 52 34

Test 3 for data grouping and aggregation (data volume is 3.5G):

Number of parallel tasks Oracle esProc
Single task 10 16
2 subtasks 5 9
4 subtasks 4 6

Test 1 for data query and filtering (data volume is 25G):

Number of parallel tasks Oracle esProc
Single task 143 112
2 subtasks 143 109
4 subtasks 144 107

Test 2 for data query and filtering (data volume is 10G):

Number of parallel tasks Oracle esProc
Single task 44 42
2 subtasks 44 24
4 subtasks 43 15

Test 3 for data query and filtering (data volume is 3.5G):

Number of parallel tasks Oracle esProc
Single task 1.5 9
2 subtasks 0.9 5
4 subtasks 0.7 3

Test result summary on regular PC:

1. In this case the maximum memory supported by Oracle database is 6G. When data volume is within the memory capacity, like 3.5G, Oracle performs better than esProc.

2. When data volume exceeds 12G, like 25G and 10G, parallel processing enhances performance very little. In grouping and aggregating data, esProc’s performance is significantly higher than Oracle; but only slightly better in performing data query and filtering.

Performance test summary:

According to the test results of both types of hardware, Oracle can achieve a better performance when its available memory can hold all data. The reason may be that it stores the data temporarily in the memory, instead of actually reading the disk. So it can work faster. In comparison, esProc, with no disk cache, reads the file for each computing action and thus can only make use of the limited buffer provided by the file system.

When a single task involves data that is more than the available memory can hold, esProc’s multithreading approach can achieve higher performance.

On the other hand, although PC server supports 16-thread computational power, performance rarely increases when the number of parallel tasks is nearly 16, due to the limited throughput of the disk array. In view of this, an esProc solution should set a reasonable number of parallel tasks according to the machine’s hard drive performance. 

FAVOR (0)
Leave a Reply
Cancel
Icon

Hi,You need to fill in the Username and Email!

  • Username (*)
  • Email (*)
  • Website