esProc File Computing: Parallel Query and Filter

Raqsoft esProc supports file-based computation. For relatively big data it offers multithreaded parallel processing, which makes full use of the machine’s multi-core CPU and can achieve performance close to, or better than, that of a conventional database.

Only the case of a relatively small result set is explored here, that is, the case in which the entire result of the computation fits in memory.

The following is the architecture of esProc multithreaded parallel processing:

[Figure: esProc Multithreaded Parallel Processing Architecture]

As shown in the above figure, esProc distributes a task to multiple subscripts through a main script. Each subscript accesses a part of the local data and computes it. After the subscripts finish their computations, they return their results to the main script, which combines them into the final result and passes it to the host application, such as a reporting tool.

Each subscript is a thread. In theory, the number of parallel tasks a server can run is determined by the number of CPU cores and by how well the hardware handles parallel access. The more CPU cores a server has and the better the hardware’s parallel access ability is, the more parallel tasks the server can hold and the faster the task is performed. Multithreaded parallel processing can therefore take the greatest advantage of the machine’s computational power.

Multithreaded data query and filter works as follows: each thread queries a part of the data, and then the results of all the threads’ queries are combined. Here is an example. As big data is commonly stored in a file, the Orders.txt file is used to illustrate this, as shown below:

ORDERID CLIENT SELLERID AMOUNT ORDERDATE NOTE
1 287 47 5825 2013-05-31 gafcaghafdgie f ci…
2 89 22 8681 2013-05-04 gafcaghafdgie f ci…
3 47 67 7702 2009-11-22 gafcaghafdgie f ci…
4 76 85 8717 2011-12-13 gafcaghafdgie f ci…
5 307 81 8003 2008-06-01 gafcaghafdgie f ci…
6 366 39 6948 2009-09-25 gafcaghafdgie f ci…
7 295 8 1419 2013-11-11 gafcaghafdgie f ci…
8 496 35 6018 2011-02-18 gafcaghafdgie f ci…
9 273 37 9255 2011-05-04 gafcaghafdgie f ci…
10 212 0 2155 2009-03-22 gafcaghafdgie f ci…
…

In the above data, the NOTE field exists only to increase each record’s length and has no practical meaning.

You need to query and filter the data by the criteria “sellerid=1 and client=50 and year of orderdate>=2013” and pass the result to the external Java program.
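The criteria can be pictured as a simple predicate over one record. The following Python sketch is only an illustration and is not esProc code; the function name `matches` and the dict-based record layout are assumptions made for the example, and the date condition is read as "order year 2013 or later", which is how the select.dfx script in this article expresses it.

```python
from datetime import date


def matches(order: dict) -> bool:
    """Return True when an order satisfies the query criteria."""
    return (
        order["sellerid"] == 1
        and order["client"] == 50
        and order["orderdate"].year >= 2013
    )


# Hypothetical record that satisfies all three conditions.
hit = {"sellerid": 1, "client": 50, "orderdate": date(2013, 7, 1)}
# Values taken from the first sample row (ORDERID 1), which does not match.
miss = {"sellerid": 47, "client": 287, "orderdate": date(2013, 5, 31)}

print(matches(hit), matches(miss))
```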

Because Orders.txt contains a large amount of data, it is divided into multiple segments for processing. First, write the esProc script select.dfx for the multithreaded parallel query. The following is the script:

	A	B
1	4
2	fork to(A1)	=file("/tools/data/orders.txt").cursor@tz(;,A2:A1)
3		=B2.select(sellerid==1 && client==50 && year(date(orderdate))>=2013).fetch()
4		result B3
5	=A2.conj()
6	result A5

A1: Set the number of parallel tasks to 4.

A2: The keyword fork runs the code in B2 to B4 in multiple threads. There are 4 threads, in which A2 takes the values 1, 2, 3 and 4 respectively.

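B2: Use the cursor function to divide the file into roughly 4 segments and get a cursor over segment A2 (the field list before the semicolon is left empty here, so all fields are read; listing field names there would fetch only the desired fields).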

B3: Filter data in the cursor.

B4: Return B3, the filtering result of the current thread.

A5: The main thread concatenates the results returned by the four threads.

A6: Return the final result to the external program.
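For readers who do not know esProc, the same pattern (split the file into segments, filter each segment in its own thread, concatenate the partial results) can be sketched in Python. This is an analogy under stated assumptions, not a description of how esProc implements fork or cursor@z: the file is assumed to be tab-separated with one title line, and the names `read_segment`, `filter_segment` and `parallel_select` are hypothetical. In CPython, threads mainly overlap I/O, so the sketch shows the structure of the computation rather than its performance.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_segment(path, k, n):
    """Yield the raw data lines of segment k (1-based) out of n byte-range segments."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        data_start = len(f.readline())  # the title line is not data
        if data_start == 0:             # empty file: nothing to read
            return
        span = size - data_start
        start = data_start + span * (k - 1) // n
        end = data_start + span * k // n
        # Step back one byte and discard up to the next newline, so reading
        # begins at the first line whose first byte lies at or after `start`.
        f.seek(start - 1)
        f.readline()
        pos = f.tell()
        while pos < end:                # a line belongs to the segment it starts in
            line = f.readline()
            if not line:
                break
            pos += len(line)
            yield line


def filter_segment(path, k, n):
    """Filter one segment; plays the role of cells B2 to B4."""
    selected = []
    for raw in read_segment(path, k, n):
        fields = raw.decode("utf-8").rstrip("\r\n").split("\t")
        if len(fields) < 5:
            continue                    # skip blank or malformed lines
        client, sellerid, year = int(fields[1]), int(fields[2]), int(fields[4][:4])
        if sellerid == 1 and client == 50 and year >= 2013:
            selected.append(fields)
    return selected


def parallel_select(path, n=4):
    """Run n segment filters in threads and concatenate their results in segment order."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = pool.map(lambda k: filter_segment(path, k, n), range(1, n + 1))
        return [row for part in parts for row in part]
```

Because every line is assigned to the segment in which its first byte falls, no record is read twice or lost at a segment boundary, and the concatenated result does not depend on the number of segments.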

Save the esProc script as select.dfx when it is finished. It is then called by the external program via esProc JDBC. See esProc Tutorial for the calling method.

Converting the text file into the binary format that esProc provides improves performance further. The conversion code is as follows:

	A	B
1	=file("/tools/data/orders.txt").cursor@t()
2	=file("/tools/data/orders.b").export@b(A1)

A1: Create a text file cursor.

A2: Export the data from the text file cursor to a file in binary format.
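The layout of esProc's binary format is not described in this article, so the following Python sketch is only an analogy for the general idea of a text-to-binary conversion: values are stored already parsed and each record has a fixed width, so a segment can be located by arithmetic instead of by scanning for line breaks. The record layout, the names `RECORD`, `text_to_binary` and `read_binary_segment`, and the decision to drop the NOTE field are all assumptions made for the example.

```python
import os
import struct
from datetime import date

# Hypothetical layout: five little-endian 32-bit integers per record
# (orderid, client, sellerid, amount, order date as a day ordinal).
RECORD = struct.Struct("<5i")


def text_to_binary(txt_path, bin_path):
    """Convert a tab-separated text file with a title line; return the record count."""
    count = 0
    with open(txt_path, encoding="utf-8") as src, open(bin_path, "wb") as dst:
        next(src, None)                       # skip the title line
        for line in src:
            f = line.rstrip("\r\n").split("\t")
            if len(f) < 5:
                continue                      # skip blank or malformed lines
            day = date.fromisoformat(f[4]).toordinal()
            dst.write(RECORD.pack(int(f[0]), int(f[1]), int(f[2]), int(f[3]), day))
            count += 1
    return count


def read_binary_segment(bin_path, k, n):
    """Return the records of segment k (1-based) of n, split by record count."""
    total = os.path.getsize(bin_path) // RECORD.size
    first, last = total * (k - 1) // n, total * k // n
    with open(bin_path, "rb") as f:
        f.seek(first * RECORD.size)           # jump straight to the first record
        data = f.read((last - first) * RECORD.size)
    return list(RECORD.iter_unpack(data))
```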

Modify select.dfx as follows:

	A	B
1	4
2	fork to(A1)	=file("/tools/data/orders.b").cursor@bz(;,A2:A1)
3		=B2.select(sellerid==1 && client==50 && year(date(orderdate))>=2013).fetch()
4		result B3
5	=A2.conj()
6	result A5

You can see that the script is almost the same: apart from the file name, the only change is that the options of the cursor function in B2 have become @bz for retrieving binary data.

On the same hardware, querying and filtering 3.4 GB of data takes 24 seconds from a text file, but only 4 seconds from a binary file.

For a performance test of this multithreaded query and filter approach, see esProc Performance Test of File Traversal Algorithm. According to that test and its comparison with Oracle, Oracle performs better when the data volume is smaller than the available memory, while esProc usually outperforms Oracle when the data volume exceeds the available memory.

The above covers the case where the parallel program runs on a single computer. For even bigger data, you can use a cluster of esProc servers to further improve performance through multi-computer parallel processing.
