esProc File Computing: Parallel Grouping and Aggregation

Uncategorized 791 0

The article esProc File Computing: Parallel Query and Filter explained how to query and filter data. Here let’s look at the case of combining the article’s topic with grouping and aggregation. Still we only discuss cases with relatively small result sets that can be loaded into memory in one go.  

Steps of querying and filtering data and then grouping and aggregating it using multithreads are: Each of the threads queries and filters and groups and aggregates a part of the data; then results of all threads’ are combined and grouped and aggregated; the result will then be filtered in main program. The following example is cited to illustrate this. 

As big data is generally stored in files, an Orders.b file will be taken as an example. Its data is as follows:

ORDERID CLIENT     SELLERID AMOUNT ORDERDATE NOTE

1       287  47     5825         2013-05-31       gafcaghafdgie f ci…

2       89     22     8681         2013-05-04       gafcaghafdgie f ci…

3       47     67     7702         2009-11-22       gafcaghafdgie f ci…

4       76     85     8717         2011-12-13       gafcaghafdgie f ci…

5       307  81     8003         2008-06-01       gafcaghafdgie f ci…

6       366  39     6948         2009-09-25       gafcaghafdgie f ci…

7       295  8       1419         2013-11-11       gafcaghafdgie f ci…

8       496  35     6018         2011-02-18       gafcaghafdgie f ci…

9       273  37     9255         2011-05-04       gafcaghafdgie f ci…

10     212  0       2155         2009-03-22       gafcaghafdgie f ci…

Note: In the above data, note field exists only for the purpose of increasing each record’s length, but does not have any practical meaning. 

You need to group ordering records in the year of 2013 by client and sellerid and compute the number of orders (ordered) and the sum of order amount (amount); and finally filter the result according to the criterion that the order amount is greater than 500,000. 

Since Orders.b contains relatively big data, it needs to be processed by segments. First write the script group.dfx in esProc for multithreaded computing:

 

A

B

1

4  

2

fork to(A1) =file(“/files/data/orders.b”).cursor@bz(ORDERID,CLIENT,SELLERID,AMOUNT,ORDERDATE;,A2:A1)

3

  =B2.select(year(date(ORDERDATE))==2013)

4

  =B3.groups(CLIENT:client,SELLERID:sellerid;count(ORDERID):corderid,sum(AMOUNT):samount)

5

  result B4

6

=A2.conj()  

7

=A6.groups(client,sellerid;sum(corderid):corderid,sum(samount):samount)

8

=A7.select(samount>500000)

9

result A8

A1: Set the number of parallel threads for data aggregation as 4.

A2: The code from B2 to B5 is executed using multithreads through the keyword fork. There are 4 threads, which get A2’s value as 1, 2, 3, 4 respectively.

B2: Use cursor function to divide the file into 4 segments roughly and get the cursor of A2 (only the desired fields are fetched).

B3: Query and find data in the year of 2013.

B4: Group and aggregate the cursor data.

B5: Return B4, the filtering result of the current thread.

A6: The returned results of the four threads are concatenated in the main thread.

A7: Group and aggregate the concatenation result.

A8: Filter A7’s grouped data according to the criterion that order amount is greater than 500,000

A9: Return the final aggregate result to the external program. 

Save the esProc script as group.dfx when it is finished. It is then called by the external program via esProc JDBC. See esProc Tutorial for the calling method. 

For the performance test of data query and filter approach using esProc multithreading, see Performance Test of esProc File Traversal Algorithm.

FAVOR (0)
Leave a Reply
Cancel
Icon

Hi,You need to fill in the Username and Email!

  • Username (*)
  • Email (*)
  • Website