esProc File Computing: Parallel Grouping and Aggregation

The article esProc File Computing: Parallel Query and Filter explained how to query and filter data. Here let’s look at the case of combining the article’s topic with grouping and aggregation. Still we only discuss cases with relatively small result sets that can be loaded into memory in one go.

Steps of querying and filtering data and then grouping and aggregating it using multithreads are: Each of the threads queries and filters and groups and aggregates a part of the data; then results of all threads’ are combined and grouped and aggregated; the result will then be filtered in main program. The following example is cited to illustrate this.

As big data is generally stored in files, an Orders.b file will be taken as an example. Its data is as follows:

ORDERID CLIENT SELLERID AMOUNT ORDERDATE NOTE

1 287 47 5825 2013-05-31 gafcaghafdgie f ci…

2 89 22 8681 2013-05-04 gafcaghafdgie f ci…

3 47 67 7702 2009-11-22 gafcaghafdgie f ci…

4 76 85 8717 2011-12-13 gafcaghafdgie f ci…

5 307 81 8003 2008-06-01 gafcaghafdgie f ci…

6 366 39 6948 2009-09-25 gafcaghafdgie f ci…

7 295 8 1419 2013-11-11 gafcaghafdgie f ci…

8 496 35 6018 2011-02-18 gafcaghafdgie f ci…

9 273 37 9255 2011-05-04 gafcaghafdgie f ci…

10 212 0 2155 2009-03-22 gafcaghafdgie f ci…

…

Note: In the above data, note field exists only for the purpose of increasing each record’s length, but does not have any practical meaning.

You need to group ordering records in the year of 2013 by client and sellerid and compute the number of orders (ordered) and the sum of order amount (amount); and finally filter the result according to the criterion that the order amount is greater than 500,000.

Since Orders.b contains relatively big data, it needs to be processed by segments. First write the script group.dfx in esProc for multithreaded computing:

	A	B
1	4
2	fork to(A1)	=file(“/files/data/orders.b”).cursor@bz(ORDERID,CLIENT,SELLERID,AMOUNT,ORDERDATE;,A2:A1)
3		=B2.select(year(date(ORDERDATE))==2013)
4		=B3.groups(CLIENT:client,SELLERID:sellerid;count(ORDERID):corderid,sum(AMOUNT):samount)
5		result B4
6	=A2.conj()
7	=A6.groups(client,sellerid;sum(corderid):corderid,sum(samount):samount)
8	=A7.select(samount>500000)
9	result A8

A1: Set the number of parallel threads for data aggregation as 4.

A2: The code from B2 to B5 is executed using multithreads through the keyword fork. There are 4 threads, which get A2’s value as 1, 2, 3, 4 respectively.

B2: Use cursor function to divide the file into 4 segments roughly and get the cursor of A2 (only the desired fields are fetched).

B3: Query and find data in the year of 2013.

B4: Group and aggregate the cursor data.

B5: Return B4, the filtering result of the current thread.

A6: The returned results of the four threads are concatenated in the main thread.

A7: Group and aggregate the concatenation result.

A8: Filter A7’s grouped data according to the criterion that order amount is greater than 500,000

A9: Return the final aggregate result to the external program.

Save the esProc script as group.dfx when it is finished. It is then called by the external program via esProc JDBC. See esProc Tutorial for the calling method.

For the performance test of data query and filter approach using esProc multithreading, see Performance Test of esProc File Traversal Algorithm.

Hi，You need to fill in the Username and Email！