During big data computing, besides data traversal and grouping and aggregate operations, sometimes we need to retrieve one group of data each time to analyze. For example, analyze sales data by dates, plot sales curve for each product, and study the purchase habit of each client.
1. Fetch data by groups according to the expression
In esProc, we can use cs.fetch(;x) function or cs.skip(;x) function to get or skip records till the value of expression x is changed. By doing so, a group of consecutive data can be obtained. For example, retrieve the data of one product each time and examine the sales:
A | B | |
1 | =file(“Order_Wines.txt”) | |
2 | =file(“Order_Electronics.txt”) | |
3 | =file(“Order_Foods.txt”) | |
4 | =file(“Order_Books.txt”) | |
5 | =[A1:A4].(~.cursor@t().sortx(PID)) | |
6 | =A5.merge@x(PID) | |
7 | for 19 | =A6.skip(;PID) |
8 | =A6.fetch(;PID) | |
9 | >A6.close() |
The cursor in A6 contains the sales data of four kinds of products, which have been sorted by PID. From B7, records of the 20th piece of commodity have been retrieved:
Note that the traversal of cursor data in esProc is executed just once in a one-way street. Thus the data in cursor must be in order when retrieving a single group of records each time as necessary.
2. Fetch data from file cursor segmentally
Both articles – esProc External Memory Computing: Text Files and esProc External Memory Computing: Binary Files – deal with the usage of cs.fetch@z(). As we know, @z option can be used to retrieve file or fetch data from cursor segmentally (or by block). However, when fetching data segmentally, it is esProc that determine how the data are divided and sometimes problems may arise.
First, let’s create a text file to save the above used data which are already sorted by PID as a new binary file Order_Products:
A | |
1 | =file(“Order_Wines.txt”) |
2 | =file(“Order_Electronics.txt”) |
3 | =file(“Order_Foods.txt”) |
4 | =file(“Order_Books.txt”) |
5 | =[A1:A4].(~.cursor@t().sortx(PID)) |
6 | =A5.merge@x(PID) |
7 | =file(“Order_Products”) |
8 | >A7.export@b(A6) |
In the later computation, if retrieving data by segment, we will get the situation given below:
A | |
1 | =file(“Order_Products”) |
2 | =A1.cursor@bz(;,1:100) |
3 | =A2.fetch() |
4 | =A1.cursor@bz(;,2:100) |
5 | =A4.fetch() |
After all data are divided into 100 segments, retrieve data from the 1st segment in A3, and from the 2nd segment in A5, as shown below:
Now a problem appears. For the product with the number B1445, its sales records appear in both groups. If we need to perform the aggregate operation after each time data are fetched, then duplicate product numbers will appear in the return result. Thus re-aggregation will be necessary to get the final result. Such segmental computation is quite common for the parallel computation over big data, but the above problem will make the computation more complicated. To avoid it, we should store data through the way of segmentation by groups.
When storing the binary data with the cursor, simply use the @g option. In this case, the data written into the cursor will be segmented by groups. By doing so, the data from a same group is sure to be fully fetched all at once when retrieving data segmentally. For example:
A | |
1 | =file(“Order_Wines.txt”) |
2 | =file(“Order_Electronics.txt”) |
3 | =file(“Order_Foods.txt”) |
4 | =file(“Order_Books.txt”) |
5 | =[A1:A4].(~.cursor@t().sortx(PID)) |
6 | =A5.merge@x(PID) |
7 | =file(“Order_Products_G”) |
8 | >A7.export@bg(A6;PID) |
Save the data that have been sorted by the product numbers as a binary file Order_Products_G and segment it by groups according to PID. This is slightly different to the method we adopted previously to write the data to the file of Order_Products. Please note that segmental storage is only valid for the binary file. The field by which the data are written into a binary file through the way of segmentation by groups is determined by the parameter used in export@bg() function. It is specified in A8 that A6’s cursor is segmented by groups according to PID, for instance.
At this point, result will be different if we fetch data by segments:
A | |
1 | =file(“Order_Products_G”) |
2 | =A1.cursor@bz(;,1:100) |
3 | =A2.fetch() |
4 | =A1.cursor@bz(;,2:100) |
5 | =A4.fetch() |
The binary data stored like this is fetched using the cursor in the same way as that of other binary data being fetched with the cursor. But their data fetching results are different as a result of segmenting data by groups during file saving in the former case. Data fetched in A3 and A5 are as follows:
Therefore records of the product whose ID is B1445 will all be fetched for segment one and the data for segment two will be fetched from the next product. It can be seen that, after we specify that data are written into the binary file through the way of segmentation by groups, data of the same group will be put in the same segment for being fetched from the cursor. With segmentation by group, the integrity of the data in each group is guaranteed, and the segmental computation over big data becomes simpler and easier.