A Code Example of External Memory Sorting in esProc

Uncategorized 863 0

We often need to sort records in a table for data analysis and computing. esProc supports sort function to sort sequences or table sequences. External memory sorting is required when the data being sorted is too huge to be loaded into the memory all together and the ordinary sorting method cannot be used. 

For example, a text file, Order_Foods.txt, contains 50,000 pieces of information of food orders: 

esProc_file_compute_sort_1

Sort the ordering data according to the following instructions:

①     Sort data by food names in ascending order.

②     Sort data by order dates in ascending order; then sort data with the same date by food names in descending order.

③     Sort data by order dates in descending order; then sort data with the same date by order amount in descending order. 

In fact 50,000 rows of data cannot be counted huge. We just take it as an example to illustrate how to sort big data. In practical business, we may need to handle several hundred million records or the data files of a couple of dozen or nearly a hundred GB size.

esProc uses cursors to process big data. With the cursor, records in a data table will not be fetched all at once. Only one or multiple records will be fetched each time according to the setting. By doing so, memory overflow can be avoided during data computing. Since only a part of the data is fetched each time, sort and group operations on all cursor data cannot be completed in one step. Using external memory, esProc cs.sortx(x…;n) function can fetch data from cursors in batches, process each batch separately and store the results in the external files, according to n, the pre-specified number of rows in buffer area . In this way, memory overflow won’t happen. The result of sorting data with the cursor is still a cursor. The way data is fetched from it is exactly the same as that data in any other cursor is fetched.

Let’s look at how to sort data in Order_Foods.txt:

  A
1 =file(“Order_Foods.txt”)
2 =A1.cursor@t()
3 =A2.sortx(PName;1000)
4 =A3.fetch(1000)
5 >A3.close()
6 =A1.cursor@t()
7 =A6.sortx(Date,-PName;1000)
8 =A7.skip(20000)
9 =A7.fetch(1000)
10 >A7.close()
11 =A1.cursor@t()
12 =A11.sortx(-Date,-PName;1000)
13 =A12.fetch(1000)

A2 uses text file Order_Foods.txt to generate a cursor and makes the first row the column names. A3 uses sortx function to sort data in the cursor in A2 by food names in ascending order. Meanwhile, the number of rows in buffer area is set as 1,000, which indicates data sorting will be performed every 1,000 records and the result will be stored temporarily as an external file. The final result of data sorting is also a cursor:

esProc_file_compute_sort_3

While this cursor is being used to fetch data, it merges all generated external files automatically and fetches the desired data. As A4 that fetches the first 1,000 rows from the sorted data: 

esProc_file_compute_sort_4

A5 closes the cursor and the external files generated by sort operation are automatically and simultaneously deleted. Since the cursor in A3 is generated by the cursor in A2, the latter will close if the former closes. If we want to sort data by order dates in ascending order and then by food names in descending order, the cursor needs to be re-created in A6. For more information about sorting big data, see esProc External Memory Computing: Principle of Sorting

A7 sorts data in the cursor as required using external memory. We can specify multiple sorting fields or sorting expressions for the sortx function. If an expression requires sorting in descending order, put a negative sign before it. The sorting result in A7 is still a cursor:

esProc_file_compute_sort_5

This time, A8 skips 20,000 records; A9 fetches data from the 20,001th row to the 21,000th row: 

esProc_file_compute_sort_6

Having done its job, the cursor should be closed in A10 to remove the external files instantly. If all data in the cursor is fetched, it will close automatically and it is not necessary to call cs.close() instruction.

Adding a negative sign before a sorting expression in sorting data in descending order applies to not only the string-type data in A7, but the date type data or numerical data. A12, which sorts data in descending order, uses date type a field and a field of numerical data. A13 fetches the first 1,000 rows of sorted data:

esProc_file_compute_sort_7

FAVOR (0)
Leave a Reply
Cancel
Icon

Hi,You need to fill in the Username and Email!

  • Username (*)
  • Email (*)
  • Website