esProc External Memory Computing: Merge and Join Cursor Data discussed how to have several big data tables joined and retrieve data from them. But we notice that, though it’s convenient to handle big data with the cursor, there’s the restriction of one-way retrieval from front to back. Hence data sorting becomes the prerequisite for joining big data tables on certain conditions. Without the sorting, data loss may happen during the retrieval. A virtual table can be used to prevent this from happening.
1. How to use a virtual table
The following example shows the restriction of using cursors to join big data tables.
A | B | |
1 | =file(“BirthStateRecord”) | =file(“BirthdaySummary”) |
2 | =A1.cursor@b() | =B1.cursor@b() |
3 | =join@x1(A2,Birthday;B2,Birthday) | =A3.fetch(100) |
4 | >A3.close() |
The binary data file BirthStateRecord holds Birthdays and States by IDs of persons:
Another binary data file BirthdaySummary records the number of persons who were born on each date in BirthStateRecord and sorts records by Birthday:
A3 associates the Birthday of each person’s record with the Birthday of corresponding count record. Since the join operation involves two cursors, join@x function is used. @1 option here performs a left join to display records of all persons. By the way, in case of causing confusion, esProc uses the number 1, instead of the lower case letter l, in its function options. After the join, B3 fetches the first 100 records:
It is noticed that most of the records in the result are mismatched. This is because data in the BirthdaySummary cursor can only be fetched once from front to back, and thus the dates of count data being joined move forward only, unable to turn back to be retrieved. To make it right, we need to change the order of data in BirthStateRecord sorted by ID by re-sorting data by Birthday before doing the join. To do this, we can modify A2’s code to =A1.cursor@b().sortx(Birthday) according to the method introduced in esProc External Memory Computing: Principle of Sorting. The re-sorting process is a little time-consuming because it’ll cover all records in the source table. Here’s the result B3 will get:
This way the dates are well matched in the result but the records are disordered. Besides, an approach like this can’t handle a multi-field join.
In view of this, we can convert the data file to a virtual table and use it as a table sequence:
A | B | |
1 | =file(“BirthStateRecord”) | =file(“BirthdaySummary”) |
2 | =A1.cursor@b() | =B1.create@x(;1000) |
3 | =A2.switch(Birthday,B2) | =A3.fetch(100) |
4 | >A3.close() |
B2 doesn’t generate a cursor for B1’s data file. Instead, it creates a virtual table based on the binary table f using f.create@x(Fi,…;n), which will use the first field of the data file as the primary key if Fi,…, the primary key field, is not specified. To create a virtual table, we must use a binary data file rather than a normal one. The binary file has already been discussed in esProc External Memory Computing: Binary Files. With create@x function, we can set the number of records in buffer area through n. Thus during the execution f.’s data won’t be loaded into the memory at a time but imported in batches by utilizing the buffer, and can be processed like a table sequence. From B2’s binary file, for example, we can retrieve data directly:
Though a mere 1000 record is set for the buffer in the creation of virtual table, data can be retrieved from the virtual table as needed in batches and viewed like any table sequence.
B2 is not a cursor, so there’s no related restriction. A3 then joins the virtual table with A2’s cursor using switch function. After the joining B3 fetches the first 100 records:
At this time the data association is not affected any more by the cursor mode of data retrieval, and we can get the desired result correctly.
2. Comparison between virtual table and table sequence
There’s difference between a virtual table and a table sequence, although they work the same way. A virtual table is a dummy table sequence made up of bytes in the memory. We can’t add, delete and modify data on it, or change its structure.
Through esProc External Memory Computing: The Basics of the Use of Cursor, we’ve learned that table sequences can be joined using cs.switch function. The same function is used in the example of the previous section to have the virtual table joined. The batch retrieval of data from the virtual table according to the buffer may result in a conspicuous lower efficiency, compared with the data retrieval through a table sequence. An alternative is to create a virtual table using f.create (Fi,…) without @x option and a buffer set up. Now the data of the virtual table will be entirely loaded into the memory as bytes but the desired data will only be formed when needed. This method of using a virtual table causes more memory usage than using it with a buffer, yet it can effectively improve the data fetching efficiency.
The following example is used to compare the three methods in efficiency:
A | B | |
1 | =file(“BirthStateRecord”) | =file(“BirthdaySummary”) |
2 | =now() | |
3 | =A1.cursor@b() | =B1.create@x(;1000) |
4 | =A3.switch(Birthday,B3) | 0 |
5 | for A4,10000 | >B4+=A5.count() |
6 | =interval@ms(A2,now()) | |
7 | =now() | |
8 | =A1.cursor@b() | =B1.import@b() |
9 | =A8.switch(Birthday,B8) | 0 |
10 | for A9,10000 | >B9+=A10.count() |
11 | =interval@ms(A7,now()) | |
12 | =now() | |
13 | =A1.cursor@b() | =B1.create() |
14 | =A13.switch(Birthday,B13) | 0 |
15 | for A14,10000 | >B14+=A15.count() |
16 | =interval@ms(A12,now()) |
The above code uses the three methods to fetch data from the cursor generated from performing a foreign key association with switch function and to count the records. From line 2 to line 6, a virtual table with the buffer is used; from line 7 to line 11, a table sequence is used; and from line 12 to line 16, a virtual table whose data will be entirely loaded into the memory is used. After execution, B4, B9 and B14 get the same count:
A6, A11 and A16 respectively calculate the approximate time (ms) each method spends. Here are the results:
When a virtual table with a buffer is used, the data in the buffer could be refreshed frequently, which can significantly affect the computational performance. So the first method spends rather conspicuously more time than others. With the third method, all the information of the virtual table is loaded into the memory. The method uses just slightly longer time than the one using a table sequence. The fact is that the bigger the buffer the higher the efficiency, even though the buffer will take up more memory.
Despite the highest efficiency achieved in handling operations like foreign key association by using a table sequence, the method achieves the highest memory usage too because data of the whole table sequence is stored in the memory. Now let’s compare the degrees of memory usage for the use of a table sequence and a virtual table.
The binary data file BirthStateRecord stores 1,000,000 rows of data in 3 fields:
The following code generates test data files based on BirthStateRecord:
A | B | C | |
1 | =file(“BirthStateRecord”) | ||
2 | for 20 | =file(“testData”+string(A2)) | |
3 | for A2 | =A1.cursor@b() | |
4 | >B2.export@ab(C3) |
A2 loops 20 times to generate 20 data files. Then the code has the data of BirthStateRecord written into the nth data file n times. For example, testData11 will receive 11,000,000 rows of data:
First let’s look at what will happen when using a table sequence:
A | B | C | |
1 | for 30 | =”testData”+string(A1) | >output(B1) |
2 | =file(B1) | ||
3 | =B2.import@b() | =B3.count() |
Here each time a data file is retrieved as a table sequence and the records in it are counted. Once a table sequence fails to be retrieved into the memory, the error occurs and computation stops working properly:
By checking B1’ data, we know that which data file has gone wrong:
This means the retrieval went well to testData4, which is retrieved as a table sequence of 4,000,000 records of 3 columns. Of course the degrees of memory usage of table sequences vary, depending on data size and the number of fields of the table sequence.
Then we’ll take a look at the performance of the virtual table with the same data files:
A | B | C | |
1 | for 30 | =”testData”+string(A1) | >output(B1) |
2 | =file(B1) | ||
3 | =B2.create() | =B3.count() |
Likewise, when memory runs out we can check the situation in B1:
So you see that a maximum of 19,000,000 records can be retrieved with a virtual table. This is several times more than the table sequence can handle. There’s no upper limit for the amount of data that can be handled if we use f.create@x(Fi,…;n), which establishes a buffer, when creating the table sequence.
Therefore a virtual table will do a better job when we are handling big data. We can create a virtual table for the data table that can’t be completely held in the memory and import the virtual table data as bytes into the memory. By doing so, both the accomplishment of the computation and a good efficiency can be guaranteed. If the virtual table data can’t be entirely loaded into the memory, we can set up an appropriate buffer to make it workable.