With esProc, you can handle the structured big data easily, and perform data computing and analysis without difficulty. Because the big data table cannot be loaded into the memory all at once, the basic solution is using cursor to handle it.
1. Basic principle of cursor
The esProc cursor is like the database cursor in the stored procedure: In fact, the cursor is a pointer that points to a row of data. Records can thus be retrieved one by one according to its position, yet will not be totally returned. Through the cursor, the data retrieval can be completed progressively to ensure the system memory won’t become overburdened.
In order to facilitate the usage, esProc cursor does not need to access records one by one. However, esProc also makes rules for the use of cursor. In esProc cursor, moving backward is forbidden, only moving forward is allowed, which is similar to TYPE_FORWARD_ONLY type of result set in JDBC. In this way, esProc cursor can only traverse the records in a table sequence once, and the related computations will be completed in this traversal. This differs from the computational mode of normal table sequences. Once a traversal is completed, the cursor will close automatically and become invalid. So, records can no longer be fetched from it anymore.
In esProc,a cursor can be generated from a big database table or a big data file. Sometimes, you can generate a cursor using the existing TSeq in the memory for the computation. For example:
A | |
1 | =file(“PersonnelInfo.txt”) |
2 | $select * from CITIES |
3 | =A1.cursor() |
4 | =demo.cursor(“select * from EMPLOYEE”) |
5 | =A2.cursor() |
In which, A3 holds a file cursor, A4 holds a database cursor, and A5 holds an in-memory record sequence cursor.
When using the cursor, the function cs.fetch() can be used to retrieve one or multiple consecutive rows according to the position(s) specified by the cursor for computing or debugging. You can see the article esProc External Memory Computing: Concept of Cursor to learn different types of cursor and their differences.
2. Inheritance feature of the cursor
When using cursor to handle data computation, the result returned by cursor is usually the records from a data table or result set. However, the data in cursor is usually not the desired final result, and needs reprocessing.
When handling the query result from the cursor, you may need to perform filtering, summarizing, sorting, and other jobs, sometimes even data combining or merging for multiple cursors. To do these jobs, in esProc, you can directly use a cursor to generate a new one, yet with no need to execute data query. The data retrieval will be performed when you actually fetch the data. For example, to complete a certain data processing job, use cursor A to generate new cursor B. Cursor B will use A to retrieve data, process the retrieved data and return the result. Based on cursor B, you can generate cursor C as necessary, and further process the data in B……For join and merge operations, a new cursor can be generated from multiple cursors.
In the mode of cascaded inheritances, the complex data processing job can be divided into multiple steps to accomplish, with a specific purpose for each step. Each cursor retrieves data through the original cursor when doing the processing. So, data is fetched actually from the original cursor for each data computation. Such mode ensures that the original data can only be traversed once. This avoids data conflict as well as reduces the frequency of database access while keeping a high computational efficiency.
The cascaded inheritance approach processes data in the cursor without extra operations. In addition, each stage of cursor object only takes very limited amount of memory. So, this mode will not compromise the performance of accessing the external storage, and has little impact on the computational performance by adding more inheritance stages.
3. Basic cursor computations
3.1. Basic usage of cursor
Once the cursor is created, the cs.fetch() function can be used to retrieve data from data table. You can also use cs.skip() function to skip multiple rows . To close the cursor before all data is traversed, you can use cs.close(). For example:
A | B | C | D | |
1 | =file(“PersonnelInfo.txt”) | |||
2 | =A1.cursor@t() | |||
3 | for | =A2.fetch(1) | ||
4 | if B3==null | break | ||
5 | else | =age(B3.Birthday) | ||
6 | if C5>40 | break | ||
7 | >A2.close() | |||
8 | =if(B3==null,null,B3.Name) |
This cellset is designed to find the name of the first employee who is above 40. Once the file cursor is generated in A2, run a loop in A3, and retrieve one employee record from B3 each time. If all records are traversed with the cursor, then the loop will break in C4; otherwise C6 will judge if the employee is over 40. If the condition is met, then exit the loop and close the cursor. In A8, you will see the result:
Usually a cursor retrieves multiple records at one time:
A | B | C | |
1 | =file(“PersonnelInfo.txt”) | ||
2 | =A1.cursor@t() | 0 | |
3 | for | =A2.fetch(100) | |
4 | if B3==null | break | |
5 | else | =B3.count(age(B3.Birthday)>40) | |
6 | >B2=B2+C5 | ||
7 | =B2 |
In the above program, all data in the text file cursor in A2 will be traversed, retrieving 100 rows each time and computing the total number of employees above 40. The final result can be viewed in A7:
Because the cursor automatically close when all data is traversed, there is no need to use cs.close(). Please note that the cs.fetch() returns a table sequence each time, instead of returing an RSeq composed of records from a same TSeq.
3.2. Filtering cursor
When you need to filter the desired data out from the cursor, you can use cs.select() function . Thus, the above problem can be solved with the following method:
A | B | C | |
1 | =file(“PersonnelInfo.txt”) | ||
2 | =A1.cursor@t() | 0 | |
3 | =A2.select(age(Birthday)>40) | ||
4 | for | =A3.fetch(100) | |
5 | if B4==null | break | |
6 | else | >B2=B2+B4.len() | |
7 | =B2 |
The final result in A7 is the same as above:
Because generally the result of filtering big data in the cursor is also the big data, cs.select() will return a cursor too. Further processing of the returned big data often includes aggregate operation and others. If there is small amount of data in the result, then all can be returned at once. For example, find female employees from California who are above 40:
A | |
1 | =file(“PersonnelInfo.txt”) |
2 | =A1.cursor@t() |
3 | =A2.select(age(Birthday)>40) |
4 | =A3.select(State==”CA”) |
5 | =A4.select(Gender==”F”) |
6 | =A5.fetch() |
In A6, you will see the results:
In the computation, multiple inheritance of cursor is used for filtering step by step. Data can be retrieved from text data table and computed to get the required result only if they are retrieved in A6. This method almost will not affect performance.
3.3. Creating and modifying cursor
Sometimes, the data retrieved from big data table is not the final result you need, and further computation is required to get it. In such cases, you can use the cs.new() function to generate a new cursor:
A | |
1 | =file(“PersonnelInfo.txt”) |
2 | =A1.cursor@t() |
3 | =A2.new(ID,Name,age(Birthday):Age,Gender,State) |
4 | =A3.select(Age>40) |
5 | =A4.select(State==”CA”) |
6 | =A5.select(Gender==”F”) |
7 | =A6.fetch() |
Likewise, data will be fetched from text data table at A7’s instruction and computed to get the required data. The result of A7 is shown below:
Besides the cs.new() for generating new data, cs.run() function can be used to modify the fields of records directly. For example:
A | |
1 | =file(“PersonnelInfo.txt”) |
2 | =A1.cursor@t() |
3 | =A2.new(ID,Name,age(Birthday):Age,Gender,State) |
4 | =A3.select(Age>40) |
5 | =A4.select(State==”CA”) |
6 | =A5.select(Gender==”F”) |
7 | =A6.run(Gender=”Female”,State=”California”) |
8 | =A7.fetch() |
The result in A8 is shown below:
3.4. Foreign key in cursor
Sometimes, the big data table requires foreign keys to relate to data in other tables. In this case, you can use cs.switch() function, for example:
A | |
1 | =file(“PersonnelInfo.txt”) |
2 | =A1.cursor@t() |
3 | =A2.select(age(Birthday)>40) |
4 | =A3.select(left(Name, 1)==”W”) |
5 | =A4.select(Gender==”F”) |
6 | $select ABBR, NAME, CAPITAL from STATES |
7 | =A5.switch(State, A6) |
8 | =A7.fetch() |
In this example, find the profiles for all female employees whose name initials are W and above 40. A7 sets the State field as the foreign key, and references the records of table sequence in A6.The result of A8 is shown below: