esProc Parallel Computing: The Remote Cursor

Parallel computing handles big data by distributing a computation across multiple nodes. If the data produced on a node is itself too large to return in one piece, the node can hand it back as a remote cursor, and the main program then fetches the data through that cursor. Here we’ll learn how to use the remote cursor and what its features are.

1. The usage

The remote cursor uses the same servers as ordinary cluster computation. See esProc Parallel Computing: The Server for details on running and configuring them. Once the servers are started, a subtask can return a cursor to the main program with the return statement during parallel computation.

For example, the following StockCursor.dfx reads the stock record file of a given year and returns it as a cursor:

  | A | B
1 | >output@t("calc begin, year: "+string(arg1)+". ") | =file("StockRecord"+string(arg1)+".txt")
2 | if !(B1.exists()) | end "File of year "+string(arg1)+" is not found!"
3 | =B1.cursor@t() | >output@t("calc finish, year: "+string(arg1)+". ")
4 | return A3 |

The program first checks whether the data file exists on the current server. If it does not, the end statement returns an error message so that the main program can redistribute the subtask to another server; if it does, the file cursor is returned. In the cellset program, arg1 is the parameter that passes in the year whose data is to be processed.

[Figure esProc_parallel_remote_cursor_1]

In this example, three servers run on two computers, and the stock records of different years are saved in separate data files distributed as follows: Server I (192.168.0.86:4001) stores the data of 2010 and 2014; Server II (192.168.0.66:4001) stores the data of 2011 and 2013; and Server III (192.168.0.86:4004) stores the data of 2012. The number of parallel tasks (callxTask) set for the three servers is 2, 1 and 4 respectively, while the number of machines for parallel processing (nodeTask) is 4.

The main program calls the cellset file as follows:

  A
1 [192.168.0.86:4001,192.168.0.66:4001,192.168.0.86:4004]
2 [2010,2011,2012,2013,2014]
3 =callx("StockCursor.dfx",A2;A1)
4 =A3.conj@x()
5 =A4.select(SID==124051)
6 =A5.fetch()

In this cellset, A3 executes parallel computation via callx to run StockCursor.dfx on the servers. The file must exist in the mainPath configured for each server. Since every subtask returns a cursor, the result of A3 will be a sequence of remote cursors after the whole computation is completed:

[Figure esProc_parallel_remote_cursor_2: the sequence of remote cursors in A3]

A4 concatenates the cursors in the sequence with conj@x, so that data is fetched from each cursor in turn. A5 filters the concatenated cursor, keeping only the records of stock 124051. A6 fetches the data, and the result is as follows:

[Figure esProc_parallel_remote_cursor_3: the result of A6]

As the result shows, the data of the different years is concatenated in order.
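
The concatenate-then-filter pipeline can be pictured with ordinary Python generators. This is only an illustrative sketch, not esProc code: year_cursor, concat_and_select and the SAMPLE rows are hypothetical stand-ins for the remote cursors, for conj@x followed by select, and for the stock files.

```python
from itertools import chain

# Made-up sample rows (SID, price); real data would come from the
# StockRecord<year>.txt files on the servers.
SAMPLE = {
    2010: [(124051, 10.0), (600000, 8.5)],
    2011: [(600000, 9.1), (124051, 11.2)],
    2012: [(124051, 12.4)],
}

def year_cursor(year):
    """Hypothetical stand-in for one remote cursor: yields records lazily."""
    for sid, price in SAMPLE[year]:
        yield {"year": year, "SID": sid, "price": price}

def concat_and_select(cursors, sid):
    """Chain the cursors end to end, then keep only the records of one stock."""
    return (rec for rec in chain.from_iterable(cursors) if rec["SID"] == sid)

cursors = [year_cursor(y) for y in sorted(SAMPLE)]
pipeline = concat_and_select(cursors, 124051)  # still lazy: nothing read yet
result = list(pipeline)                        # plays the role of fetch()
print([rec["year"] for rec in result])         # [2010, 2011, 2012]
```

As in the cellset, no record is read until the final fetch step, and the years come out in the order in which the cursors were concatenated.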

Then we’ll look at the information output from each server.

[Figure esProc_parallel_remote_cursor_4: the information output from each server]

Subtasks are assigned to servers in the order of their parameters. If a server is already running as many subtasks as its callxTask allows, the next server in the list is tried. If the data file that StockCursor.dfx needs for a given year does not exist on the server, the message is returned to the main program through the end statement and the subtask is redistributed to another server. The output also shows that a subtask counts as completed as soon as its remote cursor is returned to the main program, so each subtask finishes quickly even though no data has actually been fetched through the cursor yet.
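
To make the redistribution idea concrete, here is a rough Python simulation. It is not esProc’s actual scheduler: the round-based loop, the distribute function and the dictionary layout are assumptions made for illustration, and only the server addresses, callxTask values and file locations are taken from the example above.

```python
from collections import deque

def distribute(tasks, servers):
    """Simulate assigning each task to a server that holds its data file.

    Each round, every server offers `callxTask` free slots. A task goes to the
    first server it has not tried yet that still has a free slot; if the file
    is missing there, the task is queued again for another server.
    """
    placed = {}
    tried = {task: set() for task in tasks}
    pending = deque(tasks)
    while pending:
        running = {s["name"]: 0 for s in servers}  # new round: all slots free
        retry = deque()
        for task in pending:
            untried = [s for s in servers if s["name"] not in tried[task]]
            if not untried:
                raise LookupError(f"no server holds the data for task {task}")
            target = next(
                (s for s in untried if running[s["name"]] < s["callxTask"]), None
            )
            if target is None:          # every untried server is busy: wait
                retry.append(task)
                continue
            running[target["name"]] += 1
            tried[task].add(target["name"])
            if task in target["years"]:  # file found: a cursor is returned
                placed[task] = target["name"]
            else:                        # file missing: redistribute
                retry.append(task)
        pending = retry
    return placed

SERVERS = [
    {"name": "192.168.0.86:4001", "callxTask": 2, "years": {2010, 2014}},
    {"name": "192.168.0.66:4001", "callxTask": 1, "years": {2011, 2013}},
    {"name": "192.168.0.86:4004", "callxTask": 4, "years": {2012}},
]

placement = distribute([2010, 2011, 2012, 2013, 2014], SERVERS)
for year, server in sorted(placement.items()):
    print(year, "->", server)
```

Whatever order the retries happen in, each year can only end up on the one server that stores its file, which is why several subtasks are tried on more than one server before they succeed.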

To avoid sending a subtask to a server that does not hold the required data file, we can use callx@a to specify strictly which server each subtask should use, just as in ordinary parallel computation. Concrete examples can be found in esProc Parallel Computing: Cluster Computing.

While the main program executes the fetch statement, the remote cursors remain active on the servers: data is fetched from each cursor there and returned to the main program. Once all the data of a cursor has been fetched, the remote cursor is closed automatically.

2. The auto-close

In the main program, cursors may be generated without their data being fetched, or without it being fetched completely, as the following cellset shows:

  A
1 [192.168.0.86:4001,192.168.0.66:4001,192.168.0.86:4004]
2 [2010,2011,2012,2013,2014]
3 =callx("StockCursor.dfx",A2;A1)
4 =A3.conj@x()
5 =A4.select(SID==124051)
6 /=A5.fetch()

Here A6’s fetch statement is commented out, so the program only generates the cursor in A5. In this case the information output from each server is as follows:

[Figure esProc_parallel_remote_cursor_5: the information output from each server]

We can see that a remote cursor closes itself if it is not accessed for a long time after it is created. The expiration time of remote cursors is set with ProxyTimeOut in the server’s configuration file unit.xml. If the expiration time is zero or has not been set, no expiration check is performed. The related setting is introduced in esProc Parallel Computing: The Server.

After a remote cursor has closed because of a time-out, fetching data from it results in an error. Because the program treats the return of a cursor to the main program as the completion of the subtask, the corresponding subtask is not run again when its cursor closes this way. To avoid the problem, set the expiration time appropriately for the workload.
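
The expiry rule can be modelled in a few lines of Python. This is a simplified sketch, not the server’s implementation: ExpiringCursor is a hypothetical class, the expiry is checked lazily on each fetch rather than by the server itself, and the clock is injected so that the example does not have to wait in real time.

```python
from itertools import islice

class ExpiringCursor:
    """Toy cursor that refuses access after being idle longer than `timeout`.

    A timeout of 0 disables the check, mirroring the behaviour described for
    an unset or zero expiration time.
    """

    def __init__(self, rows, timeout, clock):
        self._rows = iter(rows)
        self._timeout = timeout
        self._clock = clock
        self._last_access = clock()
        self.closed = False

    def fetch(self, n):
        idle = self._clock() - self._last_access
        if self._timeout and idle > self._timeout:
            self.closed = True           # expired while nobody was fetching
        if self.closed:
            raise RuntimeError("cursor is closed")
        self._last_access = self._clock()
        batch = list(islice(self._rows, n))
        if len(batch) < n:
            self.closed = True           # all data fetched: close automatically
        return batch

now = [0.0]                              # fake clock, in seconds
cursor = ExpiringCursor(range(5), timeout=10, clock=lambda: now[0])
print(cursor.fetch(2))                   # [0, 1]
now[0] = 60.0                            # idle for longer than the timeout
try:
    cursor.fetch(2)
except RuntimeError as err:
    print("fetch failed:", err)          # fetch failed: cursor is closed
```

The second fetch fails even though three rows remain unread, which is the situation to avoid by choosing a long enough expiration time.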

3. Merge remote cursors

Parallel computation with the remote cursor returns a sequence of remote cursors, which the main program then combines. In the previous example the remote cursors returned from the servers were concatenated in order, but they can also be merged.

The following StockCursor2.dfx file retrieves the stock records of a given year and sorts them by stock number SID before returning a cursor:

  | A | B
1 | >output@t("calc begin, year: "+string(arg1)+". ") | =file("StockRecord"+string(arg1)+".txt")
2 | if !(B1.exists()) | end "File of year "+string(arg1)+" is not found!"
3 | =B1.cursor@t() | >output@t("calc finish, year: "+string(arg1)+". ")
4 | return A3.sortx(SID) |

A remote cursor can be of various types, such as a file cursor, a database cursor or an in-memory record sequence cursor. The remote cursor that A4 returns to the main program is actually the temporary file cursor generated during the execution of sortx.

In the main program, the cursor data of every year is merged in order of SID:

  A
1 [192.168.0.86:4001,192.168.0.66:4001,192.168.0.86:4004]
2 [2010,2011,2012,2013,2014]
3 =callx("StockCursor2.dfx",A2;A1)
4 =A3.merge@x(SID)
5 =A4.fetch(;SID)
6 >A4.close()

Note that the data in every cursor of the sequence must already be sorted before the cursors are merged; otherwise an error will occur. A5 fetches only the transaction data of the first stock across the five years:

[Figure esProc_parallel_remote_cursor_6: the result of A5]

Because the data in the remote cursor has not been completely fetched, A6 calls close to close it. If it is not closed explicitly, it closes automatically when its time expires.
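
The difference between merging and concatenating can be shown with Python’s standard library. Again, this is a sketch under assumptions rather than esProc code: sorted_year_cursor and the SAMPLE rows are made up, heapq.merge stands in for merge@x(SID), and taking the first group from groupby stands in for fetch(;SID).

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Made-up sample rows (SID, price) per year.
SAMPLE = {
    2010: [(600000, 8.5), (124051, 10.0)],
    2011: [(124051, 11.2), (600000, 9.1)],
    2012: [(300001, 7.7), (124051, 12.4)],
}

def sorted_year_cursor(year):
    """Hypothetical remote cursor whose records are already sorted by SID."""
    for sid, price in sorted(SAMPLE[year]):
        yield {"SID": sid, "year": year, "price": price}

cursors = [sorted_year_cursor(y) for y in sorted(SAMPLE)]

# Merge the sorted inputs lazily into one stream ordered by SID.
merged = heapq.merge(*cursors, key=itemgetter("SID"))

# Read only the first run of records that share the same SID.
first_sid, group = next(groupby(merged, key=itemgetter("SID")))
first_stock = list(group)
print(first_sid, len(first_stock))  # 124051 3
```

Because every input is sorted by SID, the merge only has to compare the current head record of each cursor, and the records of the first stock from all years can be read without consuming the rest of the data.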
