During data computing in esProc, raw data, intermediate data, and the final resulting data consume certain system resources. That includes the memory consumption, database connection usage and the space occupation by the file objects. Too much resource consumption may cause memory overflow, database access failure and other problems to the computation. It’s important during the computing process to timely remove useless raw data and intermediate data, close unwanted database connection and delete unnecessary file objects so that resources can be released.
1. Variables
esProc supports user-defined variables and using cell values as variables. Both types of variables occupy memory space with data stored in them. If they consume a large amount of memory by holding a big volume of data, computations would thus be affected, resulting in low execution speed and even memory overflow.
The following program is used to test the normal memory consumption by variables:
A | B | |
1 | $ select NAME from CITIES | 15000000 |
2 | =A1.(NAME) | =A2.len() |
3 | 2000-1-1 | =create(ID,Name,City,Birthday) |
4 | for B1 | >B3.insert(0,#A4,”N”/#A4,A2(rand(B2)+1),after@d(A3,-rand(10000))) |
5 | / | =create(ID,Name,City,Birthday) |
6 | for B1 | >B5.insert(0,#A6,”N”/#A6,A2(rand(B2)+1),after@d(A3,-rand(10000))) |
7 | / | =create(ID,Name,City,Birthday) |
8 | for B1 | >B7.insert(0,#A8,”N”/#A8,A2(rand(B2)+1),after@d(A3,-rand(10000))) |
9 | / | =create(ID,Name,City,Birthday) |
10 | for B1 | >B9.insert(0,#A10,”N”/#A10,A2(rand(B2)+1),after@d(A3,-rand(10000)) |
B3, B5, B7, B9 and B11 respectivley create a 4-field testing table sequence. Line 4,6,8,10 and 12 run loops to insert 15,000,000 records to each table sequence. The data types of the field values include integer, string and date. Each record will occupy space of about 100 bytes after it is objectified. A table sequence should be cleared from the cell used as the variable as soon as it is generated for the sake of smooth execution. With a memory space of 4G, momory overflow happens as B7 tries to geneate the third table sequence.
In actual execution, a table sequence, after it is generated, will be stored in a file and this step of computation is over. At that point it can be deleted from the cell, as shown below:
A | B | |
1 | $ select NAME from CITIES | 15000000 |
2 | =A1.(NAME) | =A2.len() |
3 | 2000-1-1 | =create(ID,Name,City,Birthday) |
4 | for B1 | >B3.insert(0,#A4,”N”/#A4,A2(rand(B2)+1),after@d(A3,-rand(10000))) |
5 | >B3=null | =create(ID,Name,City,Birthday) |
6 | for B1 | >B5.insert(0,#A6,”N”/#A6,A2(rand(B2)+1),after@d(A3,-rand(10000))) |
7 | >B5=null | =create(ID,Name,City,Birthday) |
8 | for B1 | >B7.insert(0,#A8,”N”/#A8,A2(rand(B2)+1),after@d(A3,-rand(10000))) |
9 | >B7=null | =create(ID,Name,City,Birthday) |
10 | for B1 | >B9.insert(0,#A10,”N”/#A10,A2(rand(B2)+1),after@d(A3,-rand(10000)) |
As this is just a test, generated table sequences won’t be exported but just be cleared in A5, A7 and A9 by setting the values of B3, B5 and B7 as nulls. By removing these table sequences, resources are released and memory overlow is avoided to pave the way for subsequent compuations. After the code is exected, the intermediate data in B3, B5 and B7 will have been cleared:
But if multiple variables store the same data object, the actual storage consumption will be very small and won’t burden the memory. For example:
A | B | |
1 | $ select NAME from CITIES | 15000000 |
2 | =A1.(NAME) | =A2.len() |
3 | 2000-1-1 | =create(ID,Name,City,Birthday) |
4 | for B1 | >B3.insert(0,#A4,”N”/#A4,A2(rand(B2)+1),after@d(A3,-rand(10000))) |
5 | =B3 | =B3 |
6 | =B3 | =B3 |
7 | >testTable=B3 | >testTable.to(-10).run(Name=”Change”) |
In this cellset program, all variables of B3, A5, B5, A6 and B6, as well as the cellset variable testTable contain a table sequence that has 15,000,000 records. But as they store the same data object there is no risk of memory overflow during the computation. When B7 modifies the last ten records stored in testTable, the table sequences stored in B3, A5, B5, A6 and B6 are also modified at the same time because they are actually the same object.
2. Database connections
Generally a database puts a limit on the number of connections to it. A number exceeding the limit will make the effort of establishing a new connection fail. Even if a database has no such restriction, too many connections will slow down the execution of database query and affect the efficiency.
In the following the MySQL database is used to make a test in which the maximum number of database connections is set as 10:
max_connections=10
The setting can be viewed in MySQL’s console:
max_connections, the maximum number of connections the database allows, has been modified. The default maximum number of connections MySQL allows is over 100.
Below is the esProc test code for adding database connections:
A | B | C | |
1 | connect(“mysql”) | ||
2 | for 20 | =”arg”/A2/”=”/B1 | =eval(B2) |
A2 runs a loop, and B2 composes a statement to establish a connection to MySQL database with each loop. With the first execution, B2 generates the following expression:
For each loop C2 executes B2’s expression with the eval function to establish a database connection and name it arg1, arg2, arg3,… The testing aims simply to establish the connections without querying data through them.
But soon an error arises:
The error appears because the number of connections exceeds the maximum number of connections. A view of the MySQL console tells a clear story:
The number of current connections, Threads_connected, is greater than the maximum number of connecitons, so an extra database connection fails. The database connection in esProc can’t close automatically to release resource unless the esProc program exits:
The remaining connection in Threads_connected is the console connection.
To completely remove the database connections, call the db.close() function. For example:
A | B | C | |
1 | connect(“mysql”) | ||
2 | for 20 | =”arg”/A2/”=”/B1 | =eval(B2) |
3 | =”arg”/A2/”.close()” | =eval(B3) |
In the above cellset code, the database connection established in C2 is closed in C3. This way the program can be executed normally. The connection status can be viewd from the MySQL console:
Therefore, the database connection established in esProc should be closed using the db.close() function as soon as it has done its job so that resource can be released timely. Alternatively, you can effectively control the number of database connections by using the default connection to query the database.
3. Cursors
Cursors are commonly used in esProc to handle big data through batch processing.
Within one database connection, only one database cursor is allowed, as the following program shows:
A | B | C | |
1 | =connect(“mysql”) | =A1.cursor(“select * from States”) | =A1.cursor(“select * from States”) |
2 | >A1.close() |
In this cellset program, an error is reported when C1 tries to create a second cursor:
Similar to a database connection, a database cursor should be specifically closed using the cs.close() function to release the database resource it occupies after it completes its mission. For example:
A | B | C | |
1 | =connect(“mysql”) | ||
2 | =A1.cursor(“select * from States”) | =A2.fetch(5) | >A2.close() |
3 | =A1.cursor(“select * from States”) | =A3.fetch(5) | >A3.close() |
4 | >A1.close() |
Since A2’s cursor is closed in C2 after desired data is fetched from it, A3’s cursor is able to work normally.
Note that when a cursor’s data is entirely fetched out, the cursor will close automatically. For example:
A | B | C | |
1 | =connect(“mysql”) | ||
2 | =A1.cursor(“select * from States”) | =A2.fetch() | |
3 | =A1.cursor(“select * from States”) | =A3.fetch() | |
4 | >A1.close() |
This cellset program can be executed normally because B2 will fetch all data from the cursor with the fetch function and then A2’s cursor will close automatically.
However, there’s no such a limit on database cursors as on file cursors. esProc allows multiple cursors created based on one file and data retrieval from them won’t affect each other. For example:
A | B | C | |
1 | =file(“PersonnelInfo”) | ||
2 | =A1.cursor@b() | =A2.fetch(5) | |
3 | =A1.cursor@b() | =A3.fetch(5) |
Since a file cursor uses very small resource, it is unnecessary to close it during the computation.
But if new data is written to a file, then the multiple cursors generated based on it will have mutual interference between them. More information about file objects will be covered in the next section.
4. File objects
According to the previous section, there’s no need to close a file cursor because a file object won’t consume database connection resource as a database cursor does. Besides, multiple processes are allowed to retrieve data from the file object. But when additional data is written into the file, the result of data retrieval in a same way will be different. For example:
A | B | C | |
1 | =file(“test”) | =demo.query(“select * from EMPLOYEE”).(NAME/” “/SURNAME) | =create(ID,Name) |
2 | for 10000 | =rand(500)+1 | >C1.insert(0,A2,B1(B2)) |
3 | >A1.export@z(C1) | =A1.import@b(;2:5) | |
4 | for 10000 | =rand(500)+1 | >C1.insert(0,A4+10000,B1(B4)) |
5 | >A1.export@z(C1) | =A1.import@b(;2:5) |
A3 and A5 respectively export data to a binary file test. Both B3 and B5 import the second segment of the same file by dividing the file into five segments, but B5’s result is different from that of B3 due to the updating of the data in the file:
It’s not necessary to close a file object during the computation as it consumes very small resource. The resource it uses will be freed after the cellset program exits.
It’s a different situation when a file cursor is used:
A | B | C | |
1 | =file(“test”) | =demo.query(“select * from EMPLOYEE”).(NAME/” “/SURNAME) | =create(ID,Name) |
2 | for 10000 | =rand(500)+1 | >C1.insert(0,A2,B1(B2)) |
3 | >A1.export@z(C1) | =A1.cursor@b(;2:5) | =A1.cursor@b(;2:5) |
4 | =B3.fetch(10) | =C3.fetch(10) | |
5 | for 10000 | =rand(500)+1 | >C1.insert(0,A4+10000,B1(B4)) |
6 | >A1.export@z(C1) | =A1.import@b() |
B3 and C3 respectively create a cursor to retrieve data from the file. With open file cursors, the source file refuses data to be written into it with the export function. After the code in line 5 is executed, the number of records in C1’s table sequence increases to 20,000. But with open cursors in B3 and C3, A6’s export statement can’t actually be executed. So the number of all records imported from the file by B6 is still 10,000.
5. Resource release in cellset invocation
Besides executing a single dfx file, esProc also supports calling another cellset program using call function or callx function. Here involves the rules of resource release during the invocation of the cellset program.
Below is a cellset progam CountSub.dfx:
A | B | |
1 | $ select NAME from CITIES | 15000000 |
2 | =A1.(NAME) | =A2.len() |
3 | 2000-1-1 | =create(ID,Name,City,Birthday) |
4 | for B1 | >B3.insert(0,#A4,”N”/#A4,A2(rand(B2)+1),after@d(A3,-rand(10000))) |
5 | =B3.groups(year(Birthday):Year;count(~):Count) | return A5 |
The first four lines create a table sequence containing 15,000,000 records and store it in cell B3, the variable. A5 calculates the number of people born in each year based on B3’s table sequence. B5 returns the aggregate result to the main program, without first clearing the table sequence from B3.
This cellset program is then called in the main program:
A | |
1 | =10.(call(“CountSub.dfx”)) |
2 | =A1.conj().groups(Year;sum(Count):Count) |
A1 calls the subroutine CountSub.dfx ten times. A2 concatenates the aggregate results to get the count of the people born in each year for the 150,000,000 random records. Here are results of A1 and A2:
The main program runs smoothly without memory overflow. This means the variable is cleared from the cellset program as soon as the invocation of it is completed. For an executed subroutine, its variable will be regarded as useless and automatically cleared to avoid memory consumption, without specifically setting the variable as null. In the meantime, the data returned to the main program from the subroutine is still available. But if the subroutine CountSub.dfx returns the B3’s source table sequence instead of A5’s aggregate result, then the memory will be unable to hold so much data, leading to overflow.
Here’s another cellset program QuerySub.dfx:
A | |
1 | =connect(“mysql”) |
2 | =A1.query(“select * from States where left(Capital,1) = ?”,arg1) |
3 | return A2 |
A2 establishes a connection to MySQL database for querying records of states whose capitals begin with the specified letter. The program uses a parameter:
It is then called by the main program:
A | |
1 | =20.(call(“QuerySub.dfx”,char(64+#))) |
2 | =A1.conj() |
Before the invocation, you can view the conncetion status through MySQL console:
An error occurs during the execution of the main program:
Because the current number of connections exceeds the maximum number allowed, as can be seen from the console:
During the invocation of the cellset program, the database connections in it won’t automatically close. To solve the issue, make some modification on QuerySub.dfx, where the database connections will be closed after the query is accomplished.
A | |
1 | =connect(“mysql”) |
2 | =A1.query(“select * from States where left(Capital,1) = ?”,arg1) |
3 | >A1.close() |
4 | return A2 |
Now restart MySQL database to execute the main program. This time the computation performs well. Here’s A2’s result in the main program:
Below is the connection status that MySQL console displays:
Since the database connections in the subroutine are closed once they’ve done their jobs, Max_used_connections only increases by 1 while 20 database connections are added during the execution of the main program.
Apart from closing database connections timely in the subroutine, an efficient way of using database connection is through default database connection, like the the database configured in the data source manager and the one connected by default on the server-side.
The following subroutine DefQuerySub.dfx queries data through a default database:
A | |
1 | =mysql.query(“select * from States where left(Capital,1) = ?”,arg1) |
2 | return A1 |
A1 queries the default MySQL database to get the records where the capitals begin with the specified letter. A2 retruns the result to the main program:
The subroutine is then called by the main program:
A | |
1 | =20.(call(“DefQuerySub.dfx”,char(64+#))) |
2 | =A1.conj() |
In this way the main program works as well and gets the same result:
Here’s the connection status on the MySQL console:
Compared with the previous connection status, only 1 connection is added and Threads_connected is 2 because the default conncetion to MySQL is still on. Therefore the dabase connection can be effectively used with the default connection during a query, reducing the uncessary resource consumption.
The following subroutine QueryOneSub.dfx uses a cursor to query data but doesn’t close it after the query finishes:
A | |
1 | =mysql.cursor(“select * from States where left(Capital,1) = ?”,arg1) |
2 | =A1.fetch(1) |
3 | return A2 |
The subroutine is called in the main program as follows:
A | |
1 | =20.(call(“QueryOneSub.dfx”,char(64+#))) |
2 | =A1.conj() |
But an error is reported during the execution:
This is because each database connection only allows one database cursor, which causes resource consumption problem if it isn’t closed in time.
The problem can be sovled by modifying the subroutine QueryOneSub.dfx:
A | |
1 | =mysql.cursor(“select * from States where left(Capital,1) = ?”,arg1) |
2 | =A1.fetch(1) |
3 | >A1.close() |
4 | return A2 |
Alternatively, you can make the subroutine return a database cursor directly but let the main program conroll the query, as shown by the cellset program QueryCursorSub.dfx:
A | |
1 | =mysql.cursor(“select * from States where left(Capital,1) = ?”,arg1) |
2 | return A1 |
It is called in the main program like this:
A | B | C | |
1 | [] | ||
2 | for 20 | =call(“testCellValueMemSub4.dfx”,char(64+A2)) | |
3 | >A1=A1|B2.fetch(1) | >B2.close() |
A2 runs a loop in which for each circle B2 obtains the cursor returned from the subroutine, B3 feches the first-found record from the cursor and C3 closes the cursor. With an open cursor that uses certain resource, error will arise when the subroutine is called next time to generate a cursor.
A main program can operate normally if it closes the cursor each time the data retrievel is accomplished. After execution A1’s result is as follows:
There are some differences between the uses of file objects and those of the database connection and the cursor. A file object occupies very small memory space and won’t consume any database resource, so it’s no need to close it to release system resource. But, similar to a database cursor, a file cursor should be closed after it has done its job; otherwise the export of data to it could fail.