How esProc Assists Java to Query Big Text Files

Sometimes you need to query a big text file rather than a database. The file has to be read as a stream while the query is applied, and parallel processing is needed to get acceptable performance. Java has no class library for this kind of structured-data computing, so the processing must be hand-coded, and the resulting code tends to be complicated, hard to read, and inefficient at parallel processing.

esProc (a free edition is now available) can make up for what Java lacks. It encapsulates a rich set of functions for reading and writing structured data and for working with cursors, and it handles parallel processing with simple code. It also provides an easy-to-use JDBC interface: the Java application can call an esProc script as if it were a database procedure, pass parameters to it, and get the result set back via JDBC. For more details, see How to Use esProc as the Class Library for Java.

Here’s an example of how esProc helps Java query a big text file. Below is the source data:

esProc_java_bigtextfile_1

To query the orders whose dates are between startDate and endDate and whose amounts are greater than argAmt, use the following code:

	A
1	=file("D:\\sOrder.txt").cursor@t()
2	=A1.select(OrderDate>=startDate && OrderDate<=endDate && Amount>argAmt)
3	=A2.fetch()

The cursor function opens the file as a cursor; the @t option reads the first row as column names. The select function then applies the structured query, and fetch loads the query result into memory, which is appropriate when the result is not big. The result is as follows:

esProc_java_bigtextfile_3
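
For comparison, the same filter can be hand-coded in plain Java. The following is a minimal sketch, not something esProc generates. It assumes a tab-separated file whose header row names the OrderDate and Amount columns and whose dates are written as yyyy-MM-dd. The class name OrderFilter and the method query are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of esProc: reads a tab-separated file
// line by line and keeps only the rows that satisfy the query.
class OrderFilter {
    static List<String[]> query(Path file, LocalDate startDate, LocalDate endDate, double argAmt) {
        List<String[]> result = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return result; // empty file
            }
            // The first row holds the column names (the role of @t in the script).
            List<String> header = Arrays.asList(headerLine.split("\t"));
            int dateCol = header.indexOf("OrderDate");   // assumed column name
            int amountCol = header.indexOf("Amount");    // assumed column name
            String line;
            while ((line = reader.readLine()) != null) {
                String[] row = line.split("\t");
                LocalDate date = LocalDate.parse(row[dateCol]);      // assumes yyyy-MM-dd
                double amount = Double.parseDouble(row[amountCol]);
                boolean inRange = !date.isBefore(startDate) && !date.isAfter(endDate);
                if (inRange && amount > argAmt) {
                    result.add(row); // only matching rows are kept in memory
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }
}
```

Even this single-threaded version has to deal with the header row, type conversion, and resource handling by hand, which is the work the three-cell script hides.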

If the query result is too big for memory, the esProc script can return a cursor directly (that is, delete the code in A3). The Java application can then fetch data from the returned cursor as a stream via JDBC.

esProc also supports multithreaded parallel processing. The simplest way is to add the @m option to the cursor function in the preceding code; the option makes esProc read the file with multiple threads.

Alternatively, you can segment the file manually so that multiple threads are used for both data retrieval and computation. The code is as follows:

	A
1	=8.(file("D:\\sOrder.txt").cursor@zt(;,~:8))
2	=A1.(~.select(OrderDate>=startDate && OrderDate<=endDate && Amount>argAmt))
3	=A2.conj@xm()

The script opens the file with 8 cursors, each reading one part of the file. The @z option divides the file into roughly equal segments by bytes and reads one of them; ~:8 selects the current segment number out of 8. esProc automatically skips the incomplete head row and completes the tail row, so every row is read in full.
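
The head-row and tail-row handling can be illustrated with a small Java sketch. It is not esProc's implementation, only one common way to get the same effect: a row belongs to the segment in which its first byte falls. The class SegmentReader and the method readSegment are hypothetical names, and the sketch assumes a single-byte-compatible encoding and a file with a header row.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of esProc.
class SegmentReader {
    // Returns the complete data rows owned by segment k (1-based) of n.
    static List<String> readSegment(String path, int k, int n) {
        List<String> rows = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long size = raf.length();
            long start = size * (k - 1) / n;   // first byte of this segment
            long end = size * k / n;           // first byte of the next segment
            if (start == 0) {
                raf.readLine();                // skip the header row
            } else {
                // Step back one byte so that a row starting exactly at
                // 'start' is kept, then discard the partial head row.
                raf.seek(start - 1);
                raf.readLine();
            }
            // Read every row that starts before 'end'. The last one may run
            // past 'end', which completes the tail row.
            while (raf.getFilePointer() < end) {
                String row = raf.readLine();
                if (row == null) {
                    break;                     // end of file
                }
                rows.add(row);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }
}
```

Calling readSegment(path, k, 8) for k from 1 to 8 and concatenating the lists yields every data row exactly once. RandomAccessFile.readLine is unbuffered, so production code would add buffering.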

The conj function merges the results. @x indicates that the objects being merged are cursors; @m means the computation is performed in parallel. Note that the function does not guarantee that records in the result set appear in the same order as in the original file.

The preceding code uses esProc's built-in functions for parallel processing. If the algorithm is complicated and there is enough memory to hold the result set, it is better to use an explicit parallel statement. The code is as follows:

	A	B
1	=8
2	fork to(A1)	=file("d:\\sOrder.txt").cursor@zt(;,A2:A1)
3		result B2.select(OrderDate>=startDate && OrderDate<=endDate && Amount>argAmt).fetch()
4	=A2.conj()

The script uses 8 parallel threads to read and process the big file, and each thread returns its result to the main program when its query is done. The fork statement starts these threads. Its code block is B2-B3: inside the block, A2 holds the thread's entry parameter, and outside the block, A2 holds the results of all the threads.
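
The fork-then-concatenate pattern has a rough Java counterpart. It is sketched here with the standard ExecutorService API; this is a swapped-in technique for illustration and says nothing about how esProc implements fork. The class ForkJoinSketch and the method forkAndConcat are hypothetical names, and the per-thread work is passed in as a function that receives the thread number (1 to n), like A2 inside the fork block.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntFunction;

// Hypothetical helper, not part of esProc.
class ForkJoinSketch {
    // Runs task.apply(1..n) on n threads and concatenates the partial
    // results in task order.
    static <T> List<T> forkAndConcat(int n, IntFunction<List<T>> task) {
        ExecutorService pool = Executors.newFixedThreadPool(n);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (int i = 1; i <= n; i++) {
                final int part = i;                        // entry parameter of this thread
                Callable<List<T>> job = () -> task.apply(part);
                futures.add(pool.submit(job));
            }
            List<T> all = new ArrayList<>();
            for (Future<List<T>> f : futures) {
                all.addAll(f.get());                       // wait for each thread, then append
            }
            return all;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Each task would typically read and filter one segment of the file. Because every partial result is held in memory before concatenation, this pattern, like the fetch call in B3, only suits result sets that fit in memory.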

For ordered data, you can use binary search to improve query performance. For example, suppose the data has been sorted by Client and OrderID, and you need to find the record that matches the parameters argClient and argOrder. To do this, use the following code:

	A	B	C
1	=file(":\\sOrder.txt")	=[argClient,argOrder]
2	=begin=0	=end=A1.size()
3	for begin<end-1	=m=(begin+end)\2
4		=A1.cursor(;,m).fetch@x(1)	=[B4.Client,B4.OrderID]
5		if B1==C4	=B4
6			break
7		elseif B1>C4	=begin=m
8		else	=end=m
9	result C5

begin and end are the beginning and ending positions for the binary search, and m is the middle position.

The script locates the middle position by bytes and reads one record there with a cursor. esProc automatically skips the incomplete head row and completes the tail row so that a complete row is read. If the record's key equals the search key, the record is stored in C5 and the loop ends. Otherwise, the two keys are compared as sequences and begin or end is reset accordingly.
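
The same byte-offset binary search can be sketched in Java as follows. This is an illustration under assumptions, not esProc's implementation. It assumes a tab-separated file with a header row, Client and OrderID as the first two columns, ASCII keys, and rows sorted by Client (in String.compareTo order) and then by numeric OrderID. The class SortedFileSearch and the method find are hypothetical names.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of esProc.
class SortedFileSearch {
    // Returns the matching row, or null if no row has the given key.
    static String find(String path, String argClient, int argOrder) {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long begin = 0;
            long end = raf.length();
            while (begin < end - 1) {
                long m = (begin + end) / 2;       // middle position in bytes
                raf.seek(m);
                raf.readLine();                   // skip the partial head row (or the header)
                String row = raf.readLine();      // first complete row after m
                int cmp;
                if (row == null) {
                    cmp = -1;                     // past the last row: search the left half
                } else {
                    String[] fields = row.split("\t");
                    cmp = argClient.compareTo(fields[0]);
                    if (cmp == 0) {
                        cmp = Integer.compare(argOrder, Integer.parseInt(fields[1]));
                    }
                }
                if (cmp == 0) {
                    return row;                   // key found
                } else if (cmp > 0) {
                    begin = m;                    // key is after this row
                } else {
                    end = m;                      // key is before this row
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Each iteration reads at most two rows, so the number of reads grows with the logarithm of the file size rather than with the number of rows.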
