How esProc Implements Text Processing

Blog 1806 0

Encapsulated lots of functions for handling structured file computing, esProc can import text files with complex formats, carry out cursor-style processing with big files and simplify multithreaded parallel processing. Usually there are three modes in which esProc can be applied: a standalone mode, the execution from command line and Java integration, as shown below:

esProc_text_application_1

The standalone mode

The text file sOrder.txt, separated by tabs, stores several years of order data, based on which you need to calculate the link relative ratio of each month’s sales amount in a specified time period. Below is a selection of source data:

esProc_text_application_2

First you can implement the algorithm in esProc IDE.

  A
1 =file(“D:\\sOrder.txt”).import@t(OrderID, Amount, OrderDate)
2 =A1.select(OrderDate>=startDate && OrderDate<=endDate)
3 =A2.groups(year(OrderDate):y,month(OrderDate):m;sum(Amount):mAmount)
4 =A3.derive(mAmount/mAmount[-1]:lrr)

Retrieve desired fields to form a two-dimensional table (a table sequence); get data according to a time period, in which begin and end are query parameters; then group data by years and months, and aggregate Amount; finally add a new field Irr (the link relative ratio by month) to A3. In the new field expression, mAmount represents sales amount of the current month and mAmount[-1] represents that of the previous month.

The execution of the finished script requires assigning values to the parameters. The interface is as follows:

esProc_text_application_4

Click “OK” to complete the computation. Then you can see result of each step by clicking the corresponding cell. For instance, below is A3’s result:

esProc_text_application_5

A4 stores the final result, as shown below:

esProc_text_application_6

Apart from executing all cells sequentially based on the code as the above shows, esProc provides a shortcut key for executing the current cell conveniently in the interactive computation.

Execution from command line

You can use esprocx.exe to execute scripts, in a form like esprocx -R d:\Query1.dfx 2012-06-01 2014-06-30. -R option can display the computing result, the last cell by default. This is shown below:

esProc_text_application_7

See esProc Integration & Application: Command Line for details.

Java integration
The Java main program can invoke the esProc script via JDBC, using the following code:

   Class.forName(“com.esproc.jdbc.InternalDriver”);

   con= DriverManager.getConnection(“jdbc:esproc:local://”);

   // call esProc script (which is similar to the stored procedure); Query1 is the name of the dfx file

   st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call Query1 (?,?)”);

   java.util.Date  dateBegin  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-01-01”);

   java.sql.Date  sqlDateBegin  =  new java.sql.Date(dateBegin.getTime());

   java.util.Date  dateEnd  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-12-31”);

   java.sql.Date  sqlDateEnd  =  new java.sql.Date(dateEnd  .getTime());

   st.setDate(1, sqlDateBegin);

   st.setDate(2, sqlDateEnd );

   // Execute the script

   st.execute();

   // Get the result set

   ResultSet rs = st.getResultSet();

   ……

esProc will return the last calculation cell by default, or it will return a certain cell using the return statement. The returned result is a ResultSet object in accordance with JDBC standards. The method of calling an esProc script is the same as that of accessing a database. Programmers who are familiar with JDBC can master it quickly.

All the reporting tools that support JDBC interface can invoke an esProc script. For details, see How to use esProc to assist reporting tools.

The above is the normal way of integrating an esProc script by Java. Now let’s move on to look at the special cases:

Simple script without a file

A simple esProc script can be written directly in the Java application, without the need of creating a script file.

For instance, you can use the following code to integrate the script for querying a text file into Java:

  st = (com. esproc.jdbc.InternalCStatement)con.createStatement();

  ResultSet rs1 = st.executeQuery(“=file(\”D:\\sOrder.txt\”).import@t()\n” + “=A1.select(OrderDate>=date(\”2010-01-01\”) && OrderDate<=date(\”2010-12-31\”))”);

You just need to use the carriage return “\n” to separate lines (and “\t” to separate columns).

You can also use the prepareStatementment object to execute the script, forcing a parameter type conversion. The SQL prepareStatementment object uses the quotation mark as the parameter placeholder. But as the quotation mark is a part of an esProc expression, you need to use the form of “arg1,arg2,arg3” to hold places sequentially. Here’s the code:

  st= (com. esproc.jdbc.InternalCStatement)con.prepareStatement(“=file(\”D:\\sOrder.txt\”).import@t()\n” + “=A1.select(OrderDate>=arg1 && OrderDate<=arg2)”);

  java.util.Date  dateBegin  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-01-01”);    

  java.sql.Date  sqlDateBegin  =  new java.sql.Date(dateBegin.getTime());

  java.util.Date  dateEnd  =  new SimpleDateFormat(“yyyy-MM-dd”).parse(“2010-12-31”);    

  java.sql.Date  sqlDateEnd  =  new java.sql.Date(dateEnd  .getTime());

  st.setDate(1, sqlDateBegin); 

  st.setDate(2, sqlDateEnd ); 

  ResultSet rs1 = st.executeQuery(); 

Writing data into a database

esProc supports computations involving heterogeneous data sources. For example, to update an Oracle database with A4 in the above example, you just need to write the following code in A5: >orDB1.update(A4,summaryTable, Year:y,Month:m,mAmount,lrr; Year,Month) .

orDB1 is a data source pointing to Oracle. You can use the update function to update the summaryTable table in orDB1 with A4’s table sequence. In the table, the fields Year and Month correspond A4’s y and m respectively; and mAmount and Irr fields have counterparts of the same names in A4.

In this case, the code of integration by Java doesn’t need st.getResultSet() to get the result.

About database computing, see How esProc Assists Writing SQL Queries.

Big result set

Sometimes the computing result is too big to be held in the memory. In that case you can use an esProc cursor function to return it, and use JDBC flow to access Java. For example, file1.txt and file2.txt are big files with the same structure but they store different data, and you need to find records from them with same IDs.

esProc code:

  A B
1 =file(“E:\\file1.txt”).cursor() =file(“E:\\file2.txt”).cursor()
2 =[A1.sortx(_1),B1.sortx(_1)].merge@xi(_1)  

The cursor function can execute a SQL statement and return a cursor. The sortx function can sort the cursor. The merge function can merge ordered data; its @x means merging cursors and @i means returning an intersection of the cursors.

A2 also returns a cursor, from which the Java main program cannot fetch data at one time. So you need to perform a batch fetching with the following code:

   st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call bigMerge()”);

   st.setFetchSize(1000);// Set the number of records retrieved in each batch

   st.execute();

   ResultSet rs = st.getResultSet();

   while (rs.next()) {

   ……

   }

You can export A2 to a file by writing =file(“E:\\result.txt”).export(A2) in A3. Using a similar method, you can also export data from a database to a file.

FAVOR (0)
Leave a Reply
Cancel
Icon

Hi,You need to fill in the Username and Email!

  • Username (*)
  • Email (*)
  • Website