The New Java Way of Handling Structured Text Files

It’s common to handle structured text files in Java programming. But since Java lacks corresponding class library, code produced through hardcoding is complicated and difficult to maintain.

For example, sales.txt is a tab-separated structured text file that needs sorting by SellerId. Here’s the code:

Comparator<salesRecord> comparator = new Comparator<salesRecord>() {

public int compare(salesRecord s1, salesRecord s2) {

if (!s1.SellerId.equals(s2.SellerId)) {

return s1.SellerId.compareTo(s2.SellerId);

} else {

return s1.ID.compareTo(s2.ID);

}

};

Collections.sort(sales, comparator);

This piece of code focuses merely on the sorting algorithm. If you fill in more steps including file retrieval, skipping the first line of field names, constructing salesRecord class and exporting result set, the code will become really lengthy. This piece can be only used to sort data by a single field. To sort data by multiple fields or in both ascending and descending directions, you need to redesign the code.

Besides sorting algorithm, other basic algorithms of handling structured text files include data querying, grouping and aggregation, merge operation, finding unique values and join operations, etc. Usually it takes the combined effort of more than one basic algorithm to implement complete business logic. In this case, the hardcode becomes more complicated and much less readable. For large file handling involving file segmentation, data exchange between internal and external memory, and parallel algorithm, code writing will be extremely difficult.

Using computing layer specifically designed to deal with structured data will help you create particularly succinct code. Here we use esProc to explain this point. With esProc, you are able to express the above sorting algorithm as follows (file retrieval is included):

Connection con= DriverManager.getConnection(“jdbc:esproc:local://”);Statementst = (com. esproc.jdbc.InternalCStatement)con.createStatement();

ResultSet rs = st.executeQuery(“=file(\”d:\\Sales.txt\”).import@t().sort(SellerId)”);

The import function reads in the file with @t added to use the first line as column names. The sort function sorts data. This esProc script can be embedded in a Java main program, which obtains the sorting result through JDBC interface offered by esProc.

If the computation is complex or involves a lot of steps – writing the sorting result into an export file, for instance – you can create the esProc script as a separate file ( named, say, salesSort.dfx):

	A
1	=file(“D:\\sales.txt”).import@t()
2	=A1.sort(SellerId)
3	=file(“D:\\result.txt”).export@t(A2)

Then call this script from the Java program in the way of calling a stored procedure, as shown below:

Statementst st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call salesSort()”);st.execute();

To learn more details (such as parameter passing) about calling esProc by the Java program, see How to Use esProc as the Class Library for Java.

esProc is only one of the many kinds of computing layers. In a nutshell, a computing layer is a special class library or interface sitting between the source data (including structured text files) and the Java main program to specifically handle structured data. It aims at simplifying code and increasing efficiency by avoiding reinventing the wheel. It has a series of features: rich basic algorithms, convenient binary operations, simplified large file handling and intuitive multi-step computing style.

Rich basic algorithms

A structured data computing layer encapsulates a variety of basic algorithms into simple functions, which are easy to call and has efficient execution.

Data sorting:

The first example shows how to write code for single field sorting. If you sort a file on more than one field (SellerId and Client, for example), you can do it in this way:

A1.sort(SellerId,Client)

To perform a sort by SellerId in ascending order and by the year in descending order, use:

A1.sort(SellerId,year(OrderDate):-1)

Data querying:

To find records where Amount>=5000, use:

A1.select(Amount>=5000)

To perform boolean queries on multiple criteria, such as finding records during a certain time period defined by parameters startDate and endDate, use:

A1.select(OrderDate>=startDate && OrderDate<=endDate)

Below is a query with free search criteria, in which condition can be any conditional statement conforming to syntax. It might be year(OrderDate)==2015, or (year(OrderDate)<=2015 && value>=5000 )||year(OrderDate)>2015.

A1.select(${condition})

If records are ordered, use the binary search, along with @b option, to increase efficiency:

A1.select@b(Amount>=5000)

Grouping and aggregation

To group data by SellerId and year, and sum the Amount values and count the records for each group, use:

A1.groups(SellerId,year(OrderDate);sum(Amount),count(~))

To compare only neighboring records without a pre-sorting before grouping, you can use @o option.

Getting unique values

To create a client list, use:

A1.id(Client)

The id function is the counterpart of SQL DISTINCT function.

Removing duplicates

Use the following statement to retain only the first record for each salesperson’s each client. It uses group@1 function in which @1 means getting the first record of every group.

A1.group@1(Client,SellerId)

Finding topN

To find out the top 3 orders of each salesperson in terms of sales amount, use:

A1.group(SellerId;~.top(3,-Amount):t).conj(t)

The top function selects the top N records with the sign “-” meaning a reverse order. The conj function concatenates records.

To find the order with the largest amount, use maxp function.

Convenient binary operations

It’s already complicated enough to write code for performing a unary operation, yet the difficulty index further increases in carrying out a binary operation or a multiple operation. By providing convenient to use binary functions and operators, a structured data computing layer can noticeably enhance efficiency.

Related computing
emp.txt is a tab-separated text file whose EId field corresponds to sales.txt’s SellerId field. The task is to add and align emp.txt’s Name, Dept and Gender fields to sales.txt using Java. Here’s the code:

3	=join@1(sales:s,SellerId;emp:e,EId)
4	=A3.new(s.OrderID, s.Client, s.SellerId, s.Amount, s.OrderDate,e.Name, e.Dept, e.Gender)

The join function performs join operations and changes the table names to s and e. It performs inner join by default but uses @1 to do a left join and @f to get a full join. After the joining, get the desired fields from the resulting table to form a new two-dimensional structured table. Here’s the result:

esProc_java_structure_text_1

Merge operation

Table1.txt and Table2.txt are already sorted by the logical primary key – field A and field B. The task is to update Table1 with Table2 according to the primary key. Specific rule is that you update Table1 when the key values are identical but other field values are different, and you insert the record into it when the key values are different.

The code:

	A	B
1	=file(“D:\\table1.txt”).cursor@t()	=file(“D:\\table2.txt”).cursor@t()
2	=[B1,A1].merge@xu(A,B)

Retrieve Table1.txt and Table2.txt respectively as a cursor, and update A1 using B1 according to their logical primary keys. The merge function combines data together and keeps the result ordered. @x means cursor manipulation, @u means getting union. Then write the result set to a new file.

With the use of cursor, you can handle files that have any amount of data, including those extremely large files of course.

If the files are originally not ordered, they need to be sorted before being merged. In that case, A2 can be written as follows:

[B1.sortx(A,B),A1.sortx(A,B)].merge@xu(A,B)

Set operations

The f1.txt file and f2.txt file have the same data structures with the first line being the column names. The task is to find the intersection of the Name fields of both files. Here’s the code:

	A	B
1	=file(“E:\\f1.txt”).import@t()	=file(“E:\\f2.txt”).import@t()
2	=[A1.(Name),B1.(Name)].isect()

The isect function gets the intersection of sets. A1.(Name) means retrieving A1’s Name field and B1.(Name) means retrieving B1’s Name field. Both generate a set.

Also, there are union function for getting union, diff function for finding out difference, and conj function for calculating concatenation (equal to union all). These functions can be replaced by the operators ^, &, \ and|- intersection, union, difference and concatenation respectively. To get intersection, for example, use:

A1.(Name) ^ B1.(Name)

To perform intersection operation on rows, just import the text files row by row:

	A	B
1	=file(“D:\\f1.txt”).import@tis()	=file(“D:\\f2.txt”).import@tis()
2	=A1^B1

If files are already sorted, you can use merge algorithm to achieve the result that set operations can get in order to increase performance. Here’s the code:

	A	B
1	=file(“D:\\f1.txt”).import@t()	=file(“D:\\f2.txt”).import@t()
2	=[B1,A1].merge@i(Name,Dept)

The merge function combines data together, with @i used for intersection, @u for union and @d for difference

Simplified large file handling

In view of the memory capacity and performance requirement, you should import a large file and perform data manipulation in parts. This, therefore, involves the exchange of data between internal and external memory. In an effort to achieve even higher efficiency, you’d better retrieve the file by bytes using multithreaded parallel processing. By doing so, data importing doesn’t necessarily cover complete rows for each part. Thus the problem of incomplete records arises. It’s difficult to amend the situation involving detail work through hardcoding, but a structured data computing layer can handle this in a simple way.

Data querying

Query a large text file according to a given time period. Here’s the code:

	A
1	=file(“d:\\sales.txt”).cursor@t()
2	=A1.select(OrderDate>=startDate && OrderDate<=endDate)

Open the file whose size is larger than the available memory size in the form of the cursor; then use the select function to query data and return result set as a cursor. The Java main program can access the ResultSet in the JDBC stream style.

To improve the performance, use the following parallel code:

	A	B
1	4
2	fork to(A1)	=file(“d:\\sales.txt”).cursor@t(;A2:A1)
3		=B2.select(OrderDate>=startDate && OrderDate<=endDate).fetch()
4	=A2.conj()

Set 4 parallel threads. Each will retrieve one part of the file, perform the query and return result to the main program that is responsible for concatenating the returned result sets and returning the final result to Java via JDBC. Here the select function performs the query on cursor and returns cursor too. Usually the query result set is small enough that can be retrieved into the memory with fetch function.

The cursor@z function divides the file by bytes into 4 parts, which have approximately the same sizes, and retrieves only one part. By giving up the head line and complementing the tail line, esProc is capable of ensuring that each part is retrieved with complete rows.

Data sorting

To sort a large file, open it with the cursor and use sortx function to do it. The result is still a cursor: Below is the code:

A1.sortx(SellerId,Client)

The usage of sortx function is similar to that of sort function. It is capable of performing a sort in reverse direction or by multiple fields or according to both.

Grouping and aggregation

Usually the result set of grouping and aggregation is not big, so you just need to open the large file with the cursor and then use the groups function to perform the group operation. If the result set is large, use groupx function to return it as the cursor:

A1.groupx(SellerId,year(OrderDate);sum(Amount),count(~))

To have a better performance, use the following parallel code:

	A	B
1	4
2	fork to(A1)	=file(“/files/data/sales.txt”).cursor@t(;A2:A1)
3		=B2.groups(SellerId:s,year(OrderDate):y; sum(Amount):a,count(~):c)
4	=A2.conj()
5	=A4.groups(s,y;sum(a):atotal,sum(c):ctotal)

Set the number of parallel threads as 4. Each thread retrieves one segment of the large file, performs grouping and aggregation and returns result to the main program, which concatenates all the returned result sets and performs a second grouping and aggregation.

Getting unique values

To create a client list based on sales.txt, use the following code:

	A
1	=file(“D: \\sales.txt”).cursor@mt(Client)
2	=A1.(Client).groups(~)

Open the file in the form of parallel cursors and retrieve only the Client field which will be grouped in the next step. @m means parallel processing. When performing group operation on the grouping field, the groups function will return a set of values of the grouping field.

Here we assume that the result set is relatively small. If the result set is sufficiently large, you can use groupx function to return it as the cursor:

A2=A1.(Client).groupx(~)

Intuitive multi-step computing style

It’s a common practice to use several basic algorithms in a certain order for the same task. Using hardcoding, it is easy to get bogged down in detail algorithms. What’s more, you have to worry about the performance and code reuse, making it difficult to design an algorithm in a natural, top-down way of thinking. A computing layer, however, already encapsulates the basic algorithms into easy to use functions. Programmers just write an algorithm intuitively by focusing on the business logic, delivering far better development efficiency.

Here’s an example. The Stock.txt file holds records of warehouse management. In one day, the stock of a product may be added or removed repeatedly. Also there may be consecutive days when no product is admitted or checked out. The initial quantity of a product is 0, and its increase is In and its decrease is Out. The task is to calculate the daily inventory for every product. Below is the source data：

date name quantity flag2014-04-01 Item1 15 In

2014-04-01 Item1 4 In

2014-04-02 Item1 3 In

2014-04-02 Item1 10 Out

2014-04-03 Item1 3 In

2014-04-04 Item1 5 Out

2014-04-07 Item1 4 In

2014-04-10 Item1 2 Out

2014-04-01 Item2 20 In

2014-04-02 Item3 30 In

2014-04-03 Item3 14 Out

The code:

	A	B
1	=file(“D:\\stock.txt”).cursor@t()
2	=A1.group(name,date;~.select(flag==”In”).sum(quantity):in,~.select(flag==”Out”).sum(quantity):out)
3	=A2.group(name)	=periods((t=A2.id(date)).min(),t.max(),1)
4	for A3	=A4.align(B3,date)
5		>c=0
6		=B4.new(A4.name:name,B3(#):date,c:Opening,in,(b=c+in):Total,out,(c=b-out):Close)
7		=@\|B5

A2 calculates the daily quantities of In and Out respectively for every product. Then get the complete list of dates according to the smallest and biggest dates and store it in B3. Next, group the data by product, loop through each group and align it with B3 to calculate the inventory of the current day. Here’s the result:

esProc_java_structure_text_2

Rich basic algorithms

Convenient binary operations

Simplified large file handling

Intuitive multi-step computing style

Hi，You need to fill in the Username and Email！