Structured-Data Computing Layer for Java

Blog 2098 0

Programmers can handle structured data using SQL statements when the data is stored in a single database. But they can’t use SQL statements directly when the data is held in a non-database such as the file, the NoSQL database, or the Json data stream, or when multiple databases are involved. As an alternative, programmers turn to hardcoding to perform the computation with Java basic class libraries.

Here’s an example. Import the sales records from a file to calculate the sales amount of each salesperson. The computation involves grouping and aggregation algorithms. You might hardcode the computation like this:

       Comparator<salesRecord> comparator = new Comparator<salesRecord>() {
           public int compare(salesRecord s1, salesRecord s2) {
              if (!s1.salesman.equals(s2.salesman)) {
                  return s1.salesman.compareTo(s2.salesman);
              } else {
                  return s1.ID.compareTo(s2.ID);
              }
           }
       };
       Collections.sort(sales, comparator);
       ArrayList<resultRecord> result=new ArrayList<resultRecord>();
       salesRecord standard=sales.get(0);
       float sumValue=standard.value;
       for(int i = 1;i < sales.size(); i ++){
           salesRecord rd=sales.get(i);
           if(rd.salesman.equals(standard.salesman)){
              sumValue=sumValue+rd.value;
           }else{
              result.add(new resultRecord(standard.salesman,sumValue));
              standard=rd;
              sumValue=standard.value;
           }
       }
       result.add(new resultRecord(standard.salesman,sumValue));
       return result;

You see that the hard coding has a number of drawbacks in manipulating data in non-databases or across multiple databases: basic class libraries can’t express the algorithm for structured data handling; program development is low-efficiency and time-consuming; the code is too long for easy maintenance and is too tightly coupled to be reused.

A structured-data computing layer is an ideal solution to handling non-database data and data stored in multiple databases. Any class library for structured data handling or interface can be used as the computing layer. Below is the esProc code for expressing the same algorithm the above hard code expresses:

Class.forName(“com.esproc.jdbc.InternalDriver”);

Connection con= DriverManager.getConnection(“jdbc:esproc:local://”);

Statementst = (com. esproc.jdbc.InternalCStatement)con.createStatement();

ResultSet rs = st.executeQuery(“=file(“d:/source.xlsx”).import@t().groups(salesman;sum(value))”);

Or you can save the script as a file named summary.dfx:

  A
1 =file(“d:/source.txt”).import@t().groups(salesman;sum(value))

Then call it from the Java application in a way Java calls the stored procedure:

Class.forName(“com.esproc.jdbc.InternalDriver”);

Connection con= DriverManager.getConnection(“jdbc:esproc:local://”);

Statementst st =(com. esproc.jdbc.InternalCStatement)con.prepareCall(“call summary()”);

st.execute();

ResultSet rs = st.getResultSet();

So the structured-data computing layer sitting between the source data and the Java application is a kind of class library or interface specifically responsible for manipulating structured data. Below shows the architecture between the three parts:

esProc_tech_structure_data_1

Compared to hardcoding, a structured-data computing layer has the following merits:

Convenient text handling

Text data is the most common type of non-database source data, including text, csv and log files. Text handling involves importing and exporting, column and row accessing and maintenance, composite query, group and aggregate operations, set-oriented operations, merge operation and join operations, and so on. There will be a lot of difficulties when using hardcoding for text handling, such as, whether the first line is the column headers, how to perform the composite query by certain column names, whether we should use the binary approach or the bubble sort for data sorting, and how to use parameters to replace the aggregation algorithm.

Both text files and database files are two-dimensional structured data in the eye of the computing layer, so the process of realizing an algorithm by the computing layer should be expected to be simple, well-thought-out and efficient. Here are examples to give you clues about the strength of a computing layer.

Case 1: Filtering

The sOrder.txt file holds ordering information, which you are required to query by specified start and end dates.

  A
1 =file(“D:\\sOrder.txt”).import@t()
2 =A1.select(OrderDate>=argStart && OrderDate<=argEnd)

Case 2: Finding top N within a group
Find out the products whose sales amounts rank top 10 in each department.

  A
1 =products=file(“d:/source.txt”).import@t()
2 =products.group(department).(~.top(10,quantity))

Case 3: Complex algorithms

The advantage of a computing layer is even more noticeable in dealing with order-related computations and complicated multi-step computations. Let’s look at an example. inout.txt holds the warehouse management statistics for many kinds of products per day. The dates having statistics are not consecutive and the initial inventory is 0. Now you need to fill in the missing dates and calculate the daily inventory. Below is the source file:

esProc_tech_structure_data_2

Here’s the code written by using a structured-data computing layer:

  A
1 =file(“d:/inout.txt”).import@t()
2 =periods(argBeginDate,argEndDate)
3 =A1.group(product;~.align(A2,date):g)
4 =A3.news(g;A3.product,A2(#):date,ifn(in,0):in, ifn(out,0):out, stock[-1]+in-out:stock)

Generate a continuous sequence A2 according to the start and end dates, group the source data by product and align the records in each group with A2, and calculate the inventory of the current date with the formula “inventory of the previous day + warehouse-in amount of the current day – warehouse-out amount”, and finally concatenate records of every group. Here’s the result:

esProc_tech_structure_data_3

Simplifying parallel algorithms

Parallel processing can be used to increase performance, especially in the cases when large files are involved ,but hardcoding a parallel algorithm is too complicated and difficult. In an effort to improve performance, programmers often need to divide a large file by byte before beginning the parallel processing. The file splitting could result in incomplete rows and, to make them usable, a strategy of “skip the head row and complement the tail row” is required. But it’s complicated to implement the strategy with hardcoding.

A structured-data computing layer is capable of simplifying the parallel algorithm by automatically skipping the head row and complementing the tail row. Here’s such an example.

Case: Perform grouping and aggregation on the large file sOrder.txt to calculate the sales amount and the number of orders each salesperson achieves per year.

Solution with structured-data computing layer:

  A B
1 =8  
2 fork to(A1) =file(“D:\\sOrder.txt”).cursor@t(;A2:A1)
3   =B2.groups(SellerId,year(OrderDate):y;sum(Amount):s,count(~):c)
4 =A2.merge()  
5 =A4.groups@o(SellerId,y;sum(s),count(~))

The above code carries out parallel processing with 8 sub-threads released. Each thread calculates a part of the file and returns result to the main thread, which merges the results returned from the sub-threads and perform a second grouping and aggregation to get the final result. A thread uses the cursor to open the file and imports a part attributed to it and performs grouping and aggregation.

The cursor function opens file as a cursor, splitting the file approximately into several parts by byte and importing one of the parts (which is expressed by A2:A1). The data importing performed by esProc will automatically skip the head row and complement the tail row so as to make sure the imported rows are complete. The merge function combines data sets as an ordered set. The groups function performs group and aggregate operations, in which @o option is used to perform a quick group operation on ordered data. The fork function encapsulates the complex multithreaded computation to enable programmers to focus on business logic, without paying extra attention to sophisticated semaphore control and achieving a faster development than hardcoding.

Here’s the result of the case:

esProc_tech_structure_data_4

The above data handling assumes that the result set of each group and aggregate operation is relatively small. But if the result set is too big to be loaded entirely into the memory, you can use the groupx function to perform the grouping and aggregation and return result as a cursor, and then use the merge@x function to combine the resulting cursors as a larger cursor. The algorithm structure remains the same.

Note: You should set JDBC stream for the Java main program for the convenience of accessing a large result set. That is to say, you need to first perform Statement. setFetchSize method and then the Statement.excute method.

Realizing heterogeneous data source handling

The handling of heterogeneous data sources involves algorithms such as lookup, joins, merge, alignment and set-oriented operations. Hardcoding needs multilevel nested loops to implement those algorithms, making development quite difficult. Since the algorithms can be classified under the relational algebra, they can be expressed by the computing layer that supports the relational algebra. Here’s an example.
 
Case1: structure.txt is a structured file separated by tab. JSON.txt contains JSON strings. The second column of structure.txt and part of JSON.txt are related through the foreign key. The requirement is to relate the two files into a two-dimensional table. Below is a selection of the source data:
structure.txt

esProc_tech_structure_data_5

JSON.txt

esProc_tech_structure_data_6

Solution with structured-data computing layer:

  A
1 =file(“D: \\JSON.txt”).read().import@j()
2 =A1.new(#1.name:name,#1.member.(#1):cluster)
3 =A2.derive(left(cluster,(firstblank=pos(cluster,” “)-1)):key,right(cluster,len(cluster)-firstblank):value)
4 =file(“D:\\ structure.txt”).import()
5 =join@1(A4,_2;A3,key)
6 =A5.new(_1._1,_1._2,_1._3,_1._4,_1._5,_1._6,_1._7,_1._8,_2.name,_2.value)

Import the JSON file to get the necessary fields and subdocuments to create a two-dimensional table, and calculate the foreign key values (the field name is key). Next, import the text file and perform a join and retrieve the desired fields from the joining result.

The join function is used to perform a join, an inner join by default. join@1 means a left join and join@f means a full join.

The final result is as follows:

esProc_tech_structure_data_7

Case2:

There are two files f1.txt and f2.txt, with both having the first row as the column names. You need to get the intersection of the Name fields of both files. Below are selections of the two files:

The file f1.txt:

esProc_tech_structure_data_8

The file f2.txt:

esProc_tech_structure_data_9

Solution with structured-data computing layer:

  A B
1 =file(“E:\\f1.txt”).import@t() =file(“E:\\f2.txt”).import@t()
2 =[A1.(Name),B1.(Name)].isect()  

The isect function calculates the intersection of sets. A1.(Name) means retrieving A1’s Name field, whose data type is the set. The final result of the case is as follows:

esProc_tech_structure_data_10

Apart from the isect function, there are union function for getting union, diff function for finding out difference, and conj function for calculating concatenation. They can represented respectively by the signs ^, &, \ and|. For example, the intersection of three sets can be written as A1.(Name) ^ B1.(Name) ^C1.(Name).

Enabling a standard way of handling cross-database computations

There are many ways of handling data from different databases. One way is to convert the cross-database computation to single database computation. It requires the configuration of an additional server and the process is complicated. Another way is using the DB2 Federation or Oracle DBlink to call the databases, but the computational performance and stability can’t be guaranteed, and the method applies to only a few types of databases.

Hardcoding uses extended classes like JoinRowSet and FilteredRowSet to perform the cross-database computation. But the extended classes only support join operations and don’t support algorithms such as set-oriented operations and merge operation. What’s more, they can only perform inner join, but can’t perform outer join. This way works fairly well with the same type of databases, but usually fails in compatibility between different types of databases, such as MySQL and Oracle.

With the computing layer, different databases can be transferred into the same model – two-dimensional structured table. The cross-database computations then become the special cases of handling the two–dimensional structured tables, and thus can be dealt with in a universal, all-around way.

Let’s look at an example.

Case: sales is a MySQL table, containing daily orders of salespeople and with SellerId field holding the numbers of the salespeople. Emp is an Oracle table, holding salespeople’s information and with EId being the numbers of the salespeople. You are required to create a report presenting OrderID, OrderDate, Amount, Name and Department according to the condition that the order dates are within the past N days (for example N=30) or that the orders belong to certain important departments (such Marketing dep. and Finance dep.) Below are the source tables:
The sales table:

esProc_tech_structure_data_11

The Emp table:

esProc_tech_structure_data_12

Solution with structured-data computing layer:

  A
1 $(myDB1)select * from sales
2 $(oracle1)select * from emp
3 =A1.switch(SellerId,A2:EId)
4 =A3.select(OrderDate>=after(date(now()),days*-1)|| depts.array().pos(SellerId.Dept))
5 =A4.new(OrderID,OrderDate,Amount,SellerId.Name:Name,SellerId.Dept:Dept)

Explanation: Perform SQL statements to import the source data from the data sources myDB1 and oracle1 respectively, and then establish a left-join relationship. Next perform the conditional filtering on the joining result based on the two conditions. One is that the order dates are within the last N days (the corresponding parameter is days), expressed by the expression OrderDate>=after(date(now()),days*-1); another is that the orders belong to certain important departments (the corresponding parameter is depts) , expressed by the expression depts.array().pos(SellerId.Dept). The operator || stands for the logical relationship “OR”. Finally, retrieve the needed fields.

A5 has the final result, as shown below:

esProc_tech_structure_data_13

Reducing system coupling

Besides the functional drawbacks, the hardcoding has some structural defects: lengthy code, overlapped functionalities, and high system coupling.

Non-database handling and multi-database handling involve a series of computational targets: the acceptance of passed-in parameters, and their data type conversion, for example, converting the date type to string type or splitting the strings into sets; retrieving data from the source file or the source database as the same type of data for the purpose of data source handling; the realization of algorithms including querying, sorting, grouping and aggregation, and merging according to the business logic; exporting result to the console, report and file according to certain requirements. If you write code for all those targets from the lowest level, you’ll definitely end up with having a large pile of obscure code, which is almost impossible to be divided into separate blocks. In addition, the computations tend to overlap each other. For example, all the three computations – file retrieval, record querying and grouping and aggregation – access the data cyclically, inducing programmers to realize them in a single loop, which appears convenient locally but makes the structure of the code confusing and entangling and consequently, increases the system coupling.

That tight coupling prolongs the development cycle and raises maintenance cost, together with the fact that different programmers write code in different style, makes it impossible to secure the performance and quality of the code. In order to reduce coupling, a thorough, systematic change of design is necessary, that is – making each computational target an independent module yet a complement to every other at the same time.

That is why the computing layer came into being.

With the computing layer, the computing modules can be separated from the Java main program, which merely needs to prepare the passed-in parameters and pass them to the corresponding computing module to get the result. The main program can switch between the computing modules freely and a computing module can receive calls from different main program. Both actions are independently maintained without interference from the other. The computing modules are easy to maintain too because each computing target is greatly simplified. For example, the data type conversion will be done automatically; file retrieval and database access can be realized with simple functions or statements and the results will always be the two-dimensional structured table. Algorithms like data querying and data sorting are also performed with simple functions, with performance and quality guaranteed. A proper combination of the computing modules can achieve a complete business target. Here’s an example.

Target: The original business logic is to import data from a text file and then perform grouping and aggregation. The new logic requires importing data from an Excel file, but keeps the algorithm unchanged.

If you attempt to modify the hardcoded program, you need to introduce a new jar (such as pio) to completely rewrite the part of the code for file retrieval. If this part of the code is coupled with grouping and aggregation, an even more extensive change is expected.

Solution with structured-data computing layer:
This is the original code for importing data from the text file:

  A
1 =file(“d:/source.txt”).import@t().groups(salesman;sum(value))

Here’s the code for importing data from Excel, with the import function replaced by importxls function:

  A
1 =file(“d:/source.xlsx”).importxls@t().groups(salesman;sum(value))

Using parameters can decrease the coupling. Supposing the parameter argFile is used to pass the file name, the code will be like this:

  A
1 =file(argFile).importxls@t().groups(salesman;sum(value))

A dynamic syntax can further reduce coupling. Supposing the parameter avgMethod is used to decide which type of aggregation is performed (sum/count/avg/max/min), the code will be like this:

  A
1 =file(“d:/source.xlsx”).importxls@t().groups(salesman;${avgMethod}(value))

With the structured-data computing layer, computing modules are loosely coupled, making it easy to modify the code in any module. esProc, when used as the computing layer, provides JDBC interface for invocation from the Java application, which will call the esProc file in the same way as it calls the stored procedure. This way eliminates the need of changing the Java main program and further reduces coupling.

FAVOR (1)
Leave a Reply
Cancel
Icon

Hi,You need to fill in the Username and Email!

  • Username (*)
  • Email (*)
  • Website