The Three Stages in Building Reports with Heterogeneous Data Sources

There are many types of report data sources, including relational databases, NoSQL databases, local files, HDFS files and JSON data streams. It’s easy to build a report on a single data source, but difficult to build one that needs data from more than one type of source, i.e. heterogeneous data sources. Data preprocessing across different sources is one of the recurring bottlenecks in report development. This article introduces three stages, from rudimentary to full-featured, in dealing with heterogeneous data sources, and offers one or more solutions for each stage, illustrated with a detailed example.

The first stage: Foreign key lookups

Problem definition: Data set A has two fields – code and value, and data set B has multiple attribute fields, including a code field which is the logical foreign key pointing to data set A. The requirement is to present data set B in a report, where the code field is replaced with data set A’s value field.

Solution 1: SQL query

As each record of data set B is displayed, its code value is passed as a parameter to a SQL query against data set A, and the value field the query returns is shown in the cell in place of B’s code.

Advantages: Most reporting tools support this solution. It needs only simple configuration, with no expression or script to write.

Disadvantages: The solution issues one query per displayed record and thus performs poorly. It only displays values referenced by a foreign key, and falls far short of general data preparation across multiple types of data sources. Besides, the data source itself must support queries, as a database does. Data sources that don’t support queries, such as files or JSON data streams, can’t be used with this solution.
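To make the cost of the per-record query concrete, here is a minimal Python sketch. It is not what a reporting tool runs internally; two in-memory SQLite databases merely stand in for the two separate sources, and the table contents (names, amounts) are invented sample values.

```python
import sqlite3

# Two in-memory SQLite databases stand in for the two separate sources.
# All rows below are invented sample data.
emp_db = sqlite3.connect(":memory:")
emp_db.execute("CREATE TABLE emptb (EId INTEGER PRIMARY KEY, Name TEXT)")
emp_db.executemany("INSERT INTO emptb VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])

sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE salestb (OrderID INTEGER, SellerId INTEGER, Amount REAL)")
sales_db.executemany(
    "INSERT INTO salestb VALUES (?, ?, ?)",
    [(101, 1, 250.0), (102, 2, 90.0), (103, 1, 40.0)],
)


def build_rows_with_per_row_query():
    """Return (report rows, number of lookup queries issued)."""
    rows, lookups = [], 0
    for order_id, seller_id, amount in sales_db.execute(
        "SELECT OrderID, SellerId, Amount FROM salestb ORDER BY OrderID"
    ):
        # One extra round trip per displayed record: the cost of Solution 1.
        found = emp_db.execute(
            "SELECT Name FROM emptb WHERE EId = ?", (seller_id,)
        ).fetchone()
        lookups += 1
        rows.append((order_id, found[0] if found else None, amount))
    return rows, lookups


report_rows, lookup_count = build_rows_with_per_row_query()
```

The number of lookup queries grows with the number of displayed records, which is why this approach slows down on larger reports.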

Solution 2: Making use of the built-in functions of reporting tools, such as the script query.

With each record of data set B displayed, traverse data set A using a script to find out the corresponding value field to display in the cell. This is a more common method.

Advantages: Satisfactory performance and support for any type of data source.

Disadvantages: A script is sometimes needed, and since different reporting tools require different approaches, there is no standard way to implement the solution. Some tools allow direct traversal of data set A, while others (like BIRT) require that data set A be stored in a global variable in advance. Also, the solution has a narrow range of applications and cannot handle all scenarios involving multiple types of data sources.
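The idea behind the script lookup can be sketched in Python as follows. This is an illustration only, not the scripting API of any reporting tool. The function name replace_codes and the sample rows are hypothetical, and the sketch indexes data set A in a dictionary once rather than traversing it for every record.

```python
def replace_codes(detail_rows, lookup_rows, fk_field, key_field, value_field):
    """Return copies of detail_rows with fk_field replaced by the looked-up value."""
    # Index data set A once so each lookup is a dictionary access, not a scan.
    index = {row[key_field]: row[value_field] for row in lookup_rows}
    result = []
    for row in detail_rows:
        new_row = dict(row)
        # Keep the raw code when no match exists, so missing references stay visible.
        new_row[fk_field] = index.get(row[fk_field], row[fk_field])
        result.append(new_row)
    return result


# Invented sample data: data set A (emp) and data set B (sales).
emp = [{"EId": 1, "Name": "Alice"}, {"EId": 2, "Name": "Bob"}]
sales = [{"OrderID": 101, "SellerId": 1}, {"OrderID": 102, "SellerId": 3}]
shown = replace_codes(sales, emp, "SellerId", "EId", "Name")
```

Because both data sets are already in memory, it makes no difference whether they came from a database, a file or a JSON stream.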

Example: Emp is an Oracle table and Sales is a MySQL table. SellerId field is the logical foreign key of the Sales table that points to the Emp table’s EId field. The requirement is to display the Sales table in a report where SellerId field is replaced with Name field of the Emp table. Below is the source data:

esProc_tech_mixedsources_1

We’ll use SQL queries to build the report in BIRT with the following steps:

1.Create data sources and datasets.

Create two data sources – oracleDB and mySQLDB.

Create data set dsEmp based on oracleDB using the SQL statement “select emptb.Name from emptb where emptb.EId=?”, in which “?” corresponds to the default parameter param_1.

Create data set dsSales based on mySQLDB using the SQL statement “select salestb.OrderID,salestb.Client,salestb.SellerId,salestb.Amount,salestb.OrderDate from salestb”.

2.Build a table-type report according to the dsSales data set:

esProc_tech_mixedsources_2

3.Switch to a new column binding:

Right click on cell [SellerId], select Change data Column, and click dsEmp in the Data Set drop down list and check “Name” field.

After the above operation, the Display Name for the cell will automatically turn into “Name”.

4.Set foreign key parameter

Click cell [Name], open the “Binding” page in the Property Editor, click “Data Set Parameter Binding” and edit the “param_1” in Expression Editor by changing its “Value”, and then select Available Column Bindings, table and SellerId.

After that, the value of param_1 will present itself as an expression: row["SellerId"].

5.Tidy up the report’s formatting. A preview is shown below:

esProc_tech_mixedsources_3

The second stage: Joining data sets

The first stage solves the problem of displaying values referenced by a foreign key. But it doesn’t handle heterogeneous data sources in any real sense. It is in the second stage that we begin to face the computational challenge.

Problem definition: Join different types of data sets outside the database to generate a single data source, and then present data from it in the report.

Solution: Use the built-in functions of reporting tools, such as BIRT’s Joint Data Set option or Jasper’s Virtual Data Source.

Advantages: Only a few steps are needed and no script is required. The solution has some degree of universality and applies to common cases of report development with heterogeneous data sources. It supports other types of data sources in addition to databases. If the data sources are databases, pre-join processing and simple post-join processing are also possible.

Disadvantages: Only a few reporting tools support the solution. The built-in functions are tightly coupled with the reporting tool, so the result is hardly portable. Joining N data sets takes N-1 join operations and generates N-2 intermediate data sets. Despite its partial universality, the solution has too many restrictions to offer free computation over heterogeneous data sources.

Example:

Both Cities.txt and States.txt are tab-separated text files, related through StateId. The requirement is to build a group report to present the big cities of each state, with headers being Name and ShortName in States.txt and detailed data being Name and Population in Cities.txt. Below is a selection of the source data:

esProc_tech_mixedsources_4

We’ll use BIRT’s Joint Data Set to create the report in the following steps:

1.Create two data sets.

dsCities: The data comes from Cities.txt and uses the TSV format.

dsStates: The data comes from States.txt and uses the TSV format.

2.Create the joint dataset, by performing a left join on the two data sets.

esProc_tech_mixedsources_5

3.Design a group report according to the joint dataset, as shown below. No extra configuration is needed.

esProc_tech_mixedsources_6

4.A preview:

esProc_tech_mixedsources_7
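For readers who want to see what the join in step 2 amounts to, here is a minimal Python sketch of a left join over two tab-separated data sets. It does not reproduce BIRT’s Joint Data Set implementation. The file contents are invented sample values, and the sketch assumes that Cities is the left side of the join, which the steps above do not state.

```python
import csv
import io

# Invented sample content; the real Cities.txt and States.txt are not reproduced here.
CITIES_TSV = (
    "Name\tPopulation\tStateId\n"
    "City A\t500000\t1\n"
    "City B\t300000\t2\n"
    "City C\t200000\t9\n"
)
STATES_TSV = (
    "StateId\tName\tShortName\n"
    "1\tState X\tSX\n"
    "2\tState Y\tSY\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def left_join(left_rows, right_rows, key, right_prefix):
    """Left join: keep every left row; unmatched right fields become None."""
    index = {row[key]: row for row in right_rows}
    right_fields = [f for f in (right_rows[0] if right_rows else {}) if f != key]
    joined = []
    for left in left_rows:
        match = index.get(left[key])
        merged = dict(left)
        for field in right_fields:
            # Prefix right-side fields so that the two Name columns do not collide.
            merged[right_prefix + field] = match[field] if match else None
        joined.append(merged)
    return joined


joined = left_join(read_tsv(CITIES_TSV), read_tsv(STATES_TSV), "StateId", "State")
```

Joining a third data set would mean calling left_join again on the result, which is where the extra intermediate data sets mentioned above come from.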

The third stage: Free handling of heterogeneous data sources

Although the second stage genuinely handles heterogeneous data sources, it lacks many features. If the data source itself has no computing ability, as with files or JSON data streams, a pre-join is hard to do. And the post-join can only be used for simple, static queries. It’s almost impossible to implement queries based on flexible composite fields and variables, let alone multi-step computations such as grouping, aggregation and re-filtering.

In addition, the operation of joining data sets is tightly coupled with the reporting tool, making it difficult to perform multilevel associations, and impossible to realize dynamic data sources and multi-data-source subreports.

In the third stage, we throw off these restraints and manipulate heterogeneous data sources freely.

Solution 1: Combine heterogeneous data sources into a single data source by making use of the database.

Advantages: High performance and high data consistency.

Disadvantages: A substantial amount of development work is needed, and the hardware and software costs are high.

Solution 2: Consolidate heterogeneous data sources using a user-defined data source or a high-level language like Java.

Advantages: It boasts the highest degree of computing freedom.

Disadvantages: Low performance; a heavy development workload due to the lack of class libraries for structured-data handling; and extremely tight coupling with data sources and reporting tools.

Solution 3: Professional computing middleware, such as esProc.

Advantages: This solution is essentially a variant of the previous one. Take esProc as an example. It works as loosely coupled computing middleware, supports any type of data source, provides rich class libraries for structured data that significantly reduce the development workload, requires no extra hardware, produces simple, high-performance parallel code, and integrates conveniently with reporting tools through a JDBC interface.

Disadvantages: Despite the light development workload, the solution cannot achieve zero coding. And it is weaker than a data warehouse at maintaining data consistency.

Example:

Sales is a MySQL table holding each day’s orders of some salespeople. The SellerId field contains these salespeople’s numbers. Emp.txt is a text file recording information about the salespeople. Its EId field contains their numbers. The requirement is to present the OrderID, OrderDate, Amount, Seller Name and DeptName for orders whose dates are within the past N days (suppose N=30) or which belong to certain important departments (such as Marketing and Finance).

The Sales table:

esProc_tech_mixedsources_8

The Emp.txt file:

esProc_tech_mixedsources_9

All three of the above solutions apply to this case. Here we use esProc to build the report in Jasper. The steps are as follows:

1.Write esProc code:

	A
1	=myDB1.query("select * from sales")
2	=file("d:\\Emp.txt").import@t()
3	=A1.switch(SellerId,A2:EId)
4	=A3.select(OrderDate>=after(date(now()),days*-1) || depts.array().pos(SellerId.Dept))
5	=A4.new(OrderID,OrderDate,Amount,SellerId.Name:Name,SellerId.Dept:Dept)

Explanation: Query records of the Sales table from myDB1 and import records from Emp.txt. Create an association between them through left join. Filter the related data set according to the condition that the ordering dates are within the last N days (corresponding parameter is days) (expression: OrderDate>=after(date(now()),days*-1)), or that the orders belong to certain important departments (corresponding parameter is depts) (expression: depts.array().pos(SellerId.Dept)). The operator || represents the logical relationship of “OR”. Finally, retrieve all the desired fields.

The switch function replaces the foreign key values with the corresponding referenced records. Here A1’s field is replaced with A2’s records, as shown below:

esProc_tech_mixedsources_10

A5 gets the final result as follows:

esProc_tech_mixedsources_11

2.Prepare data source and data sets

esProc provides a JDBC interface to external applications, so an esProc data source is created in the same way as an ordinary database data source. For more details, see How to Use esProc to Assist Reporting Tools. Note: With the JDBC driver, an esProc script can be accessed by any Java-based reporting tool, such as Jasper, BIRT, Crystal, Style Report and FreeReportBuilder.

After the data source is prepared, save the esProc script as afterjoin1.dfx, which can be called from JasperReport SQL Designer using the code afterJoin1 $P{pdays},$P{pdepts}. The parameters pdays and pdepts correspond to the two parameters in the esProc script.

3.Design a simple list, without extra configuration:

esProc_tech_mixedsources_12

4.A preview of the report:

esProc_tech_mixedsources_13
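For readers unfamiliar with esProc, the logic of the script in step 1 (foreign key switch, OR filter, field selection) can be approximated in Python as below. This is a sketch of the logic only and makes no claim about how esProc executes it. The function name select_orders, the today argument and all sample rows are hypothetical.

```python
from datetime import date, timedelta


def select_orders(sales, emp, days, depts, today):
    """Keep orders from the last `days` days OR sold by someone in `depts`."""
    emp_by_id = {e["EId"]: e for e in emp}   # plays the role of the switch lookup
    cutoff = today - timedelta(days=days)
    result = []
    for order in sales:
        seller = emp_by_id.get(order["SellerId"])
        recent = order["OrderDate"] >= cutoff
        important = seller is not None and seller["Dept"] in depts
        if recent or important:
            result.append({
                "OrderID": order["OrderID"],
                "OrderDate": order["OrderDate"],
                "Amount": order["Amount"],
                "Name": seller["Name"] if seller else None,
                "Dept": seller["Dept"] if seller else None,
            })
    return result


# Invented sample data standing in for the Sales table and Emp.txt.
emp = [
    {"EId": 1, "Name": "Alice", "Dept": "Sales"},
    {"EId": 2, "Name": "Bob", "Dept": "Marketing"},
]
sales = [
    {"OrderID": 1, "SellerId": 1, "OrderDate": date(2015, 1, 20), "Amount": 100.0},
    {"OrderID": 2, "SellerId": 2, "OrderDate": date(2014, 6, 1), "Amount": 200.0},
    {"OrderID": 3, "SellerId": 1, "OrderDate": date(2014, 6, 1), "Amount": 300.0},
]
picked = select_orders(sales, emp, 30, {"Marketing", "Finance"}, date(2015, 1, 31))
```

Order 1 passes on the date condition, order 2 passes on the department condition, and order 3 satisfies neither.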

Applications of the third-stage solutions

Below are several typical types of reports with heterogeneous data sources. To check the universality of the third-stage solutions, we build them using these solutions.

Case 1: Dynamic data sources

The data sources myDB1 and oraDB point to different databases, each holding an sOrder table of the same structure. The report should connect to one of the data sources dynamically, depending on a parameter, query sOrder, and display orders whose amounts are greater than 1000.

Below is a selection of sOrder in myDB1:

esProc_tech_mixedsources_14

Below is a selection of sOrder in oraDB:

esProc_tech_mixedsources_15

Neither foreign key lookups nor joining data sets can produce such a report, so only the third-stage solutions apply. Below is the esProc code:

	A
1	=${pSource}.query("select * from sOrder where Amount>?",pAmount)

pSource and pAmount are report parameters. The former is the data source name; ${…} parses a string or a string variable into an expression. The latter is the order amount to compare against.

When pSource value is “myDB1”, A1’s result is as follows:

esProc_tech_mixedsources_16

When pSource value is “oraDB”, A1’s result is as follows:

esProc_tech_mixedsources_17
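The same idea of choosing a data source from a parameter can be sketched in Python. Here a registry dictionary replaces the ${…} macro, which is a design choice of this sketch and not how esProc resolves the name. The in-memory SQLite databases, the helper names and the sample rows are all hypothetical.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory stand-in database holding an sOrder table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sOrder (OrderID INTEGER, Amount REAL)")
    conn.executemany("INSERT INTO sOrder VALUES (?, ?)", rows)
    return conn


# Registry of named data sources; contents are invented sample data.
SOURCES = {
    "myDB1": make_source([(1, 1500.0), (2, 800.0)]),
    "oraDB": make_source([(7, 2200.0), (8, 1000.0)]),
}


def query_orders(p_source, p_amount):
    """Run the same parameterized query against the source named by p_source."""
    # Look the connection up by name instead of building code from the parameter.
    if p_source not in SOURCES:
        raise KeyError(f"unknown data source: {p_source}")
    return SOURCES[p_source].execute(
        "SELECT OrderID, Amount FROM sOrder WHERE Amount > ? ORDER BY OrderID",
        (p_amount,),
    ).fetchall()
```

Calling query_orders("myDB1", 1000) and query_orders("oraDB", 1000) runs an identical query against two different databases, with only the parameter value changing.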

Case 2: A main report with a subreport using heterogeneous data sources

Design a main report with a subreport to present the order records of each employee by salary range. The main report takes its data from emp, a MySQL table, and the subreport takes its data from sales, an MSSQL table.

Even when the main report and the subreport use the same type of data source, if they get data from different sources, either the database URLs must be passed in explicitly or the sources must be combined using Java classes. The first method compromises security and the second produces complicated code. With the third-stage solutions, this is handled conveniently. Below is the esProc script:

empEsProc.dfx (This is the script for the main report)

	A
1	=myDB1.query("select * from emp where salary between ? and ?", low,high)

A1: Query the emp table in the MySQL database according to the salary ranges.

salesEsProc.dfx (This is the script for the subreport)

	A
1	=msSQL1.query("select * from sales where SellerId=?",eid)

A1: Find out the matching orders from the sales table in MSSQL database by employee ID. Suppose eid is 1, then A1’s result is as follows:

esProc_tech_mixedsources_18

We can see that the two data sources have been combined into one data source, and the main report and the subreport just need to call different esProc scripts to display their data.

There’s another type of report called multi-data-source subreports, which means a main report has multiple subreports (or table controls) that use different data sources. Only a third-stage solution is capable of building such a report.

Case 3: Multilevel associations

The Orders table is the fact table stored in the database, and the Customer, Product, Region and Supplier tables are dimension tables coming from four files. Their relationships are shown as follows:

esProc_tech_mixedsources_19

The requirement is to find out the orders whose customers and suppliers belong to the same region, calculate the number of the orders and their amount in each city, and present the result in a report.

The method of joining data sets can be used to handle the report building in this case, but it will generate a lot of intermediate result sets. By comparison, the third-stage solutions are much simpler. Below is the esProc code:

	A
1	=myDB1.query("select * from sales")
2	=file("D:/files/customer.txt").import@t()
3	=file("D:/files/product.txt").import@t()
4	=file("D:/files/supplier.txt").import@t()
5	=file("D:/files/region.txt").import@t()
6	=A2.switch(city,A5:city)
7	=A4.switch(city,A5:city)
8	=A3.switch(sid,A4:sid)
9	=A1.switch(pid,A3:pid; cid,A2:cid)
10	=A1.select(pid.sid.city.region==cid.city.region)
11	=A10.groups(cid.cid:cid;count(oid):count,sum(price*quantity):amount)

Explanation: Import the text files and establish relationships between the fact table and the dimension tables. Then query the desired orders according to these relationships and perform group and aggregate operations.
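The way chained references such as pid.sid.city.region work can be illustrated with a small Python sketch in which each foreign key value is replaced by the record it points to. It mirrors the field names used in the script above, but the helper names link and summarize and every data row are invented. The sketch also assumes that every key has a match.

```python
from collections import defaultdict


def link(rows, field, target_rows, key):
    """Replace rows[i][field] (a key value) with the matching target record, in place."""
    index = {t[key]: t for t in target_rows}
    for row in rows:
        row[field] = index.get(row[field])
    return rows


def summarize(orders):
    """Count and total the orders whose customer and supplier share a region."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for o in orders:
        if o["pid"]["sid"]["city"]["region"] == o["cid"]["city"]["region"]:
            group = totals[o["cid"]["cid"]]
            group["count"] += 1
            group["amount"] += o["price"] * o["quantity"]
    return dict(totals)


# Invented sample data for the fact table and the four dimension tables.
region = [{"city": "C1", "region": "East"}, {"city": "C2", "region": "West"},
          {"city": "C3", "region": "East"}]
customer = [{"cid": 1, "city": "C1"}, {"cid": 2, "city": "C2"}]
supplier = [{"sid": 10, "city": "C3"}, {"sid": 11, "city": "C2"}]
product = [{"pid": 100, "sid": 10}, {"pid": 101, "sid": 11}]
orders = [
    {"oid": 1, "cid": 1, "pid": 100, "price": 10.0, "quantity": 2},
    {"oid": 2, "cid": 1, "pid": 101, "price": 5.0, "quantity": 1},
    {"oid": 3, "cid": 2, "pid": 101, "price": 4.0, "quantity": 3},
    {"oid": 4, "cid": 1, "pid": 100, "price": 1.0, "quantity": 5},
]

# Link from the outermost dimension inward, in the same order as the script.
link(customer, "city", region, "city")
link(supplier, "city", region, "city")
link(product, "sid", supplier, "sid")
link(orders, "pid", product, "pid")
link(orders, "cid", customer, "cid")
summary = summarize(orders)
```

Once the references are in place, a four-level association is a chain of field accesses and no intermediate joined data set has to be materialized.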

Case 4: Complex data computation with heterogeneous data sources

According to the database table Sales and the emp.txt file, find out the top three days when the sales amount of each salesperson increases the most after a given date, and display the salespeople’s names, the three dates, the sales amounts and the growth rates in a report.

Limited in computing power, the second-stage solution can’t perform the post-join multi-step computation this case requires. A third-stage solution handles it easily. Below is the esProc code:

	A
1	=myDB1.query("select * from sales where OrderDate>=?",beginDate)
2	=file("d:\\Emp.txt").import@t()
3	=A1.switch(SellerId,A2:EId)
4	=A3.group(SellerId)
5	=A4.(~.groups(OrderDate,SellerId;sum(Amount):subtotal))
6	=A5.(~.derive((subtotal-subtotal[-1])/subtotal[-1]:rate))
7	=A6.(~.select(#!=1))
8	=A7.(~.top(3,-rate))
9	=A8.union()
10	=A9.new(SellerId.Name:Name,OrderDate,subtotal,rate)

Explanation: Create data sets by retrieving data from the two data sources and create an association between them. Group the related result set by SellerId. Then regroup each group of data by OrderDate and SellerId and aggregate the order amount for every group, that is, the sales amount per day per salesperson. A6 calculates the daily growth rate of the sales amount for each salesperson, using the formula “(Sales amount of the current day – Sales amount of the previous day) / Sales amount of the previous day”. esProc uses subtotal[-1] to represent the sales amount of the previous day, which makes relative positions easy to express. In the next step, remove the first record from each group (because it has no growth rate), find the records of the top three dates on which each salesperson’s sales amount grew fastest, and concatenate the desired records from each group. In the last step, export all the desired fields as follows:

esProc_tech_mixedsources_20
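As a cross-check of the steps described above, the same multi-step computation can be sketched in Python. This is an illustration of the logic, not of esProc’s execution. The function name top_growth_days and the sample rows are hypothetical, dates are ISO strings so that they sort correctly, and the sketch assumes every daily subtotal is positive so the division is safe.

```python
from collections import defaultdict


def top_growth_days(sales, emp, top_n=3):
    """Per salesperson, return the top_n days with the highest day-over-day growth."""
    names = {e["EId"]: e["Name"] for e in emp}
    # Daily subtotal per salesperson (the equivalent of the grouping steps).
    daily = defaultdict(lambda: defaultdict(float))
    for s in sales:
        daily[s["SellerId"]][s["OrderDate"]] += s["Amount"]
    result = []
    for seller_id in sorted(daily):
        days = sorted(daily[seller_id].items())
        rated = []
        # Pair each day with the previous one; the first day has no rate and is skipped.
        for (_, prev_total), (day, total) in zip(days, days[1:]):
            rated.append((day, total, (total - prev_total) / prev_total))
        rated.sort(key=lambda r: r[2], reverse=True)
        for day, total, rate in rated[:top_n]:
            result.append({"Name": names.get(seller_id), "OrderDate": day,
                           "subtotal": total, "rate": rate})
    return result


# Invented sample data standing in for the Sales table and emp.txt.
emp = [{"EId": 1, "Name": "Alice"}]
sales = [
    {"SellerId": 1, "OrderDate": "2015-01-01", "Amount": 100.0},
    {"SellerId": 1, "OrderDate": "2015-01-02", "Amount": 100.0},
    {"SellerId": 1, "OrderDate": "2015-01-02", "Amount": 50.0},
    {"SellerId": 1, "OrderDate": "2015-01-03", "Amount": 300.0},
]
growth = top_growth_days(sales, emp)
```

As in the script, “previous day” here means the previous day on which the salesperson had sales, not necessarily the previous calendar day.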
