Report developers regularly encounter complex reports that, though few in number, consume a disproportionate amount of development time and sometimes become genuinely difficult problems. This article digs into the root of such complexity and the methods available to resolve it, in the hope of improving report development efficiency.
In the past, the complexity of complex reports resided mainly in the front-end:
1. Cell merging and slash headers.
2. Font styles that change according to data values.
3. Computation between arbitrary cells. For example, for a few detail values that need special attention, computing their subtotal and its percentage of the grand total.
4. Regular computations over different ranges. For example, the column headers are product categories, while the row headers consist of two parts: three levels of organizational grouping and one level of year grouping. At each intersection, the revenue of the corresponding product must be displayed for the corresponding grouping.
5. Irregular grouping. For example, displaying, for each state in a list, the total revenue of the state followed by that of the state capital (when computing the state total, the state capital's revenue must not be added twice).
After decades of development, most of these front-end problems have been properly resolved. Reporting tools like Style Report, Runqian Report, QlikView and Tableau use different approaches to solve the problems above.
Today, the complexity of reports lies mainly in the back-end, that is, in the computation of data sources:
a. Complex business logic. For example, finding salespeople whose revenue grew by over 10 percent in 3 consecutive months and showing their sales revenue and quantity; or finding customers who bought every item in a parameter list and showing their account balances.
b. Cross-database computing. For example, a company wants to compute the actual salary of staff in different positions based on their performance scores and base salaries. Base salaries are stored in accounting software backed by an MSSQL database, while performance scores are stored in a KPI system backed by Oracle; the scores must be converted into a salary amount and displayed in the report (a sketch of this case appears after this list). Another example: a chain store keeps a separate database in each of its stores, and headquarters needs a report summarizing all of them.
c. Computation over non-database data sources. For example, from data on numerous stocks stored in Excel sheets, selecting those whose prices rose for 5 consecutive days; or, from log file data, showing how much attention time each customer paid to each product during a given period.
d. Combining multiple data sources into one. Reporting tools like BIRT, Crystal Reports and Jasper Reports do not support multiple databases in a convenient way, so users often have to write custom data sources that combine several sources into one.
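To make the cross-database case in (b) concrete, here is a minimal sketch in Java of what a report developer is otherwise left to do by hand: read base salaries from the MSSQL accounting database and performance scores from the Oracle KPI system, then join them in application code. The connection URLs, table and column names, and the score-to-salary formula are all hypothetical.

import java.sql.*;
import java.util.*;

public class CrossDbSalary {
    public static void main(String[] args) throws Exception {
        Map<Integer, Double> base = new HashMap<>();   // empId -> base salary
        Map<Integer, Double> score = new HashMap<>();  // empId -> KPI score

        // Base salaries from the accounting software (MSSQL); URL is hypothetical.
        try (Connection ms = DriverManager.getConnection(
                 "jdbc:sqlserver://acct-host;databaseName=acct", "user", "pwd");
             ResultSet rs = ms.createStatement().executeQuery(
                 "SELECT empId, baseSalary FROM employee")) {
            while (rs.next()) base.put(rs.getInt(1), rs.getDouble(2));
        }

        // Performance scores from the KPI system (Oracle); URL is hypothetical.
        try (Connection ora = DriverManager.getConnection(
                 "jdbc:oracle:thin:@kpi-host:1521:KPI", "user", "pwd");
             ResultSet rs = ora.createStatement().executeQuery(
                 "SELECT empId, score FROM kpi")) {
            while (rs.next()) score.put(rs.getInt(1), rs.getDouble(2));
        }

        // Join the two sources and convert the score into actual salary
        // (the conversion rule below is a made-up placeholder).
        for (Map.Entry<Integer, Double> e : base.entrySet()) {
            double s = score.getOrDefault(e.getKey(), 0.0);
            double actual = e.getValue() * (1 + s / 100);
            System.out.println(e.getKey() + "\t" + actual);
        }
    }
}

Even this toy version needs two drivers, two connections, and a hand-written in-memory join before a single report cell can be filled.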
Reporting tools only need to present the retrieved data; they play no part in generating the back-end data. Report developers must work out ways to solve the above problems themselves, so back-end complexity remains their biggest obstacle and the main contributor to the difficulty of complex reports. Moreover, modern reports tend to be simpler and easier to read, with lower demands on elaborate styling and more attention on the data, so the focus of complex reporting has shifted from front-end presentation to back-end data sources. In fact, much of the front-end complexity, such as regular computations over ranges and irregular grouping, can also be resolved from the back-end, which makes computation in report data sources all the more important.
To resolve the complexity of report data source computing, we can use SQL/SP, an intermediate database, or custom data sources.
Theoretically, SQL/SP can solve problems of complex business logic. However, it is limited to a single database and falls short in the other cases (non-database data sources, cross-database computing, and combining multiple data sources into one). Moreover, implementing complex business logic is not easy for an ordinary report developer and calls for more experienced programmers. Thus only a limited class of problems can be solved with SQL/SP, and even then the demands on personal skill are high. The disadvantage is obvious.
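To give a sense of the skill involved, here is one way to express the "over 10 percent growth in 3 consecutive months" search from the earlier example, as a sketch assuming a hypothetical sales(sellerId, month, amount) table with one row per seller per month and a database that supports window functions:

import java.sql.*;

public class GrowthQuery {
    // The window-function SQL a developer would need; the schema is hypothetical.
    static final String SQL =
        "WITH g AS ( " +
        "  SELECT sellerId, month, " +
        "         CASE WHEN amount > 1.1 * LAG(amount) OVER " +
        "                (PARTITION BY sellerId ORDER BY month) " +
        "              THEN 1 ELSE 0 END AS up " +
        "  FROM sales " +
        "), r AS ( " +
        "  SELECT sellerId, " +
        "         SUM(up) OVER (PARTITION BY sellerId ORDER BY month " +
        "                       ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS run3 " +
        "  FROM g " +
        ") " +
        "SELECT DISTINCT sellerId FROM r WHERE run3 = 3";

    public static void main(String[] args) throws Exception {
        // Hypothetical connection; any database with window functions would do.
        try (Connection con = DriverManager.getConnection(
                 "jdbc:postgresql://host/sales", "user", "pwd");
             ResultSet rs = con.createStatement().executeQuery(SQL)) {
            while (rs.next()) System.out.println(rs.getString(1));
        }
    }
}

Even this version leans on window functions and a running-count idiom, exactly the kind of knowledge an ordinary report developer may lack.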
An intermediate database can be used for cross-database computing: data is retrieved from the heterogeneous databases into a single intermediate database, and the reporting tool then accesses the unified view there. An intermediate database generally requires a separate license, which adds cost for the customer. Its data also has to be loaded in advance, making it less suitable for real-time use. To achieve real-time, incremental retrieval, we must create scheduled tasks or add triggers and timestamps to the source databases, and write more complex data-loading scripts, which significantly increases the development work. Moreover, many commercial software vendors do not allow customers to modify the underlying databases of their products to add triggers and timestamps, so this kind of improvement is often impossible. In short, the shortcomings of the intermediate database are high cost, heavy development workload, and no guarantee of real-time processing or performance.
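As a rough illustration of that extra loading work, here is a minimal sketch of a timestamp-based incremental load; the connection URLs, the source table src_orders with its last_modified column, and the staging table stg_orders are all hypothetical:

import java.sql.*;

public class IncrementalLoad {
    // Copies rows changed since the last run from the source database into
    // the intermediate database. A real job must also track the last run time,
    // handle updates and deletes, and be wired into a scheduler.
    public static void run(Timestamp lastRun) throws SQLException {
        try (Connection src = DriverManager.getConnection(
                 "jdbc:oracle:thin:@src-host:1521:ORCL", "user", "pwd");
             Connection dst = DriverManager.getConnection(
                 "jdbc:postgresql://staging-host/reports", "user", "pwd");
             PreparedStatement read = src.prepareStatement(
                 "SELECT id, amount, last_modified FROM src_orders WHERE last_modified > ?");
             PreparedStatement write = dst.prepareStatement(
                 "INSERT INTO stg_orders (id, amount, last_modified) VALUES (?, ?, ?)")) {
            read.setTimestamp(1, lastRun);
            try (ResultSet rs = read.executeQuery()) {
                while (rs.next()) {
                    write.setLong(1, rs.getLong(1));
                    write.setBigDecimal(2, rs.getBigDecimal(2));
                    write.setTimestamp(3, rs.getTimestamp(3));
                    write.addBatch();
                }
            }
            write.executeBatch();
        }
    }
}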
Computation over non-database data sources can also be done through an intermediate database, with similar advantages and disadvantages. The main differences lie in cost and the ability to process in real time. First, it is difficult to add triggers and timestamps to non-database sources for real-time computing. Second, many non-database sources have large data volumes, which occupy valuable database storage and drive up cost. In the above-mentioned example of computing customers' attention time from log files, log entries for a commercial website can number in the tens of millions per day, and a year of data can amount to several TB, so the intermediate database has to be expanded again and again. Such volumes also hurt performance significantly; to maintain it, a parallel database must be used, which is very expensive.
A custom data source is an interface provided by most reporting tools as a convenient way to combine multiple data sources into one. The most common type is the user-defined Java Bean, which lets developers use a high-level language to unify heterogeneous databases, other data sources, and semi-structured data into a single source. The advantage is flexibility: the developer can do almost anything. But the disadvantage is equally obvious. Unlike professional data computation languages such as SQL, esProc or R, Java and other high-level languages lack structured data computation functions. Developers must first implement a lot of low-level operations, such as filtering, grouping, aggregation, ranking, sorting, distinct values and association algorithms, before they can do any real computing. This is a huge development effort, and even ordinary or relatively simple algorithms become difficult to implement in a high-level language.
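For instance, merely grouping rows by region and summing revenue, a one-liner in SQL, already takes hand-written code in a custom Java data source; the class and field names below are hypothetical:

import java.util.*;

public class RegionTotals {
    // A row as a reporting tool's custom Java Bean data source might expose it.
    public record Row(String region, double revenue) {}

    // Hand-rolled "GROUP BY region, SUM(revenue)"; every other operation the
    // report needs (filtering, ranking, joins, ...) must be written the same way.
    public static Map<String, Double> sumByRegion(List<Row> rows) {
        Map<String, Double> totals = new LinkedHashMap<>();
        for (Row r : rows) {
            totals.merge(r.region(), r.revenue(), Double::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(new Row("East", 120.0),
                                 new Row("West", 80.0),
                                 new Row("East", 40.0));
        System.out.println(sumByRegion(rows));  // {East=160.0, West=80.0}
    }
}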
The R language can also be used as a custom data source. Its advantage is a rich function library that supports mixed computations. Its disadvantages are the lack of a JDBC interface and relatively poor performance and stability, so in practice very few people use it for the complex computations required in report data sources.
esProc is another custom data source. Like SQL, esProc is a professional data computing language with rich structured computation functions, helping customers easily solve the first category of problems: complex business logic. Like R, esProc can directly access databases, text files, Excel and semi-structured data for mixed computations without an intermediate staging database, enabling high-performance, low-cost cross-database computing as well as computation over non-database sources.
esProc is designed for application developers. Its syntax is simple, coding is easy and highly interactive, debugging functions are rich, and the technical requirements on developers are low. esProc provides a JDBC interface, so reporting tools can access esProc's computing results the same way they access a database, which makes integration straightforward. With esProc's built-in parallel computing engine, large computation tasks can be broken down and processed by multiple low-cost PC nodes, giving superior performance.
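As a rough sketch of that integration, a report's data layer could call an esProc script through JDBC along these lines; the driver class and URL follow the conventions in esProc's documentation, while the script name salary and the parameter and column names are hypothetical:

import java.sql.*;

public class EsprocDataSource {
    public static void main(String[] args) throws Exception {
        // Load the esProc JDBC driver (class name as given in esProc's docs).
        Class.forName("com.esproc.jdbc.InternalDriver");
        try (Connection con = DriverManager.getConnection("jdbc:esproc:local://");
             // Invoke a hypothetical esProc script much like a stored procedure.
             CallableStatement st = con.prepareCall("call salary(?)")) {
            st.setString(1, "2014-06");            // hypothetical report parameter
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {                // consume rows like any JDBC result
                    System.out.println(rs.getString("empName") + "\t"
                                       + rs.getDouble("actualSalary"));
                }
            }
        }
    }
}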
esProc can also handle a special kind of complex report: reports with heavy concurrency. For these, neither the front-end design nor the data source computation is difficult, and the data volume is not large; the real difficulty lies in the sheer number of computing tasks and the pressure of concurrency. Because esProc supports parallel computing, a heavily concurrent workload can be balanced across multiple node machines, with each node handling only a limited share of the tasks. In addition, esProc supports cheap scale-out: as report concurrency keeps growing, the customer can upgrade simply by adding nodes, which is easier, cheaper and more effective than upgrading the database.
The following outlines how esProc handles the cross-database computation mentioned above, in which actual salary is computed from the performance scores. The cell references below indicate where each part of the work is done in the esProc script:
Cross-database association computing: A5
Complex business logic: A6-C9
Structured data computation functions: A4, A10-A13
JDBC output: A14