esProc’s Use in ETL


The purpose of ETL is to get higher-quality data, with better performance and in a more convenient way. Actual ETL tasks include basic data migration and data correction, as well as many industry-specific business logics, such as:

1. Transform the raw sales data from stores into the data warehouse, with one table storing specific data: the commodities whose out-of-stock index for the current day is higher than 90%.

2. Employee performance is computed every month. The performance algorithm differs by department and position, and involves time in the job, colleague assessments, attendance records, customer feedback, etc. The data are extracted from different business systems, with MSSQL, Oracle, Excel, etc. as data sources.

3. The owner of a website that generates gigabytes of access logs every day wants to gain insight into members' attention to commodities. The idea is that the intermediate results of ETL should reflect the time each member spends on specific commodities per day.

Out-of-stock risk, performance algorithms, attention to commodities: these are all industry-specific business logics, each with its own characteristics and complexity. However, the configuration interface of an ETL tool can only perform generic format conversion, data correction, and simple filtering. Such business logics cannot be handled by the ETL tool itself and still have to be hard-coded.

Many ETL tools have a built-in scripting language for hard coding. These scripts are normally good at logic and flow control and can be used for simple data processing such as data type validation, adding missing fields, removing duplicate records, and data internationalization and standardization. However, these scripting languages lack computation syntax and functions for batch structured data, such as aggregation, grouping, query, sorting, and ranking. Their ease of use cannot compare with SQL, and they are hard to use for implementing ordinary business algorithms such as the above-mentioned out-of-stock risk, performance algorithm, and attention to commodities.

ETL tools can also call external languages such as Tcl, Perl, Python, and Java. These languages also lack batch computation syntax and library functions for structured data, so all the basic functions have to be coded by the programmers themselves. Like the built-in scripts, these external languages are difficult to develop in, and their convenience falls far short of SQL.

Since ETL scripts and external languages are not as convenient as SQL, many users consider migrating the data into a dedicated database and implementing the ETL business logic with SQL or stored procedures, turning the ETL process into ELT (actually ELTL). But SQL is not a procedure-oriented language, and writing complex business logic in it is very difficult. Stored procedures perform worse than plain SQL, which means SQL is still a necessity for batch computation and grouping operations. Apart from the performance advantage of not having to retrieve data out of the database, this is no different from calling SQL from Java, and many business logics remain difficult to code. This is one problem with using SQL / stored procedures to implement business logics in ETL.

Another problem is how to deal with heterogeneous data sources. ETL often has to perform cross-database access, reach data outside the database, or handle diverse unstructured sources such as log files. In example 2 above, the data sources are MSSQL, Oracle, and Excel; in example 3, the source is semi-structured website logs. SQL cannot be used on them directly.

Even if we could load these diverse data sources into one database for computation, a third problem arises: high cost and low performance. An additional database for temporary data storage increases cost. Moreover, the original data volume is usually very large, and loading it into an intermediate database is extremely time-consuming; the I/O performance is normally much lower than copying the data into files directly. In example 3 above, loading one day's log file into the database takes several hours.

Current solutions therefore offer no good way to handle the business logics prevalent in ETL: the configuration interfaces of ETL tools cannot do the job; built-in and external scripts are difficult to develop; SQL / stored procedures are hard to code and struggle with diverse data sources, on top of the high-cost, low-performance concern.

esProc is a good solution to the above issues.

A rich collection of built-in data processing functions

esProc is a programming language designed for structured data computation, providing a rich family of built-in objects and functions to implement complex business logic. Many ETL problems that are challenging for SQL / stored procedures can be solved easily with esProc. esProc supports true set data types, which simplify structured data computation and let users carry out flexible computations from a business perspective. It supports ordered sets, allowing free access to set members and order-related computations such as ranking, sorting, and comparisons with the previous or the same period. A set of sets is a natural expression of grouping, and esProc's built-in equi-grouping, alignment grouping, and enumeration grouping can be used to solve all kinds of tricky problems. Meanwhile, a single record in a data set can be operated on like an object; such dissociated records give users more freedom and a better data access experience.
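As a rough illustration of these set-oriented operations, a minimal sketch is shown below. The data source name salesdb, the monthlySales table, and its fields (with month assumed to be a date) are hypothetical, and the exact syntax may vary slightly across esProc versions.

    A1  =connect("salesdb")
    A2  =A1.query("select month, amount from monthlySales order by month")
    A3  =A2.derive(amount-amount[-1]:growth)
    A4  =A2.groups(year(month); sum(amount):yearTotal)

A2 retrieves an ordered set of monthly sales; A3 uses the ordered reference amount[-1] to compare each month with the previous one; A4 performs an equi-grouping by year with an aggregate. Each expression works on a whole set at a time instead of looping record by record.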

Cell-style scripting supports efficient development

esProc is a cell-style language whose computing logic is laid out in a two-dimensional grid of cells, which makes it easier to express a business algorithm in code. These cells naturally support step-by-step computation, with each cell representing one step, or one part, of the computation.

esProc is good at breaking the complex business logics of ETL down into a few simple steps. In the grid, code indentation and scope are more intuitive, and referencing or reusing a result is easy: a cell can be referred to by its name, with no need to define a variable. By clicking on a cell, users can inspect its result directly, without searching through a list of variables. The cells also provide real debugging capabilities for ETL.
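For example, a simple salary check might be written in four steps. This is a minimal sketch in which the file name emp.txt and its fields are hypothetical:

    A1  =file("emp.txt").import@t()
    A2  =A1.select(dept=="Sales")
    A3  =A2.avg(salary)
    A4  =A2.select(salary > A3)

A1 loads a tab-separated employee file with a title row, A2 filters the sales department, A3 computes the average salary of that subset, and A4 refers to A3 by cell name to pick the employees above the average. Each intermediate result can be clicked and inspected while debugging.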

Built-in computing engine supports diverse data sources

esProc can access a variety of relational databases and NoSQL databases directly, and allows easy cross-database computing over homogeneous or heterogeneous databases. esProc can also read structured data directly from txt or Excel files, supporting hybrid computing across heterogeneous data sources. Its built-in text parser supports string parsing and regular expressions, which can be used to handle semi-structured data; the parsed data can then take part directly in hybrid computing.
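A hybrid computation like example 2 might be sketched as follows; the data source names, the Excel file, and all table and field names here are hypothetical:

    A1  =connect("mssql1")
    A2  =A1.query("select empId, assessment from colleagueReview")
    A3  =connect("oracle1")
    A4  =A3.query("select empId, attendance from attendanceRecord")
    A5  =file("customerFeedback.xlsx").xlsimport@t()
    A6  =join(A2:r, empId; A4:a, empId; A5:f, empId)
    A7  =A6.new(r.empId:empId, r.assessment:assessment, a.attendance:attendance, f.score:feedback)

A2 and A4 pull data from MSSQL and Oracle, A5 imports an Excel sheet with a title row, and A6/A7 join the three sources on empId into one result set, ready for the rest of the performance calculation.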

High-performance single machine computing and low-cost scalability

esProc can retrieve source data directly for computation, without loading it into a database first. This avoids inefficient database I/O, and the computing performance of a single machine can even exceed that of a database. esProc can also replicate data to distributed file systems such as HDFS, supporting the scale-out and parallel computing needed for handling Big Data. esProc uses the file system and cheap hard disks for temporary storage (in many cases temporary storage is not needed at all), which eliminates the need to buy additional database storage devices and licenses. This ensures high performance at an effectively lower cost.
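For instance, a large detail or log file can be aggregated by streaming it through a cursor instead of loading it into a database first. In this minimal sketch the file names and fields are hypothetical:

    A1  =file("sales_detail.txt").cursor@t()
    A2  =A1.select(amount > 0)
    A3  =A2.groups(sku; sum(amount):total)
    A4  =file("sku_total.txt").export@t(A3)

A1 opens a cursor over the big text file so it is read in batches rather than loaded into memory, A2 filters while streaming, A3 aggregates per commodity, and A4 writes the intermediate result back to an ordinary file for the next ETL step.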

esProc handles complex computation, heterogeneous data sources, and high-performance computing effectively, which makes it a very useful tool for the ETL process.
