Program with Agile Syntax of esProc on Hadoop

Hadoop is an outstanding distributed computing system whose default development model is MapReduce. However, MapReduce was not designed specifically for data computing: its syntax is cumbersome, coding computations in it is relatively inefficient, and writing general-purpose algorithms is harder still.

In terms of syntax agility, esProc outperforms MapReduce by a wide margin.

Here is an example of how to develop Hadoop code with esProc. Take the common group-and-aggregate algorithm in MapReduce: from the order data on HDFS, sum up the sales amount of each salesperson and find the top N salespeople. In the example code, the big data file (fileName), the field to group by (groupField), the field to summarize (sumField), the summarizing method (method), and the size of the top-N list (topN) are all parameters. In esProc, the corresponding code is shown below:

Code for summary machine:
[Image: esproc_agile_syntax1]

Code for node machine:
[Image: esproc_agile_syntax2]

How do we perform parallel computation over big data? The most intuitive idea is to decompose the task into several segments, distribute them to the node machines for an initial summary, and then have the summary machine summarize those results a second time.

From the code above, we can see that esProc divides the distributed computation into two parts: the code for the summary machine and the code for the node machines. The summary machine is responsible for decomposing the task, distributing the subtasks to the nodes in the form of parameters, and finally consolidating and summarizing the results returned by the node machines. Each node machine fetches the segment of the whole data set specified by its parameters, and then groups and summarizes the data in that segment.

Now let's look at the code above in detail.
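The division of labor described above can be sketched in Python. This is not esProc code and not the author's implementation: the function names node_task and summary_task, the in-memory orders list, and the even split by record count are all assumptions made for illustration. In the real setup each segment would be a part of the HDFS file processed on a separate node.

```python
from collections import defaultdict

def node_task(records, group_field, sum_field):
    """Node machine (hypothetical): group and sum one segment of the data."""
    partial = defaultdict(float)
    for rec in records:
        partial[rec[group_field]] += rec[sum_field]
    return dict(partial)

def summary_task(records, n_tasks, group_field, sum_field, top_n):
    """Summary machine (hypothetical): split, dispatch, then re-summarize."""
    # Decompose the task into n_tasks contiguous segments.
    bounds = [len(records) * i // n_tasks for i in range(n_tasks + 1)]
    segments = [records[bounds[i]:bounds[i + 1]] for i in range(n_tasks)]
    # "Distribute" each segment; the nodes are simulated by plain calls here.
    partials = [node_task(seg, group_field, sum_field) for seg in segments]
    # Second-stage summary: combine the partial sums from every node.
    totals = defaultdict(float)
    for partial in partials:
        for key, value in partial.items():
            totals[key] += value
    # Sort descending by total and keep the top N.
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

# Made-up sample orders, for illustration only.
orders = [
    {"empID": "e1", "amount": 100.0},
    {"empID": "e2", "amount": 250.0},
    {"empID": "e1", "amount": 300.0},
    {"empID": "e3", "amount": 50.0},
    {"empID": "e2", "amount": 25.0},
]
top2 = summary_task(orders, n_tasks=2, group_field="empID",
                    sum_field="amount", top_n=2)
print(top2)
```

The point of the sketch is only that the same grouping logic runs twice: once per segment on the nodes, and once more on the summary machine over the much smaller partial results.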

Variable definition

As can be seen from the code above, esProc code is written in cells. Each cell is identified by a unique combination of row ID and column ID. A cell name serves as a variable and needs no declaration. For example, in the summary machine code:

A2: =40

A6: =["192.168.1.200:8281","192.168.1.201:8281","192.168.1.202:8281","192.168.1.203:8281"]

A2 and A6 are simply two variables representing the number of tasks and the list of node machines respectively. Other code can reference these variables directly by cell name. For example, A3, A4, and A5 all reference A2, and A7 references A6.

Because a variable is simply a cell name, references between cells are intuitive and convenient. This approach lets you decompose a large goal into several simple steps and reach the final goal by having each step build on the previous ones. In the code above, A8 references A7, A9 references A8, and A10 references A9. Each step solves only a small problem; step by step, the computational goal of this example is achieved.

External parameter

In esProc, a parameter can be used either as an ordinary parameter or as a macro. For example, in the summary machine code, fileName, groupField, sumField, and method are all external parameters:

A1: =file(fileName).size()

A7: =callx("groupSub.dfx",A5,A4,fileName,groupField,sumField,method;A6)

They have the following meanings:

fileName, the name of the big data file, for example: "hdfs://192.168.1.10/sales.txt"

groupField, the field to group by, for example: empID

sumField, the field to summarize, for example: amount

method, the summarizing method, for example: sum, min, max, etc.

If a parameter is enclosed in ${}, it is used as a macro. For example, in this piece of code from the summary machine:

A8: =A7.merge(${groupField})

A9: =A8.groups@o(${groupField};${method}(Amount):sumAmount)

In this case, esProc interprets the macro as code to execute rather than as an ordinary parameter value. The expanded code could be:

A8: =A7.merge(empID)

A9: =A8.groups@o(empID;sum(Amount):sumAmount)

Macros are a dynamic language feature. Compared with ordinary parameters, a macro can be used directly in a computation as code, which is much more flexible and makes reuse easy.
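The textual substitution behind a macro can be imitated in Python with string.Template, which happens to use the same ${name} placeholder syntax. This is only an analogy for how the expanded formulas come about, not a description of how esProc implements macros; the parameter values are the sample values from the text.

```python
from string import Template

# Sample parameter values taken from the examples in the text.
params = {"groupField": "empID", "method": "sum"}

# The cell formulas, kept as plain strings with ${...} placeholders.
a8 = Template("=A7.merge(${groupField})")
a9 = Template("=A8.groups@o(${groupField};${method}(Amount):sumAmount)")

# Substitution yields the text that would then be interpreted as code.
expanded_a8 = a8.substitute(params)
expanded_a9 = a9.substitute(params)
print(expanded_a8)
print(expanded_a9)
```

Changing method to "max" in params would produce a different formula from the same template, which is the reuse the macro mechanism is meant to provide.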

Two-dimensional table in A10

Why does A10 deserve special discussion? Because A10 is a two-dimensional table, a type that is used frequently in data computation. It has two columns, of string type and float type respectively. Its structure is like this:

[Image: esproc_agile_syntax3]

In esProc, the use of a two-dimensional table itself shows that esProc supports dynamic data types. In other words, we can assign data of various types to a variable without any extra effort to declare the type. Dynamic typing not only saves the effort of defining data types but is also convenient because of its expressive power. Working with the two-dimensional table above, you may find that dynamic typing makes massive data computation more convenient.

Besides the two-dimensional table, a dynamically typed value can also be an array. For example, with A3: =to(A2), A3 is an array whose value is [1, 2, 3, ..., 40]. Simple values are supported as well, of course; I have verified date, string, and integer data.

The dynamic data type also supports nested data structures. For example, the first member of an array can be a number, the second an array, and the third a two-dimensional table. This makes dynamic typing even more flexible.
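As a rough illustration of such nesting, here is the same shape in Python, which is also dynamically typed. The table is modelled as a list of rows, and the values are made up for the example; nothing here is esProc syntax.

```python
# A two-dimensional table modelled as a list of rows (hypothetical values).
table = [
    {"empID": "e1", "sumAmount": 400.0},
    {"empID": "e2", "sumAmount": 275.0},
]

# One variable holding a number, an array, and a table, with no declarations.
nested = [42, [1, 2, 3], table]

kinds = [type(member).__name__ for member in nested]
print(kinds)
print(nested[2][0]["empID"])  # reach into the table stored as the third member
```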

Computational functions for massive data

esProc has many functions designed for massive data computation. For example, A3 in the code above, =to(A2), generates the array [1, 2, 3, ..., 40].

You can compute over every member of this array directly, without loop statements. For example, A4: =A3.(long(~*A1/A2)). In this formula, the current member of A3 (represented by "~") is multiplied by A1 and then divided by A2. Suppose A1=20000000; then the result in A4 is [500000, 1000000, 1500000, 2000000, ..., 20000000].

Such a function is officially called a loop function; it is designed to make the syntax more agile by eliminating loop statements.

Loop functions can handle massive data of any kind, including two-dimensional tables from a database. For example, A8, A9, and A10 are loop functions acting on two-dimensional tables:
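The arithmetic in A3 and A4 can be checked with a short Python sketch. The variable names a1 to a4 simply mirror the cell names, and the file size of 20000000 is the value assumed in the text; integer division stands in for the long() conversion, which is equivalent for these positive values.

```python
a1 = 20_000_000                 # assumed file size, as in the text
a2 = 40                         # number of tasks
a3 = list(range(1, a2 + 1))     # analog of =to(A2): [1, 2, ..., 40]

# Analog of =A3.(long(~*A1/A2)): one expression applied to every member.
a4 = [member * a1 // a2 for member in a3]

print(a4[:4], a4[-1])
```

Each value is the position in the file at which a segment ends, so the 40 values cut the file into 40 equal parts of 500000 each.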

A8: =A7.merge(${groupField})

A9: =A8.groups@o(${groupField};${method}(Amount):sumAmount)

A10: =A9.sort(sumAmount:-1).select(#<=10)
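The three steps can be mirrored in Python to show what each one contributes. This is a sketch under assumptions, not a translation of esProc semantics: it assumes every node returns its partial result already sorted by the group field, the sample data is made up, and top_n is set to 2 instead of 10 to keep the output short.

```python
from heapq import merge
from itertools import groupby
from operator import itemgetter

# Hypothetical partial results from two nodes, each sorted by empID.
node_results = [
    [("e1", 100.0), ("e2", 250.0)],
    [("e1", 300.0), ("e2", 25.0), ("e3", 50.0)],
]

# A8 analog: merge the sorted lists into one list ordered by the group field.
merged = list(merge(*node_results, key=itemgetter(0)))

# A9 analog: equal keys are now adjacent, so one pass can sum each group.
grouped = [(key, sum(amount for _, amount in rows))
           for key, rows in groupby(merged, key=itemgetter(0))]

# A10 analog: sort descending by the total, then keep positions 1..top_n.
top_n = 2
ranked = sorted(grouped, key=itemgetter(1), reverse=True)
top = [row for position, row in enumerate(ranked, start=1)
       if position <= top_n]
print(top)
```

Note that the sort key and the position test are passed in as expressions rather than fixed values, which is the Python counterpart of writing sumAmount:-1 and #<=10 inside the function calls.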

Parameters in the loop function

Look at the code in A10: =A9.sort(sumAmount:-1).select(#<=10)

sort(sumAmount:-1) sorts the two-dimensional table in A9 in descending order by the sumAmount field. select(#<=10) filters the sorted result, keeping only the records whose sequence numbers (represented by #) are not greater than 10.

The parameters of these two functions are not fixed values but computations: they can be formulas or functions. A parameter used this way is called a parameter formula. As can be seen here, parameter formulas make the syntax more agile: parameters become more flexible, function calls become more convenient, and the coding workload is greatly reduced.

From the example above, we can see that esProc can be used to write Hadoop code with an agile syntax. Doing so greatly reduces the cost of code maintenance and makes code reuse and migration more convenient. (To be continued.)
