Related Computing in esProc – Data Grouping and Summarizing

It is a common task to group records in a table sequence as needed and to summarize the data in each group during data analysis and statistics. This includes data summarizing like summing up and averaging, as well as the listing of the top n records, and etc. This article will explore how to group and summarize the data of a table sequence in esProc.

1. Compute distinct values

Of the fields of all the records, there are some that have different values, like sequence numbers, though lots of them could have duplicate values. Sometimes all the different values are required to be listed, and, in this case, id function will be used. For example:

	A
1	=file(“Order_Books.txt”).import@t()
2	=A1.id(SalesID)
3	=A1.id(PName)

Only part of the data of the table sequence in A1is displayed in the following window:

esProc_related_group_summarize_2

A2 gets all the different IDs of salespersons, i.e. SalesID; A3 gets all the different names of books, i.e. PName. Results of A2 and A3 are as follows:

esProc_related_group_summarize_3

As can be seen from the results, id function will sort values in ascending order when it is used to compute distinct values.

Actually, id function returns a result similar to that returned by the distinct statement in SQL.

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.id(STATE)
3	=demo.query(“select distinct STATE from EMPLOYEE”)

The following is a table sequence of employee information A1 selects from the database:

esProc_related_group_summarize_5

A2 gets all different states from the table sequence in A1 using id function; while A3 gets them from the database using the distinct statement. Results of A2 and A3 are as follows:

esProc_related_group_summarize_6

We can see that the two methods return the same result. The difference is that id function returns the result in sequence, while the SQL statement returns a result of table sequence from the database.

But sometimes we want to keep the original order of the data. To achieve this, @o option can be used. For example:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.id@o(STATE)

Now the result of A2 is like this:

esProc_related_group_summarize_8

It can be seen that the id@o function doesn’t sort the data, but it directly removes the neighboring data of the same value instead. So there could be duplicate values left in the result, like Texas, California and others in the above result.

Using id function, we can also list all the different results of a certain expression rather than use fields:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.id@o(age(BIRTHDAY))

A2 lists the ages of employees:

esProc_related_group_summarize_10

2. Equal grouping

We often need to group data according to a certain value, such as group employees by departments, group people by gender, and etc. In esProc, group function is used to group data of a table sequence or of a record sequence and return a sequence composed of groups. For example:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.group(DEPT)
3	=A1.group@o(DEPT)
4	=A1.group(year(BIRTHDAY))

A2 groups employees by DEPT. The result is shown below:

esProc_related_group_summarize_12

It can be seen that the result is a sequence whose members are groups, each of which is a record sequence, consisting of employee data of the same department, and is sorted by the department in ascending order.

group function can be used with @o option, which means keeping the original order and putting the neighboring data of the same values into one group. This is what A3 does. It only puts the neighboring employees of the same department into one group when grouping data by department. The result is as follows:

esProc_related_group_summarize_13

During data grouping, we can create a group only containing data that make a certain expression produce the same result. For example, A4 groups employees by the birth year. Result is as follows:

esProc_related_group_summarize_14

The operation of group function is similar to that of group by in SQL. Its return value is a sequence consisting of groups. Its final result is a sequence composed of sequences. The summary value of each group can be used for further computation based on the grouping result. This is different from SQL that hasn’t the explicit set data types and cannot store the grouping result. In SQL, each group of data must be summarized immediately when group by is executed, and after that the grouping result will be dumped and thus cannot be reused.

In esProc, we can summarize the result of each group after data grouping. For example:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.group(DEPT)
3	=A2.new(DEPT,count(~):Count,~.sum(SALARY):TotalSalary)
4	=A2.new(DEPT,~.count(GENDER==”M”):Male, ~.count(GENDER==”F”):Female)

Using the grouping result, A3 computes the number of employees and the total salary of each department. A4 computes the number of male employees and female employees in each department. The result of grouping and summarizing will be returned as a table sequence. Results of A3 and A4 are as follows:

esProc_related_group_summarize_16

We can see that A3 and A4 perform different summarizing operations on the same grouping result of A2. That the grouping result can be reused is one of the main features of esProc.

An explanation for the expression =A2.new(DEPT,count(~):Count,~.sum(SALARY):TotalSalary) in A3: As A2 is the grouping result of the A1’s table sequence, each member of A2 is a set of records, that is, a record sequence. Therefore when new function is used to perform loop computation on A2, the targets of its expressions are the record sequences. For example, DEPT means getting the value of DEPT field of the first record of the current member; ~.sum(SALARY) means summing up the SALARY field of the current record sequence.

Similar to the group by in SQL, group function also supports grouping multiple fields (expressions) simultaneously. In that case, only fields that have the same values can be grouped together.

3. Grouping and summarizing

In the previous section, we explained how to group the records in a table sequence and then summarize the records of each group using the grouping result. Alternatively we can use group function to directly get the grouping result, saving the trouble of computing it step by step. For example:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.group(DEPT;~.count():Count,~.sum(SALARY):TotalSalary)
3	=A1.group(DEPT;~.count(GENDER==”M”):Male, ~.count(GENDER==”F”):Female)
4	=A1.group(DEPT;round(~.avg(age(BIRTHDAY)),2):AverageAge)

Both A2 and A3 directly group and summarize the data by DEPT field, getting the same result as that got by step-by-step computation:

esProc_related_group_summarize_18

A4 computes the average age of employees in each department:

esProc_related_group_summarize_19

Actually esProc also provides groups function for directly computing the result of data grouping and summarizing. For example:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.groups(DEPT;count(~):Count,sum(SALARY):TotalSalary)
3	=A1.groups(DEPT;count(GENDER==”M”):Male, count(GENDER==”F”):Female)

A2 and A3 group and summarize the data by DEPT field directly. Notice that “~.” is abandoned here in the code. This is slightly different from the code in the above, but the result is the same as that got by grouping first and then summarizing:

esProc_related_group_summarize_21

The returned result of groups function for grouping and summarizing data directly is also a table sequence. groups function for performing data grouping and summarizing won’t record the data in each group, it only accumulates the results got by computing the expression. But compared with group function used in the above, groups function is more efficient.

Note: The use of groups function for directly grouping and summarizing data through accumulation requires that only simple aggregate functions, such as sum/count/max/min/topx/avg, can be used in summarizing expressions. For other more complicated ones, it is group function, instead of groups function, that should be used.

One thing worthy of note is that topx, one of the summarizing functions, can be used to compute the n smallest values. For example:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.groups(DEPT;topx(age(BIRTHDAY),3))

A2 computes the ages of three youngest persons in each department after data grouping:

esProc_related_group_summarize_23

topx can only computes the n smallest results. But if the ages of three oldest persons are needed, we can use the following method:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.groups(DEPT;topx(BIRTHDAY,3))
3	=A2.new(DEPT,#2.(age(~)))
4	=A1.groups(DEPT;topx(-age(BIRTHDAY),3))
5	=A4.new(DEPT,#2.(-~))

A2 first computes the 3 smallest birthdays in each department, and then A3 further computes the corresponding ages. Results of A2 and A3 are as follows:

esProc_related_group_summarize_25

Or we can make it by adding a negative sign to topx. For example, A4 gets the 3 smallest ages by negative values, and then A5 computes the desired ages. Results of A4 and A5 are as follows:

esProc_related_group_summarize_26

Notice that the results of A3 and A5 are the same.

1. Compute distinct values

2. Equal grouping

3. Grouping and summarizing

Hi，You need to fill in the Username and Email！