In Related Computing in esProc – Data Grouping and Summarizing, we learned how to group data in a table as needed and to summarize the grouped data. Please note that, in data grouping and summarizing, data will be grouped by a single or multiple fields or an expression in ascending order based on equivalent values. Alignment grouping or enumeration grouping will be used in data analysis when sometimes data need to be sorted according to a specified order, or to be grouped by specified conditions.
1. Alignment
Usually sort function is used to sort data in a sequence or a table sequence, with a result being listed either in ascending order or descending order. If data need to be sorted in a specified order, esProc provides align function. For example:
A | B | |
1 | [three,one,four,six,two] | [one,two,three,four,five,six] |
2 | =A1.sort() | =A1.align(B1) |
3 | =demo.query(“select EID,NAME,GENDER, STATE,DEPT from EMPLOYEE”) | [California,Texas,Florida,Illinois] |
4 | =A3.align(B3,STATE) |
The sequence in A1 is composed of string members, and B1 provides a base sequence for alignment:
Both A2 and B2 re-sort members of the sequence in A1 according to the sequence in B1. A2 uses the common sort function, while B2 uses align function. Results of A2 and B2 are as follows after computing:
sort function only sorts the members alphabetically in ascending or descending order, while align function sorts them according to the positions of members of the base sequence. If there is not a corresponding one for a member in the base sequence, the corresponding position will be null.
align function is generally used to sort records of a table sequence or a record sequence. For example, A3 selects some employees, and A4 gets from them the employees of some states according to the specified order. The data in A3 are as follows:
According to the order of the sequence in B3, A4 gets the employees of some states:
With the alignment operation, the first employees of these states have been retrieved and a record sequence composed of these records has been returned.
2. Alignment grouping
In the previous example, only the first eligible row for each state has been got after the alignment operation. In many cases, however, we need to get all the eligible data. This requires the alignment grouping which can be realized with align@a. For example:
A | B | |
1 | =demo.query(“select EID,NAME,GENDER, STATE,DEPT from EMPLOYEE”) | [California,Texas,Florida,Illinois] |
2 | =A1.align@a(B1,STATE) | |
3 | =A2.new(STATE,~.count(GENDER==”M”):Male,~.count(GENDER==”F”):Female) | |
4 | =A1.align@n(B1,STATE) |
By adding @a option to the align function, A2 gets all eligible records for each state:
The return value of alignment grouping is a sequence consisting of all the groups, as the result of A2, which is a sequence composed of sub-sequences. Similar to the result of data grouping using group function, the result of alignment grouping using align@a is often used for further computation. For example, A3 further summarizes the data in A2 to get the number of male and female employees:
In alignment grouping, there may be some data that cannot be matched with members of the base sequence. For this problem, esProc provides another option for alignment grouping – align@n, the use of which will put all members that haven’t corresponding ones in the base sequence into a separate group. As the grouping result of A4 shown below:
By comparing the results of alignment grouping of A2 and A4, we can notice that the result of A4 has one more group which stores the records the employees who are not from the four specified states.
The base sequence for alignment grouping can be got by computing. For example:
A | |
1 | =demo.query(“select EID,NAME,GENDER, STATE,DEPT from EMPLOYEE”) |
2 | =A1.groups(STATE;count(~):Count) |
3 | =A2.sort(Count:-1) |
4 | =A1.align@a(A3:STATE,STATE) |
A2 computes the number of employees of each state using groups function for data grouping and summarizing, and then A3 sorts the result by the number of employees in descending order. Results of A2 and A3 are as follows:
A4 gets the states from the result of A3 and computes the base sequence for alignment. The result is as follows:
Another method of alignment grouping is to divide data into n groups using P.align(n,y), and then directly compute the group numbers corresponding to these groups according to the grouping expression y. For example:
A | |
1 | =demo.query(“select EID,NAME,GENDER, STATE,DEPT from EMPLOYEE”) |
2 | =A1.align@a(26,asc(left(NAME,1))-64) |
3 | [HR,R&D,Finance,Marketing] |
4 | =A1.align@a(A3.len(),A3.pos(DEPT)) |
In the expression =A1.align@a(26,asc(left(NAME,1))-64) in A2, left(NAME,1) gets the initials of the names of the employees, which are converted by asc function into ASCII code, and, with the subtraction of 64 from it, the initials will correspond to the group numbers from 1 to 26. When the code for alignment grouping is executed, the following result will be got:
The employees in the same group have the same initials.
Sometimes position functions, like pos and pmax, are used to compute the group numbers while P.align(n,y) is used to perform alignment grouping. A4 groups the employees in alignment according to the sequence of specified departments. The result is as follows:
With group function for data grouping, each group of data will be sorted by values of the field for grouping in ascending order; while the alignment grouping can produce a result sorted in specified order.
3. Find satisfied condition(s)
Sometimes data are to be grouped not by the equivalent values but by some conditions, like the range of the order amount and the employees’ age groups.
The grouping conditions can be represented by strings, like “?>3“, “[2,3,5,7].pos(?)>0“, and etc. penum function is used to judge which condition(s) a number can satisfy in a sequence formed by strings of conditional expressions. For example:
A | |
1 | [?>=85,?>=70,?>=60] |
2 | =A1.penum(100) |
3 | =A1.penum(66) |
4 | =A1.penum(54) |
There is a sequence consisting of strings of conditional expressions in A1:
Respectively, A2 and A3 find the condition(s) 100 and 66 satisfy. Results are as follows:
As can be seen from the result, when a certain number satisfy multiple conditions, only the sequence number of the first satisfied condition will be returned. 100, for example, can satisfy all the three conditions, but the returned result is 1.
As the number 54 in the expression in A4 doesn’t satisfy any of the conditions, the returned result will be null.
Similar to the alignment grouping, penum function also uses @n option, which returns k+1 when data don’t satisfy any of the k conditions. In particular, if a conditional expression is null, it can be satisfied by any data. For example:
A | |
1 | [?>=85,?>=70,?>=60] |
2 | =A1.penum@n(54) |
3 | [?>=85,?>=70,?>=60,null] |
4 | =A1.penum(54) |
Results of A2 and A4 are as follows:
The number 54 in A2 doesn’t satisfy any of the three conditions, so penum@n function is used to return a result of 4. But in A4, it satisfies the fourth condition – a null. Please note if the null is listed in the conditions, it must be put at the end.
penum function also uses @r option to represent that the satisfied conditions for different numbers can be identical. In this case, a sequence composed of the sequence numbers of all satisfied conditions will be returned. For example:
A | |
1 | [?>=85,?>=70,?>=60,null] |
2 | =A1.penum(100) |
3 | =A1.penum(66) |
4 | =A1.penum(54) |
A2, A3 and A4 return respectively a sequence composed of the sequence number(s) of the condition(s) 100, 66 and 54 satisfy. Results are as follows:
Note: Though the condition null is defined to be satisfied by any data, its sequence number won’t appear in the result if one (or above) more conditions can also be satisfied when penum@r function is doing its job. You can also notice that, when @r option works, the returned result is a sequence rather than a number even if there is only one satisfied condition.
4 . Enumeration grouping
Suppose there is a sequence E consisting of strings of conditional expression. You can group members of another sequence P using P.enum(E,y) function. The operation will compute the expression y using each member of sequence P, judge which condition(s) each computed result satisfies, and group members of P according to conditions they satisfy. By default, members of P will be grouped according to their first satisfied conditions. For example:
A | |
1 | =demo.query(“select EID,NAME,GENDER, BIRTHDAY,SALARY from EMPLOYEE”) |
2 | [?>=15000,?>=12000,?>=9000] |
3 | =A1.enum(A2,SALARY) |
A1 gets a table sequence and A2 is the sequence of enumeration conditions. They are as follows:
According to the sequence of conditions, the result of enumeration grouping in A3 is as follows:
It can be seen that the data of each group produced by enumeration grouping don’t have the same value of SALARY, but they satisfy the same condition. Similarly, though a row may satisfy both the first condition and the other two conditions, it will, by default, be put into the group where only the first one is satisfied. For the data that don’t satisfy any of the conditions, like the first and the fourth row , they won’t be grouped.
Similar to the case where penum function is used, you can append a null condition at the end, under which all data that don’t satisfy the previous conditions can be placed, while performing enumeration grouping. Or you can use enum@n, which plays a similar role. For example:
A | |
1 | =demo.query(“select EID,NAME,GENDER, BIRTHDAY,SALARY from EMPLOYEE”) |
2 | [?>=15000,?>=12000,?>=9000] |
3 | =A1.enum@n(A2,SALARY) |
4 | [?>=15000,?>=12000,?>=9000,null] |
5 | =A1.enum(A2,SALARY) |
Results of A3 and A5 are same:
It can be seen by comparing the result with that of the previous example, both the use of enum@n function and the appending of null at the end of the sequence of conditions will put all data that don’t satisfy the other conditions into a separate group.
By default, it is assumed that there are no overlapped groups when enum function is used to compute, that is, each member of P won’t satisfy two conditional expressions at the same time. @r option is needed if some members need to be grouped more than once. For example:
A | |
1 | =demo.query(“select EID,NAME,GENDER, BIRTHDAY,SALARY from EMPLOYEE”) |
2 | [?>=15000,?>=12000,?>=9000] |
3 | =A1.enum@r(A2,SALARY) |
A3 performs overlapped enumeration grouping on members using enum@r. Result is as follows:
As can be seen from the result, members of the first group appear in all the three groups and members of the second group appear repeatedly in the latter two groups after the data are overlappedly grouped.