Set-based Thinking in esProc

esProc applies set theory in a uniquely deep, common way, compared with traditional programming languages. The sequence in esProc is in essence a type of set. It’s important to learn to think in sets all the time when working with esProc.

1.Sequences and sets in esProc

The sequence, as well as integer and string, is one of the most basic esProc data types. A variable value, the result of an expression and the return value that a function returns could all be a sequence.

For set type data, esProc provides basic operators for performing intersection, concatenation, union and difference operations between two sets A and B, respectively A^B，A|B，A&B，A\B. A deep understanding and skillful manipulation of these set operations will enable you to start approaching problems more by set thinking, fully and adroitly exploiting the supplied data to obtain a simple and easier solution.

The following is an example of using set operations to simplify code:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.select(GENDER==”M”)
3	=A1.select(STATE==”California”)
4	=A2^A3
5	=A1.select(GENDER==”M” && STATE==”California”)
6	=A2&A3
7	=A1.select(GENDER==”M” \|\| STATE==”California”)
8	=A2\A3
9	=A1.select(GENDER==”M” && STATE!=”California”)

A4 and A5 use different methods to find male employees from the state of California; A6 and A7 select male employees or employees from the state of California using different approaches; A8 and A9 finds male employees from states outside of California in different ways. Note that A6 and A7, despite having got the same employee information, present records in different orders in the results.

Unlike a mathematical set, the order of members in an esProc set, like a sequence or a table sequence, matters, and an esProc set can also have members which are identical.

	A
1	[1,2,3,4]
2	[1,3,3,2]
3	=[1,2,3]==[1,3,2]

A2’s sequence has duplicate members. A3 returns a false because the two sequences are not equal, in that they have different orders.

esProc_set_based_thinking_1

Mathematically, the intersection operation and union operation are both commutative; in other words, A∩B = B∩A and A∪B= B∪A. However, this commutative property is not valid in esProc because the order of members counts. The order of members in the result sets of the intersection and union operations are determined by the order of the left operand.

	A
1	[1,2,3]
2	[3,1,5]
3	=A1^A2
4	=A2^A1
5	=A1&A2
6	=A2&A1

The results of A3, A4, A5 and A6 are separately listed below:

esProc_set_based_thinking_2

Because an esProc sequence is ordered, you can’t simply use the comparison operator == to find whether two sequences have same members, but you should use the A.eq(B) function.

	A
1	=[1,2,3]==[3,2,1]
2	=[1,2,3]==[3,2,1].sort()
3	=[1,2,3].eq([3,2,1])
4	=[1,2,3].eq([3,2,2])
5	=[1,2,2,3].eq([3,2,1,2])
6	=[1,2,2,3].eq([3,2,3,1])

A1 and A2 check if two sequences are equal. The results are:

esProc_set_based_thinking_3

A3, A4, A5 and A6 determine if two sequences have same members. Results are respectively as follows:

esProc_set_based_thinking_4

2.Arguments in loop functions

With the set data type, you can handle many operations on members of sets in a single line of code, without having to write the loop code.

	A
1	[3,4,1,3,6]
2	=A1.sum()
3	=A1.avg()
4	=A1.max()-A1.min()

A2 calculates the sum of members in the sequence. A3 calculates their average. A4 finds the difference between the maximum member and the minimum member. Their results are respectively as follows:

esProc_set_based_thinking_5

Sometimes, the loop functions perform operations not on members of the set but on the values computed based on each of the members. In this case, you can use arguments to specify an expression in which “~” represents the current member.

	A
1	[3,4,1,3,6]
2	=A1.sum(~*~)
3	=demo.query(“select * from EMPLOYEE”)
4	=A3.min(~.BIRTHDAY)
5	=A3.min(BIRTHDAY)
6	=A3.avg(age(BIRTHDAY,HIREDATE))

A2 calculates the sum of the squares of members. Here’s the result:

esProc_set_based_thinking_6

A4, A5 and A6 perform computations based on the table sequence of employee information got by A3’s query. A4 finds the birth date of the oldest employee. Here’s the result:

esProc_set_based_thinking_7

“~” in A4’s expression can be omitted. So A5 gets the same result as A4 does.

A6 calculates the average hiring age of the employees. Here’s the result:

esProc_set_based_thinking_8

The execution of an aggregate function with arguments can be divided into two steps:

1) Add a computed column according to the arguments;

2) Aggregate the column.

That is, A.f(x) = A.(x).f(). For example, A1.sum(~*~) is equivalent to A1. (~*~).sum(), in which A1. (~*~) generates a computed column, i.e. calculating the square of each member and returning the results as a sequence.

For the nested loop functions, “~” is interpreted as a member in the innermost sequence. In such cases, if you want to reference a member in an outer sequence, precede “~” with the name of the sequence.

	A
1	[1,2,3]
2	[-1,-2,-3]
3	=A1.sum(A2.sum(~*~))
4	=A1.sum(A2.sum(A1.~*~))
5	=A1.sum(~*A2.sum(~))
6	=A1.sum(A1.sum(A1.~*~))

A3 calculates the result of multiplying 3 by the sum of squares of A2’s members. A4 calculates the sum of members of the cross product of sequence A1 and sequence A2, and A5 is another way of calculating the sum of members of the cross product of the two sequences. In A6, the expression in the inner loop can’t reference members of the sequence in the outer loop, so the result is 3 times of the sum of squares of A1’s members. The following are results of A3~A6:

esProc_set_based_thinking_9

A6’s case is also applicable to the field reference where ~ is omitted. The field will be interpreted as one in an inner record sequence; if such a field cannot be found in the inner record sequence, the program will search the next outer layer.

3.The order of loop

The program cycles through members of the original sequence, performing its computation on one member after another according to the arguments. This is a feature of which we can make full use.

	A
1	[1,3,2,5,4,8,7]
2	0
3	=A1.(A2=A2+~)
4	[1,1,0,0,1,0,0,0,1,0,1,0,0,0]
5	0
6	=A4.max(if(~==0,A5=A5+1,A5=0))

By running a loop, A3 calculates a sequence of cumulative sums of members in A1’s sequence:

esProc_set_based_thinking_10

A6 calculates the maximum number of successive 0s in A4’s sequence:

esProc_set_based_thinking_11

So in many cases, you can use just a single expression to achieve what a simple loop statement can do.

4.Sequence computing

Different from such loop functions as sum and avg that return a single aggregate value, the A.(x) function calculates each member of a sequence and returns a set. Thus another way of generating a new set is sequence computing, in addition to the basic set operations such as union, intersection and difference.

	A
1	[1,2,3]
2	=A1.(~*~)
3	=A1.(~)
4	=A1.()
5	=A1.(1)
6	=A1.(if(~%2==0,~,0))
7	I love you
8	=len(A7).(mid(A7,~,1))
9	=A8.countif(;~==”o”)

The code from A2 to A6 perform computations based on A1’s sequence: A2 calculates each member’s square; both A3 and A4 list members of the sequence in order; A5 lists 1 cyclically; A6 loops through members to return 0 if the member is an odd number, but return the current member value otherwise. Their results are respectively as follows:

esProc_set_based_thinking_12

A8 splits A7’s string to form a sequence of single characters. A9 counts how many times the letter o appears. Their results are as follows:

esProc_set_based_thinking_13

The new function is used to return a table sequence obtained by sequence computing.

	A
1	[1,2,3,4,5]
2	=A1.new(~:Origin,~*~:Square)
3	=demo.query(“select * from EMPLOYEE”)
4	=A3.new(NAME,age(BIRTHDAY):Age)
5	=A3.new(NAME)
6	=A3.(NAME)

A2 returns a new table sequence by cycling through members of A1’s sequence to compute each of them. “~” in the expression represents the current member. Here’s the result:

esProc_set_based_thinking_14

Based on A3’s table sequence, A4 and A5 respectively create a two-field table sequence and single-field table sequence. Here’re the results:

esProc_set_based_thinking_15

A6 runs a loop to get a sequence composed of values of NAME field, based on A3’s table sequence. Note that the result is different from that of A5:

esProc_set_based_thinking_16

In addition, there is a run function for sequence computing that returns the original sequence instead of the result of loop computation. Generally, it is used to assign values to fields in a record sequence (or table sequence).

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.new(NAME,age(BIRTHDAY):Age)
3	=A2.run(Age=Age+1)

A2 creates a new table sequence where names of employees are listed and their ages are calculated. Based on this table sequence, A3 adds 1 to each employee’s age. As the run function only modifies A2’s original table sequence, it returns the same table sequence A3 creates. The following step-by-step execution shows how A2’s table sequence changes:

esProc_set_based_thinking_17

In a nested loop function, you can’t use ~, or A.~, to represent members of the same sequence in different layers of the loop. To list all possible two-letter combinations according to the letters A, B and C, for example:

	A
1	[A,B,C]
2	=A1.(A1.(A1.~+~)).conj()
3	=A1.((a=~,A1.(a+~))).conj()
4	=A1.((a=~,A1.((b=~,A1.(a+b+~))))).conj().conj()

A2 uses A.~ to represent members of the outer loop. But since the inner loop cycles the same sequence A1, actually A2’s expression is equivalent to =A1.(A1.(~+~)).conj(), which returns the following result:

esProc_set_based_thinking_18

In order to tackle the problem, you can use two parenthesis operators to enclose a series of expressions – in the form of (x₁,x2,…,x_k) – in performing a loop computation. The expressions will be evaluated in order and the function will return the result of the last expression. With the parentheses, A3 first assigns value to the program variable a in the outer loop, and then references data in the outer loop through a in the inner loop. A4 takes a step further to perform a triple-layer loop computation in which two program parameters are specified to pass data from an outer loop to the next inner loop. Here’re results of A3 and A4:

esProc_set_based_thinking_19

5.Impure sets

esProc places no restriction on the data types of the members in a sequence, allowing numbers, strings and records to get together to form a sequence.

	A
1	[1,a3,2,5.4,$[4.5],2011-8-8]
2	=[A1,4]

A1’s sequence contains members of various data types. A2’s sequence is composed of A1’s sequence and a single-value member. The following lists data in A1 and A2:

esProc_set_based_thinking_20

Most of the time in the real-world applications, however, there’s little significance to arrange data of various types in one sequence. Therefore, it is not necessary to pay too much attention to it.

Yet a record sequence – a sequence consisting of records – may consist of records from different table sequences. From this point of view, the feature offers concrete convenience.

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=demo.query(“select * from FAMILY”)
3	=A1\|A2
4	=A3.count(left(GENDER,1)==”F”)

A4 counts the females in employees and their families. You can perform the computation provided that both employee table and family table contain a GENDER field, regardless of their data structures.

In esProc, it doesn’t matter that records in a record sequence originate from different table sequences. As long as they have fields with the same names, the records can be processed uniformly. This enables simpler program writing, higher efficiency and less memory usage. In SQL, however, two tables with different structures must be united into a new one by using the UNION clause before any operations can be performed.

6.Sets consisting of subsets

Since esProc lets any object to be a member of a set, a set itself can be a member of a larger set. If A is a set consisting of other sets, you can use functions A.conj(), A.union(), A.diff(), and A.isect() to compute concatenation, union, difference and intersection of the subsets of A.

	A
1	[[1,2,3,4,5],[1,3,5,7,9],[2,3,5,7]]
2	=A1.conj()
3	=A1.isect()
4	=A1.(~.sum())
5	=A1.(~.(~*~))

A1 is a sequence comprising other sequences. According to A1’s sequence, A2, A3, A4 and A5 respectively obtain a concatenation sequence, an intersection sequence, a sequence containing sums of members in the sub-sequences, and a sequence composed of sequences generated each by calculating the squares of members in each sub-sequence. Their results are as follows:

esProc_set_based_thinking_21

A record sequence can also be a member of a sequence.

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.select(STATE==”California”)
3	=A1.select(STATE==”Indiana”)
4	=A1.select(STATE==”Florida”)
5	=[A2,A3,A4]
6	=A5.(~.count())
7	=A5.(~(1).STATE)
8	=A5.(STATE)
9	=A5.new(STATE,~.count():Count)

A2, A3 and A4 respectively retrieve records of employees from California, Indiana and Florida. A5 generates a sequence consisting of these three record sequences. It is a set of sets.

esProc_set_based_thinking_22

A6 counts the employees from each of the three states. Here’s the result:

esProc_set_based_thinking_23

A7 gets names the states. ~(1) in the expression can be omitted, so expressions in A8 and A7 are equivalent and they have the same results:

esProc_set_based_thinking_24

A9 creates a table sequence, based on A5’s sequence, to count the employees from each state. Here’s the result:

esProc_set_based_thinking_25

7.Understanding group operation

Data grouping is a common SQL operation. But its real meaning is still far from being fully grasped by most of us. In essence, grouping data is splitting a set into multiple subsets according to a certain rule. In other words, the return value of a group operation is a set consisting of subsets. Often what may be of interest to people isn’t the set itself, but certain aggregate values of its subsets. Hence, group operations are often followed by aggregate operations on subsets.

This is the very way that SQL handles data grouping. SQL GROUP BY clause is always followed by an aggregate operation. Another reason for SQL to impose an aggregate operation on a group operation is that it hasn’t the explicit set data type for returning a set containing subsets directly.

This explains why one has got used to the conception that the group operations should go hand in hand with the aggregate operations and forgotten that they are in fact two independent operations.

However, sometimes you might still be interested in the subsets, instead of the aggregate values. Even if we might only care about the aggregate values, the subsets are worth retaining for reuse in subsequent computations, rather than being discarded once an aggregate operation is completed and thus being re-generated if needed.

This requires restoring the true sense of group operations. With set-based thinking fully embedded, esProc has achieved this requirement. It separates the aggregate operations from the group operations in its basic group functions.

	A	B
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.group(month(BIRTHDAY),day(BIRTHDAY))	/Group by Birthday
3	=A2.select(~.len()>1)	/Employees with same birthday as others
4	=A3.conj()
5	=A1.group(STATE)	/Group by State
6	=A5.new(~(1).STATE:State,~.count():Count)	/Use the grouping result to create a TSeq and counts the employees from each state
7	=A5.new(STATE,~.avg(interval@y(BIRTHDAY,now())):Age)	/Use the grouping result again to create another TSeq and calculates the average age of employees from each state

The grouping result is a set of subsets, and the subsets can still be grouped. Each member of the grouping result is also a set, which can also be further grouped. Both operations will produce a multilayer set.

	A	B
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.group(year(BIRTHDAY))	/Group by the year of birth
3	=A2.group(int(year(~(1).BIRTHDAY)%100/10))
4	=A2.group(int(year(BIRTHDAY)%100/10))
5	=A2.(~.group(month(BIRTHDAY)))	/Group members of the grouping result; A3,A4 and A5 respectively return a sequence consisting of RSeqs

Because these results have too many layers, they are rarely used in real-world business. The example is cited only to show the set-based thinking pattern and the nature of the set operations.

esProc group function groups data and sorts the grouping result according to the grouping expression. For example:

	A	B
1	$ select EID,NAME+’ ‘+SURNAME FULLNAME, DEPT from EMPLOYEE
2	=A1.group(DEPT)	=A2.new(~.DEPT:DEPT,~.count():Count)
3	=A2.sort(~.DEPT:-1)	=A3.new(~.DEPT:DEPT,~.count():Count)
4	=A1.group@u(DEPT)	=A4.new(~.DEPT:DEPT,~.count():Count)
5	=A1.group@o(DEPT)	=A5.new(~.DEPT:DEPT,~.count():Count)

A1 gets a table sequence as follows:

esProc_set_based_thinking_26

A2 groups employee records by department. By default, A2 will sort the grouping result by department names in alphabetically ascending order. For the sake of viewing, B2 counts how many employees in each department after the data grouping. Here’re results of A2 and B2:

esProc_set_based_thinking_27

A3 re-sorts A2’s grouping result by department in descending order. B3 also calculates the number of employees in each department. Here’re results of A3 and B3:

esProc_set_based_thinking_28

Apart from re-sorting the grouping result, you can change the orders of the resulting groups by adding certain options to the group function. By using @u option, A4 is able to produce groups of departments that are arranged in order of their appearance in the original employee table. With @o option, A5 simply places the neighboring records with the same department values into one group without first sorting the records, resulting in duplicate department groups. B4 and B5 count the employees in each group for the two group operations respectively, and here’re the results:

esProc_set_based_thinking_29

8.Non-equi grouping

Besides common group functions, esProc also provides align@a() function for performing alignment grouping and enum() function for enumeration grouping.

The data grouping implemented by the group function is called equi-grouping, which has the following features:

1) Any member in the original set must and can only be in one subset;

2) There is no empty subset.

Both the alignment grouping and the enumeration grouping haven’t the two features.

Alignment grouping is an operation that calculates the grouping expression with members of a set and groups the result set by mapping members with values of a specified sequence. To perform the alignment grouping, the following steps are required:

1) Specify a set of values;

2) Put members in the set to be grouped whose certain attribute matches one of the specified values to one subset;

3) Each resulting subset must correspond to a pre-defined value.

It is possible that a certain member exists in none of the subsets, or that an empty set appears, or that a member appears in two subsets.

The following example groups records of employees by a specified sequence of states:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	[California,Florida,Chicago]
3	=A1.align@a(A2,STATE)

A3 performs alignment grouping according to A2’s sequence. It maps the name of each state with members in A2. With this type of grouping, it is probably that certain employees don’t belong to any of the groups, or that empty groups containing not a member appear. A3’s result is as follows:

esProc_set_based_thinking_30

Enumeration grouping is defined as follows: First, specify a set of conditions, take members in the set to be grouped as arguments to evaluate the conditional expressions, and group members satisfying different conditions into different subsets, each of which corresponds to one of these pre-defined conditions. Maybe a certain member can’t be placed any of these subsets, or can be classified in two subsets at the same time; and an empty set may appear.

The following example groups records of employees by specified age groups:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	[?<=30,? <=40,?>40]
3	=A1.enum(A2,age(BIRTHDAY))
4	[?<=30,?>20 && ?<=40,?>50]
5	=A1.enum@r(A4,age(BIRTHDAY))

A3 performs enumeration grouping according to A2’s conditional sequence. An employee record in A1 can be in none of the groups but is not allowed to exist in two groups at the same time. A3’s result is as follows:

esProc_set_based_thinking_31

In this case, an employee may be put into the group satisfying the first condition. Though employees who are 30 or below are at the same time satisfies the condition of 40 or below, they will not be distributed to the second group.

A5 also performs enumeration grouping according to A2’s conditional sequence. Here @r option is used to allow repeated grouping. So an employee record may appear in more than one group. A4’s result is as follows:

esProc_set_based_thinking_32

Though it seems that these two functions differ greatly with group function, you will understand, having the nature of group operations in mind, that they actually do the same thing – splitting a set into multiple subsets. The only difference is that they split sets in different ways.