esProc Set Thinking – Raqsoft Blog

1. Sequences and sets in esProc

Unlike traditional programming languages, esProc employs sets commonly. In fact, the sequence in esProc belongs to the field of set. So it’s quite important to deeply understand the concept of sets when using esProc. Like an integer and a string, the set is a basic esProc data type; it could be a variable value, a computation of an expression, and a return value of a function.

For this data type, esProc provides operators for operations between two sets A and B, like intersection, concatenation, union, difference: A^B，A|B，A&B，A\B. It is recommended users should deeply understand and master these set operations, in order to start approaching problems more by set thinking. The set-based thinking approach can make good use of existing data to bring about an easier solution.

The following is an illustration of using set operations to simplify code:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.select(GENDER==”M”)
3	=A1.select(STATE==”California”)
4	=A2^A3
5	=A1.select(GENDER==”M” && STATE==”California”)
6	=A2&A3
7	=A1.select(GENDER==”M” \|\| STATE==”California”)
8	=A2\A3
9	=A1.select(GENDER==”M” && STATE!=”California”)

A4 and A5 use different methods to find male employees from the state of California; A6 and A7 select male employees and employees from the state of California using different approaches; A8 and A9 finds male employees from states outside of California in different ways. Note that A6 and A7, despite having got the same employee information, have different orders in which records appear in the results.

Unlike a mathematical set, a set in esProc is arranged orderly and allows duplicate members in it, like a sequence or a table sequence.

	A
1	[1,2,3,4]
2	[1,3,3,2]
3	=[1,2,3]==[1,3,2]

For instance there are duplicate membes in A2’s sequence. Members in A3’s two sequences have different orders, which makes us believe they are not equal and thus the comparison result is false:

esProc_deep_set_thinking_1

Mathematically, the intersection operation and union operation are both commutative; in other words, A∩B º B∩A and A∪B º B∪A. However, this commutative property is not valid for esProc because the member order in a sequence counts. For esProc, the result of the intersection / union operation is required to be arranged according to the order of the left operand.

	A
1	[1,2,3]
2	[3,1,5]
3	=A1^A2
4	=A2^A1
5	=A1&A2
6	=A2&A1

The computed results of A3, A4, A5 and A6 are separately listed below:

esProc_deep_set_thinking_2

Because of this ordering feature of sequence members, we should adopt A.eq(B) function to judge whether two sequences have the same members instead of simply using the comparison operator == :

	A
1	=[1,2,3]==[3,2,1]
2	=[1,2,3]==[3,2,1].sort()
3	=[1,2,3].eq([3,2,1])
4	=[1,2,3].eq([3,2,2])
5	=[1,2,2,3].eq([3,2,1,2])
6	=[1,2,2,3].eq([3,2,3,1])

A1 and A2 judge if the two sequences are equal. The results are:

esProc_deep_set_thinking_3

A3, A4, A5 and A6 judge if the two sequences have the same members. Results are respectively as follows:

esProc_deep_set_thinking_4

2. Parameters in a loop function

With the data type of set, you can handle many operations on members of sets in a single line of code, without the need of loop code.

	A
1	[3,4,1,3,6]
2	=A1.sum()
3	=A1.avg()
4	=A1.max()-A1.min()

A2 calculates the sum of members in the sequence; A3 calculates their average; and A4 finds the difference between the maximum value and the minimum one. Their results are respectively as follows:

esProc_deep_set_thinking_5

Sometimes, the loop functions will not process the set members but the values computed based on these members. In this case, you can use parameters to construct an expression in a function, in which “~” represents the current member.

	A
1	[3,4,1,3,6]
2	=A1.sum(~*~)
3	=demo.query(“select * from EMPLOYEE”)
4	=A3.min(~.BIRTHDAY)
5	=A3.min(BIRTHDAY)
6	=A3.avg(age(BIRTHDAY,HIREDATE))

A2 calculates the sum of every sequence member’s square. The result is:

esProc_deep_set_thinking_6

A4, A5 and A6 perform computations based on the table sequence of employee information queried by A3. A4 finds the birth date of the oldest employee. The result is:

esProc_deep_set_thinking_7

“~” in A4’s expression can be omitted; A5 gets the same result as A4 does.

A6 calculates the average hire age of all the employees. The result is:

esProc_deep_set_thinking_8

For the nested loop functions, “~” is interpreted as a member in an inner sequence; in such cases, if you want to reference a member in an outer sequence, precede “~” with the name of the outer sequence.

	A
1	[1,2,3]
2	[-1,-2,-3]
3	=A1.sum(A2.sum(~*~))
4	=A1.sum(A2.sum(A1.~*~))
5	=A1.sum(~*A2.sum(~))
6	=A1.sum(A1.sum(A1.~*~))

A3 calculates the result of multiplying 3 by the sum of squares of A2’s members. A4 calculates the sum of cross products of members in sequence A1 and sequence A2; A5 is another way of calculating the sum of cross products of members of the two sequences. In A6, the expression in the inner loop fails to reference members of the sequence in the outer loop, so the result is 3 times of the sum of squares of A1’s members. The following are results of A3~A6:

esProc_deep_set_thinking_9

The above rule is also applicable to the field reference where ~ is omitted. In such a case, a field will be first considered as one in an inner record sequence; if such a field cannot be found in the inner record sequence, esProc will search for it in the outer record sequence.

3. Order of loop

The arguments in a loop function will compute according to order of members in the original sequence. This is a feature of which we can make full use.

	A
1	[1,3,2,5,4,8,7]
2	0
3	=A1.(A2=A2+~)
4	[1,1,0,0,1,0,0,0,1,0,1,0,0,0]
5	0
6	=A4.max(if(~==0,A5=A5+1,A5=0))

By running a loop, A3 calculates a sequence of cumulative sums of members in A1’s sequence:

esProc_deep_set_thinking_10

A6 calculates the maximum number of successive 0s in A4’s sequence:

esProc_deep_set_thinking_11

In many cases, we can use just a single expression to implement the function which can be achieved by simple loop code.

4. Sequence computing

Different from such loop functions as sum and avg that return a single aggregate value, the A.(x) function for computation on a sequence returns a set. Often, a new set can be created by sequence computing, in addition to using the basic set operations such as union, intersection and difference.

	A
1	[1,2,3]
2	=A1.(~*~)
3	=A1.(~)
4	=A1.()
5	=A1.(1)
6	=A1.(if(~%2==0,~,0))
7	I love you
8	=len(A7).(mid(A7,~,1))
9	=A8.countif(;~==”o”)

The code from A2 to A6 perform computations based on A1’s sequence: A2 calculates each member’s square; both A3 and A4 list sequence members sequentially; A5 lists 1 cyclically; In A6’s loop computation, if the member in A1 is an odd number, return 0; otherwise return the member value. Their results are respectively as follows:

esProc_deep_set_thinking_12

A8 presents A7’s string as a sequence of single characters; A9 counts the number of times letter o appears. Their results are:

esProc_deep_set_thinking_13

The execution of an aggregate function with arguments can be divided into two steps:

1) Use arguments to produce computed column;

2) Aggregate the result.

That is, A.f(x) = A.(x).f()

The new function is used to return a table sequence by sequence computing.

	A
1	[1,2,3,4,5]
2	=A1.new(~:Origin,~*~:Square)
3	=demo.query(“select * from EMPLOYEE”)
4	=A3.new(NAME,age(BIRTHDAY):Age)
5	=A3.new(NAME)
6	=A3.(NAME)

A2 returns a new table sequence by looping through and computing members of A1’s sequence. “~” in the expression represents the sequence member on the current loop. The result is:

esProc_deep_set_thinking_14

Based on A3’s table sequence, A4 and A5 respectively create a new table sequence with two fields and one with only one field. Their results are:

esProc_deep_set_thinking_15

A6 runs a loop to get a sequence composed of values of NAME field, based on A3’s table sequence. Note that the result is different from that of A5:

esProc_deep_set_thinking_16

In addition, there is a run function for computing sequence. This function returns the original sequences themselves instead of the result of alignment computation. Generally, it is used to assign values to fields in a record sequence (table sequence).

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.new(NAME,age(BIRTHDAY):Age)
3	=A2.run(Age=Age+1)

A2 creates a new table sequence, enters names of employees in it and calculates their ages. Based on this table sequence, A3 adds 1 to each employee’s age. As it is the data of original table sequence in A2 that run function modifies, the results of A2 and A3 are the same. The following step-by-step execution shows how A2’s table sequence changes:

esProc_deep_set_thinking_17

5. Impure sets

esProc has no restriction on the consistency of sequence member types, that is, a sequence may consist of numbers, strings, and records.

	A
1	[1,a3,2,5.4,$[4.5],2011-8-8]
2	=[A1,4]

A1’s sequence includes members of various data types. A2’s sequence is composed of A1’s sequence and a single-value member. The following lists data in A1 and A2:

esProc_deep_set_thinking_18

However, in many cases, it would be of little practical significance to arrange data of various types in one sequence. Therefore, it is not necessary to pay too much attention to it.

On the other hand, a record sequence – a sequence consisting of records – may consist of records from different table sequences; in that case, this feature offers concrete convenience.

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=demo.query(“select * from FAMILY”)
3	=A1\|A2
4	=A3.count(left(GENDER,1)==”F”)

A4 counts the number of females in employees and their families. We can perform the computation provided that both employee table and family table contain GENDER field, regardless of their different data structures.

Under esProc, it is not necessary that records in a record sequence originate from the same table sequence. As long as they have the same field names, the records can be processed uniformly. Here, merits of esProc include simpler program writing, higher efficiency and reduced occupation of memory. For SQL, however, two tables with different structures must be united into a new one by using the UNION clause before performing any operations.

6. Sets consisting of subsets

Since the member of a set in esProc is less restrictive by nature, a set may be a member of another set. If A is a set consisting of other sets, functions A.conj(), A.union(), A.diff(), A.isect() could be employed to compute concatenation, union, difference and intersection between subsets of A.

	A
1	[[1,2,3,4,5],[1,3,5,7,9],[2,3,5,7]]
2	=A1.conj()
3	=A1.isect()
4	=A1.(~.sum())
5	=A1.(~.(~*~))

A1 is a sequence consisting of other sequences. According to A1’s sequence, A2, A3, A4 and A5 respectively obtain a concatenation sequence, an intersection sequence, a result sequence containing sums of members in every sub-sequence, and a new sequence including multiple sequences, members of each of which are squares of each member in every original sub-sequence. These result sequences are as follows:

esProc_deep_set_thinking_19

Also, a record sequence may be a member of a sequence.

	A
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.select(STATE==”California”)
3	=A1.select(STATE==”Indiana”)
4	=A1.select(STATE==”Florida”)
5	=[A2,A3,A4]
6	=A5.(~.count())
7	=A5.(~(1).STATE)
8	=A5.(STATE)
9	=A5.new(STATE,~.count():Count)

A2, A3 and A4 respectively retrieve data of employees from California, Indiana and Florida. A5 generates a sequence consisting of these three record sequences. It is a set of sets.

esProc_deep_set_thinking_20

A6 counts the number of employees from the three states, with the following result:

esProc_deep_set_thinking_21

A7 gets name of each of the states. ~(1) in the expression can be omitted, so expressions in A8 and A7 are equal and their results are the same:

esProc_deep_set_thinking_22

A9 creates a table sequence, based on A5’s sequence, to count number of employees from each state. The result is:

esProc_deep_set_thinking_23

7. Understanding group operation

The group operation is commonly used in SQL. However, not everyone can grasp its essence. By nature, it is to split a set into multiple subsets according to a certain rule. In other words, the return value of a group operation shall be a set consisting of subsets. However, people often do not need to use such a set; instead, they may need to view a part of aggregate values of its subsets. Therefore, group operations are often followed by aggregate operations on subsets.

This is what SQL does. The GROUP BY clause is always followed by an aggregate operation. Because there is no explicit set data type in SQL, a set consisting of subsets cannot be returned directly. Therefore, there is no other choice but to append aggregate operations to group operations.

This explains why one has got used to the conception that the group operations should go hand in hand with the aggregate operations and forgotten that they are in fact two independent operations.

However, sometimes we would still be interested in these grouped subsets, instead of the aggregate values. Even if we might only be interested in the aggregate values, these subsets are worth retaining for reuse in subsequent computations, rather than being discarded once an aggregate operation is completed and being re-generated if needed.

This requires us to understand group operations in its true sense. With set thinking embedded, esProc can achieve this goal. In fact, the basic group functions in esProc are used only for data grouping by removing the aggregate operations.

	A	B
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.group(month(BIRTHDAY),day(BIRTHDAY))	/Group by Birthday
3	=A2.select(~.len()>1)	/Employees with same birthday as others
4	=A3.conj()
5	=A1.group(STATE)	/Group by State
6	=A5.new(~(1).STATE:State,~.count():Count)	/Use the group result to create a TSeq and counts number of employees from each state
7	=A5.new(STATE,~.avg(interval@y(BIRTHDAY,now())):Age)	/Use the group result again to create another TSeq and calculates the average age of employees from each state

The group result is a set consisting of subsets, and the subset can be grouped too. Members in the group result are also sets, which can also be grouped. Both of the two operations will produce a multilayer set.

	A	B
1	=demo.query(“select * from EMPLOYEE”)
2	=A1.group(year(BIRTHDAY))	/Group by the year of birth
3	=A2.group(int(year(~(1).BIRTHDAY)%100/10))
4	=A2.group(int(year(BIRTHDAY)%100/10))
5	=A2.(~.group(month(BIRTHDAY)))	/ Group members of the group result; A3/A4/A5 returns a sequence consisting of record sequences

Because these results have too many layers, they are rarely used in practice. The above example is only to show you the set-based thinking pattern and the nature of the set operations.

8. Non-equal grouping

Besides the conventional group function, esProc provides align@a() function for perfoming alignment grouping and enum() function for performing enumeration grouping.

The data grouping implemented by the group function is called equal grouping with the following features:

1) Any member in the original set must and can only be in one subset;

2) There is no empty subset.

The above two features are unavailable for the alignment grouping or the enumeration grouping.

Alignment grouping is an operation that calculates grouping expressions with members in a set and matches the result subsets with values in a specified sequence. For the alignment group, the following steps are required:

1) Specify a group of values;

2) In the set to be grouped, put members whose certain property values are equal to the specified values to one subset;

3) Each result subset must correspond to a pre-defined value.

It is possible that a certain member exists in none of the subsets, or an empty set appears, or a member exists in both subsets.

The following case will group employees by specified states sequence:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	[California,Florida,Chicago]
3	=A1.align@a(A2,STATE)

A3 performs alignment grouping by A2’s sequence. It matches name of each state with members in A1. This type of grouping may produce the result that certain employees belong to none of the groups, or empty groups that contain not a member at all. A3’s result is as follows:

esProc_deep_set_thinking_24

Enumeration grouping is defined as follows: First, specify a group of conditions, take the members in the set to be grouped as arguments and compute these conditions, members satisfying these conditions will be grouped into a subset; each subset is corresponding to a pre-defined condition. A member may be in none of these subsets, or in two subsets at the same time; besides, an empty set may appear.

The following case will group employees by specified age groups:

	A
1	=demo.query(“select * from EMPLOYEE”)
2	[?<=30,? <=40,?>40]
3	=A1.enum(A2,age(BIRTHDAY))
4	[?<=30,?>20 && ?<=40,?>50]
5	=A1.enum@r(A4,age(BIRTHDAY))

A3 performs enumeration grouping by A2’s conditional sequence. An employee record in A1 can be be in none of the subgroups but is not allowed to exist in two subgroups. A3’s result is as follows:

esProc_deep_set_thinking_25

In this case, an employee may be put into the group satisfying the first condition. Though employees who are below 30 satisfy the condition that employees who are 40 or below, they will not be distributed to the second group.

A5 also performs enumeration grouping by A2’s conditional sequence. Here @r option is used to allow repeated grouping. So an employee record may appear in more than one group. A4’s result is as follows:

esProc_deep_set_thinking_26

Although it seems that these two functions differ greatly with group function, you will understand, having the nature of group operations in mind, that the three functions actually do the same thing – to split a set into multiple subsets. The only difference is that they split sets in different ways.