In data analysis, comparison operations are employed to check whether a datum is greater/less or equal to another datum, as well as to perform operations including query, select, sort and group. Here we discuss the uses of comparison operations in esProc and solve possible related problems.

## 1.Comparing single values

esProc supports various data types such as integer, long integer, floating point number, big decimal, boolean, string, date, time and datetime. We can compare datum ** a** and datum

**of any of these types using**

*b***cmp(**function. For example:

*a*,*b*)A | B | |

1 | =cmp(1234,1243) | =cmp(3.145,3.142) |

2 | =cmp(6/2,pi()) | =cmp(true,true) |

3 | =cmp(“New Jersey”,”New York”) | =cmp(date(2016,3,1),date(2016,2,29)) |

Generally the comparison happens between data of the same type. But, integers, long integers, floating point numbers and big decimals are all real numbers, so they can be compared freely with each other. Below are results of A1, B1, A2, B2, A3 and B3:

If the value of** cmp( a,b)** is

**1**,

**is greater than**

*a***, i.e.**

*b***.**

*a>b*If the value of** cmp( a,b)** is

**-1**,

**is less than**

*a***, i.e.**

*b***.**

*a<b*If the value of** cmp( a,b)** is

**0**,

**is equal to**

*a***, i.e.**

*b***. Note that in this case the comparison operator is a double equal signs. The single equal sign operator is for assigning value to the parameter.**

*a==b*Similarly, **cmp( a,b)>=0 **means

**is greater than or equal to**

*a***, i.e.**

*b***a**>=

**b**.

**cmp( a,b)<=0 **means

**is less than or equal to**

*a***, i.e.**

*b***a**<=

**b**.

**cmp( a,b)!=0 **means

**is not equal to**

*a***, i.e.**

*b***a**!=

**b**.

As can be seen from the cellset results in the above, comparing real numbers is comparing their values. Also, the **cmp()** function can compare the results of two expressions. For string comparisons, the function will compare the ASCII values of each pair of characters in order until a pair with different values appears. The comparison is irrelevant to the lengths of the strings. If all their characters are identical, the two strings are equal. In comparing two date/time/datetime values, the one located at a later point in a timeline is greater than the other located at an early point.

The above cellset values can be expressed using the comparison operators in the following ways:

A | B | |

1 | =1234<1243 | =3.145>3.142 |

2 | =6/2<pi() | =true==true |

3 | =”New Jersey”<“New York” | =date(2016,3,1)>date(2016,2,29) |

Expressions like ** a>b** and

**are boolean expressions. They return true or false depending on whether the conditional relationship between the two operands holds or not. In the above cellset, all boolean expressions in cells from A1 to B3 hold and thus return true:**

*a==b*The date/time/date/time data needs to be compared in a special way. If we just want to find if the data is in the same day or the same month, use the **deq()** function. For example:

A | B | |

1 | =deq@y(date(2016,3,1),date(2016,2,29)) | =deq@m(date(2016,3,1),date(2016,3,15)) |

With **@y** option, the function determines that the two objects are equal as long as they are in the same year. Using **@m** option, it finds whether the two objects are in the same month. Below are results of A1 and A2:

Except various types of real numbers, esProc forbids comparisons between different types of data. For example:

A | |

1 | =cmp(“New Jersey”,1234) |

In this case the error information appears and the computation terminates:

To compare data having different data types, first a type conversion should be performed. For example:

A | B | |

1 | =cmp(“New Jersey”,string(1234)) | =cmp(long(date(2016,3,1)),12345678987654L) |

2 | =cmp(true,bool(1)) | =cmp(true,bool(“New Jersey”)) |

For type conversion, a date/time/datetime value can be converted to a long integer, which indicates the number of milliseconds between the value itself and 0:00:00 January 1, 1970 GMT. Any real number can be converted to a boolean value **true**; and except the string “**false**” that will be converted to **false**, any other string will be converted to **true**.

## 2.Comparing sequences

A sequence has members. By comparing members of two sequences, we can perform locate, select and sort operation. For example:

A | B | |

1 | [Rebecca Moore,Ashley Wilson,Rachel Johnson,Ryan Williams,Richard] | |

2 | =A1.pos(“Ashley Wilson”) | =A1.select(~>”Re”) |

3 | =A1.sort() | =A1.sort(right(~1)) |

A2 locates the position of Ashley Wilson in the sequence. B2 selects members that are greater than “Re”. Below are results of A2 and B2:

A3 sorts members of the sequence in ascending order. B3 sorts the members by comparing the last letters of members. Below are results of A3 and B3:

The sequence comparison is similar to string comparison. Members of two sequences will be compared in alignment until different members are found. The result of comparing these two different members is the result of sequence comparison. If every two members with the same position are equal, then the two sequences are equal. For example:

A | B | C | |

1 | [1,2,3,4,5] | [1,2,3,5] | [5,3,2,1,4] |

2 | =cmp(A1,B1) | =cmp(A1,C1) | |

3 | =A1<B1 | =A1==C1 | =A1.eq(C1) |

A2 and B2 compare the two sequences using **cmp()** function. Here are the results:

In comparing the sequences in A1 and B1, their first three members are same, but the fourth members are respectively 4 and 5. So the returned result is -1, without the need of considering their lengths. Members in both A1 and C1 are numbers from 1 to 5, but they have different orders. They are regarded as unequal when compared.

As with single value comparison, the relationship of two sequences ** A** and

**can be represented by**

*B***A**>

**B**,

**A**<

**B**,

**A**==

**B**,

**A**<=

**B**,

**A**>=

**B**and

**A**!=

**B**. Since both A2 and B have the results of -1, they can be expressed as A1<B1 and A1<C1 respectively. Thus A3 and B3 return the following values:

To determine if sequence ** A** and sequence

**contain the same members, use**

*B***A**.eq(

**B**) function. C3 gets a result as follows:

Since a sequence allows its members to use different data types, members of the two sequences in the same position need to have comparable data types. For example:

A | B | C | |

1 | [1,2,3,4,5] | [1,3,two] | [1,two,3] |

2 | =cmp(A1,B1) | =cmp(A1,C1) |

A2’s operation works normally because the comparison finishes at the second members. But when trying to compare A1 and C1, because their second members **2** and **two** are of different data types, they can’t be compared and the error information appears to terminate the computation:

## 3.Comparing records

We can perform operations such as locate, select and sort on a sequence by comparing each of its members with a given condition. In the same way, we can perform these operations on a table sequence by doing the same with the records. In data processing, the comparison of records usually involves only certain fields. For example:

A | B | |

1 | $ select NAME,ABBR,POPULATION from STATES | =A1.sort(POPULATION:-1) |

2 | =B1(1) | =B1(2) |

3 | =cmp(A2.POPULATION,B2.POPULATION) | =cmp(A2.ABBR,B2.ABBR) |

4 | =cmp(A2,B2) | =A1.sort(~) |

A1 and B1 contain respectively a table sequence generated from the demo database and a record sequence sorted by population in descending order:

A2 and B2 respectively retrieve the state record with the largest population and the one with the second largest population:

Actually B1 sorts records by comparing every two of the records’ POPULATION field values. A3 and B3 compare the POPULATION fields and ABBR fields of the two retrieved records. Here’re the results:

As can be seen, different results may be obtained by comparing different fields of the same two records.

A4 compares the two records themselves. B4 sorts A1’s records in ascending order by comparing every two of them. Their results are as follows:

That A4 returns a result means records themselves can be compared. But by examining B4’s result, we can see that records aren’t compared according to any of the fields and the result seems disordered. In fact, without specifying the sorting field(s), record sorting makes no sense and simply causes disorder. esProc, however, compares two records in a table sequence based on their intrinsic hash values.

Though it’s meaningless to purely compare two records, we can still use the comparison to check whether they are equal from a particular point of view so that we can realize some operations like grouping. For example:

A | B | |

1 | $ select STATEID,NAME,ABBR from STATES | $ select CID,NAME,STATEID as STATE from CITIES |

2 | =B1.switch(STATE,A1) | |

3 | =A2.group(STATE) | =A3.new(STATE.STATEID:SID,STATE.ABBR:State, ~.(NAME):Cities,~.count():Count) |

4 | =A2.group(STATE.ABBR) | =A4.new(STATE.STATEID:SID,STATE.ABBR:State, ~.(NAME):Cities,~.count():Count) |

A1 and B1 retrieve the state information and city information separately from the demo database. A2 associates the STATE field with the records in the *states* table sequence using the **switch** function. Here’s A2’s result:

A3 groups the city records by state to learn more about the cities in each state. The group operation is performed by directly comparing the state records. In B3, A3’s grouping result is presented clearly and in detail, as shown below:

Because the grouping is carried out by simply comparing the records according to their intrinsic hash values, the result is chaotic but is achieved faster.

To make the grouping result ordered, we should specify the grouping criterion. For example, A4 groups records by ABBR field. Its result is presented by B4 as follows:

With the grouping criterion, the resulting records will be ordered by a certain desired field, like the state abbreviation.