Related Computing in esProc – Sorting and Ranking Data

esProc 926 0

We often need to sort or rank data during data statistics and analysis. The operations include problems like sort sales records according to the dates and rank the results of salespersons for assessing their performance.

1. Sort and rank members in a sequence

In esProc, functions, such as psort, sort, ranki rank, and etc. are used to sort or rank data. Now let’s look at how to realize the operations.

  A
1 [8,1,2,7,0,5,3]
2 =A1.sort()
3 =A1.psort()
4 =A1.rank()
5 =A1.ranki(5.5)

The codes from A2 to A5 sorts data and ranks data based on the sequence in A1. Results of A1 and A2 are respectively as follows:

esProc_related_sort_rank_2

In these codes, A2 sorts members in the sequence in ascending order and A3 computes the sequence numbers of the sorted members in the original sequence. For instance, the sequence number 0, which is of the smallest value, in the original sequence is 5; the sequence number 8, which is of the biggest value, in the original sequence is 1. Actually the result of the expression in A2 is the same as that of =A1(A1.psort()). So psort function can be used to generate an index sequence during data computing.

Results of A3 and A4 are respectively as follows:

esProc_related_sort_rank_3

The result of A3 is the rankings in descending order of members in the original sequence. A4 computes the ranking of 3.5 compared with each member of the original sequence.

In real business situations, it is seldom to sort data based on an existing sequence of constants. Usually the sequence waiting to be sorted is got by computing. For example:

  A B C
1 =demo.query(“select * from EMPLOYEE where EID<6”)    
2 =A1.(NAME+” “+SURNAME) =A2.sort() =A2.rank()
3 =A1.(BIRTHDAY) =A3.sort() =A3.rank()
4 =A1.(age(BIRTHDAY)) =A4.sort@z() =A4.rank@z()

For the convenience of comparison, A1 only selects the top five records of employees for computing. A2 computes their full names, B2 sorts them and C2 computes the rankings of these full names. Results of A2, B2 and C2 are as follows:

esProc_related_sort_rank_5

It can be seen that these strings are sorted alphabetically. In practice, strings are sorted and ranked according to each character’s ASCII code.

A3 computes employees’ birthdays, B3 sorts them and C3 computes the rankings of these birthdays. Results A3, B3 and C3 are as follows:

esProc_related_sort_rank_6

It can be seen that the data of date type are sorted by the date in ascending order. Please note that you can click Tool>Option to set the format of the date on page Environment. The result of sorting or ranking data of date type or date/time type is irrelevant to the format.

A4 computes the ages of employees. B4 sorts these ages using @z option, which means sorting data in descending order; C4 also uses this option to compute the rankings of these ages, meaning the rankings are ordered in ascending order. Similarly, psort function and ranki function can also use the @z option. Results of A4, B4 and C4 are as follows:

esProc_related_sort_rank_7

We can notice these two points. One is that, by adding @z option, the orders according to which sort function and rank function get their results respectively are opposite; the other is that there are duplicate values in the age data. As shown in the resulting rankings, the rankings of both 39 are 3 while the ranking of 43 becomes 5 when it should be ranked next to 39. This means a same ranking will occupy a position by default. We’ll discuss data ranking in detail in the third section.

The way that strings are sorted according to each character’s ASCII code may cause an error if the target strings are not in English language. In that case, A.sort(…;loc) is needed to specify the language with the parameter loc. For example:

  A
1 [Íker,Álvaro,Estela,Alba,César,Sancho]
2 =A1.sort()
3 =A1.sort(;”esp”)

The sequence in A1 contains names in Spanish. A2 and A3 sort them in different ways. A3 specifies “esp” as the language parameter, meaning the to-be-sorted strings are in Spanish. Results of A1, A2 and A3 are as follows:

esProc_related_sort_rank_9

As can be seen from the results, a correct sorting result can be got by specifying the language parameter.

For the languages and charsets esProc supports, please refer to the Function Reference.

2. Sort records in a data table

We often need to sort the records in a data table as needed during data computing. For example:

  A
1 =demo.query(“select EID,NAME,SURNAME,GENDER,BIRTHDAY from EMPLOYEE where EID<6”)
2 =A1.sort(NAME+” “+SURNAME)
3 =A1.sort(BIRTHDAY)
4 =A1.sort@z(age(BIRTHDAY))

A1 selects some of the employee information:

esProc_related_sort_rank_11

A2 sorts the records according to the employees’ full names. Result is as follows:

esProc_related_sort_rank_12

A3 sorts the records again by birthdays. Result is as follows:

esProc_related_sort_rank_13

Then A4 sorts the records by ages. Result is as follows:

esProc_related_sort_rank_14

So we can see that the records in a table sequence or a record sequence can be sorted conveniently as needed by using sort function and a new record sequence will be generated and returned.

Let’s look at the result of A2, which has been sorted by full names. But the sorting process is usually that the records of employees are sorted first by surnames and then those of employees who have the same surnames are sorted again by names. Multiple fields are needed for sorting data in this situation.

  A
1 =demo.query(“select EID,NAME,SURNAME,GENDER,BIRTHDAY from EMPLOYEE”)
2 =A1.sort(SURNAME,NAME)
3 =A1.sort(GENDER:1,SURNAME:-1,NAME:-1)

A1 selects all the employee records, as shown below:

esProc_related_sort_rank_16

A2 sorts the data first by SURNAME and then by NAME. Result is as follows:

esProc_related_sort_rank_17

We have learned from the first section that, when using functions like sort, psort, rank, ranki and so on, @z option can be added to them to adjust the sorting order. But in the case that the data are sorted according to multiple fields, @z option cannot be used to because the sorting order for each field may not be the same, though we can add a certain mark to each field to specify the sorting order. For example, in the expression in A3, GENDER:1 means sorting data by genders in ascending order and SURNAME:-1 means sorting data by surnames in descending order. If there is no such kind of marks, the data will be sorted in ascending order. The sorting result of A3 is as follows:

esProc_related_sort_rank_18

If the records waiting to be sorted are non-English strings, we can also specify the language parameter in sort function.

Sometimes we sort the records in a table sequence just for getting the top few ones. And top function or topx function can be used for this purpose. For example:

  A
1 =demo.query(“select EID,NAME,SURNAME,GENDER,BIRTHDAY from EMPLOYEE”)
2 =A1.top(age(BIRTHDAY);5)
3 =A1.topx(age(BIRTHDAY);5)
4 =A1.topx(-age(BIRTHDAY);5)
5 =A4.(-~)

A2 gets the records of five youngest employees and A3 computes the five youngest ages. Results of A2 and A3 are as follows:

esProc_related_sort_rank_20

@z option cannot be used in either top function or topx function. When computing five oldest ages, we should add a negative sign before the expression for data sorting, as the code in A4 shows. The final result will be got after the data arrangement in A5. Results of A4 and A5 are as follows:

esProc_related_sort_rank_21

The same as we deal with the numerical data in the above example, the method of getting the top few rows in descending order by adding a negative sign before the expression for sorting can also apply to the data of date/time type and of string type. For example:

  A
1 =demo.query(“select EID,NAME,SURNAME,GENDER,BIRTHDAY from EMPLOYEE”)
2 =A1.top(-BIRTHDAY;5)
3 =A1.topx(-NAME;5)
4 =A3.(-~)

A2 gets the records of the five youngest employees:

esProc_related_sort_rank_23

A3 gets the last five employee names that appear in alphabetical order. The result needs to be rearranged in A4 for viewing. The final result is as follows:

esProc_related_sort_rank_24

It can be seen that all the names are Zarchary.

3. Compute rankings in a table sequence

There are two common ways for computing rankings: rankings with/without duplicate values. For example, when there are two players tied for first place in a competition and if duplicate rankings are to be counted, the next player following them will be the one getting the third place. If duplicate rankings won’t be counted, the second place will follow without having to consider the extra ranking. In esProc, rank is used to compute the rankings and ranki to compute the ranking of a certain member. The use of @i option is to remove the duplicate values before the rankings are computed. For example:

  A
1 =demo.query(“select * from EMPLOYEE”)
2 =A1.rank(SALARY)
3 =A1.rank@i(SALARY)
4 =A1.ranki(10000,SALARY)
5 =A1.ranki@i(10000,SALARY)

A2 computes the rankings of salaries of all employees. A3 computes them without duplicate values. Results of A2 and A3 are as follows:

esProc_related_sort_rank_26

Now let’s look at the employee data in A1:

esProc_related_sort_rank_27

It can be seen by comparison that the employees with the same salary, such as Rebecca and Emily, get the same ranking for their salaries. But obviously they rank much higher in the result of A3, which is caused by the method of ranking without duplicate values, with which the same rankings won’t affect the rankings following them. In fact it is the result in A2 that reflects the rankings more objectively.

A4 computes the ranking of 10,000 in employee’s salaries; A5 computes the ranking of 10,000 by removing the duplicate values. Results of A4 and A5 are as follows:

esProc_related_sort_rank_28

And we also get a higher ranking with the method of rankings without duplicate values. rank@i and ranki@i compute rankings in a way similar to that in which id function is used to list all the different rankings before the result is computed.

FAVOR (0)
Leave a Reply
Cancel
Icon

Hi,You need to fill in the Username and Email!

  • Username (*)
  • Email (*)
  • Website