Perfect Grouping and Summarizing: Equal, Align, and Enum

Uncategorized 672 0

esProc provides three complete and powerful grouping methods, namely, Equal, Align, and Enum. With these functions, the true meaning of grouping and summarizing is restored and kept.
esproc_groupby_1

Summarizing by group is the commonest function in the field of statistics analysis. It is usually used to analyze the analytic target as a whole. In which, the grouping is to classify the data into several groups following a certain rules, and the summarizing is to aggregate a group of data.

The typical “group + summarize” problem is to group the performance by department and compute the average performance for each department. As far as the natural human thoughts goes, firstly, group by department, secondly, rank the sales person in the sales department by sales values, thirdly, rank the employee in other departments by performance. In this case, there is only the grouping with no summarizing action.

The grouping provided by esProc covers various situations and much more convenient than that of SQL.

I. Case and Comparison

Suppose that there is an insurance company that needs to analyze the insurance policy data in order to find out the sales persons who have signed the highest and the lowest quantity of insurance policy. The insurance table mainly consists of the insuranceID, salesman, insuranceTime, and price fields.

SQL Solution
SELECT salesMan
FROM (SELECT salesMan,
row_number() OVER (ORDER BY isrCount DESC) descOrder,
row_number() OVER (ORDER BY isrCount ASC) ascOrder
FROM (SELECT salesMan,
COUNT(*) isrCount
FROM insurance
GROUP BY salesMan
)
)
WHERE descOrder=1 OR ascOrder=1 ORDER BY descOrder

To solve this problem, the most convenient way is to query through 3 nested loops.

esProc Solution
esproc_groupby_2

Cell A1: Group by salesman. Each group is a collection of all insurance policies from one salesman.
Cell A2: Sort the group by quantity of insurance policy . The ”~” is the respective group of insurance policy to which each salesman corresponds.
Cell A3: Compute the highest and lowest quantity of insurance policy, that is, the first and the last groups in the Cell A2.
Cell A4: Print the name of salesman, that is, the salesman corresponding to the two sets of insurance policies in A3.

It sounds not a complicated question, but wading through a three-level nested sub-query and a zigzag analysis is required if using SQL to solve it. By comparison, esProc user can analyze it step by step in a clear and easy-to-understand flow natural to the human thoughts.

Just like the SQL style, in fact, esProc user can complete all steps in one statement: insurance.group (salesMan).sort( ~.len()).m([-1,1]).(salesMan). However, such SQL-resembling style keeps the esProc user from reusing the grouping result and become an obstacle to further analysis. For example, to check the total sum of insurance policy of these two sales persons, esProc user only need to append a statement A3.(~.sum(price)) in the cell, while the SQL user have to start from scratch and go through the whole process again to repeat the zigzag nested sub-queries.

II. Function Description:

Equal Group

The Equal group is to group by a certain field in the dataset directly, which is similar to that of SQL.
For example, group by month and then count as given below:
insurance.group(month(insuranceTime); ~.count())

Align Group

The criteria for align group is different with that for normal group because the criteria for align group is not from the original dataset but the external source such as parameters and other dataset or set temporarily organized by users.

For example, to check the insurance policies signed by top N salespersons against whom the least customer complaints are lodged. You can retrieve the ordered list of satisfying salesperson from the CRM system, and group the insurance policy data in the insurance policy system according to the name list.

Enum Group

The criterion for enum group is more flexible and even the arbitrary Boolean expression is acceptable.

The align group and enum group are all of the non-equal group type that is different to the normal group because the result of grouping operations could be null set and the same data in the group could be duplicate in other groups. The similar function would be much wieldy for SQL user to implement.

For example, the criterion for grouping is that the insurance policy valued between 100 and 500, between 500 and 2000, between 2000 and 10000, and those higher than 5000. Suppose that we use ABCD to represent the four groups, group and count the policies. Grouping like this could usually result in the below situations:
esproc_groupby_3

1.The extra group is unnecessary. The insurance policies valued below 100 are not the focus of business concerns, and thus not included in the grouping criteria, and not required to appear in the results. For the enum group of esProc, it has the feature of “What You Group Is What You Get”. The result is the true reflection of the analysis goal. By comparison, the SQL group operation will add one unnecessary null group in the group result.
esproc_groupby_4

2.The null group is necessary. If the result of enum group A and D are both null, then the count is 0, representing “There is no insurance policy satisfying A and D.”. If directly grouping via SQL, then A and D will disappear inexplicablely, making the SQL user hard to determine if it result from lacking group analysis on “over-large or over-small valued policy”, or having done the group operations but the count is still 0.
esproc_groupby_5

3.Repeat as required. The insurance policy whose value is 8000 obviously belongs to both the Group C and the Group D. esProc provides the @r option to represent whether it is allowed for a same data appearing in various groups repeatedly. SQL lacks of methods to represent such situations. To perform the group action directly will only allow the data to appear in one group and you will not get the correct result.

III. Advantages

True Group Algorithm

The grouping in esProc does not rely on summarizing, which restores and keeps the true meaning of grouping. esProc user can break the solution to complicated problems down to several simple steps and reuse the previous results, enabling the analyst to analyze it in a mode natural to the human thought.

Complete Grouping Method

The group criteria can be from either the local dataset or the external source. The data belong to multiple groups can be either counted into several groups repeatedly or only to one group. The group criteria can be the simple comparison of equalization or any Boolean expressions.

Agile and Powerful Functionality of Grouping

esProc implements the arbitrary complicated group analysis with group, align, and enum functions. To implement the same level functionality with SQL, it will be much more complicated and wieldy.

FAVOR (0)
Leave a Reply
Cancel
Icon

Hi,You need to fill in the Username and Email!

  • Username (*)
  • Email (*)
  • Website