The group operations performed on tabular data generated from text files include algorithms such as group-and-aggregate, getting distinct values, and group merging. They can be implemented in high-level languages like Java or scripting languages like Python, but both provide only limited support for structured-data computing, forcing complicated code for knotty cases and making it hard to write efficient programs.
With a comprehensive class library for structured-data computing, esProc is good at handling group operations on tables imported from text files. It can run on its own or be invoked from the command line or a Java application, as discussed in the article How esProc Implements Text Processing.
The following sections walk through some typical scenarios of grouping text-based tabular data and give their esProc solutions.
Simple grouping and aggregation
The sales.csv file stores sales orders. Requirement: Group data by SellerId to calculate the total sales amount per seller and write the results to a new file. Below is the source file:
Explanation: First import the source file as a structured two-dimensional table, then perform the group-and-aggregate.
By default, the import function imports all fields using tab as the separator; @t means importing the first row as the field names. The groups function performs group-and-aggregate on a two-dimensional table, and the export function writes a two-dimensional table to a file. To write the tabular data to a database table with the same structure instead, use OracleDB.update(A2,Tsales).
Here’s the result:
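For comparison, the same group-and-sum can be sketched in a few lines of plain Python. This is a minimal illustration, not the article's method: it assumes the tab-separated file has already been parsed into (SellerId, Amount) tuples, e.g. with the csv module.

```python
from collections import defaultdict

def total_amount_by_seller(rows):
    """rows: iterable of (SellerId, Amount) pairs parsed from the sales file.
    Returns {SellerId: total Amount}, like A1.groups(SellerId;sum(Amount))."""
    totals = defaultdict(float)
    for seller, amount in rows:
        totals[seller] += amount
    return dict(totals)
```

Even this simplest case already needs an explicit accumulator dictionary, which is the kind of boilerplate the single groups call avoids.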
1. If the source data is already ordered by SellerId, we can use @o to speed up the processing by comparing only adjacent records: =A1.groups@o(SellerId;sum(Amount)).
2. esProc also supports grouping by multiple fields and computing several aggregates at once. To calculate the sales amount and the order count for each seller per year, for example, we could use A1.groups(SellerId,year(OrderDate);sum(Amount),count(OrderID)).
Here’s the result:
3. When the source file is too big to be loaded into memory in its entirety but the aggregate result is small (which is what usually happens), we can import the file as a cursor and aggregate it with the groups function: file("D:\\sales.txt").cursor@t().groups(SellerId;sum(Amount)).
4. If even the aggregate result set is too large to fit into memory, we can carry out the aggregation with the groupx function. Unlike groups, groupx returns a cursor.
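The cursor idea of point 3 can be mimicked in plain Python by consuming the file one line at a time, so only the small per-seller totals stay in memory. A minimal sketch, assuming the tab-separated layout with a header row described above:

```python
from collections import defaultdict

def total_by_seller_streaming(lines):
    """Aggregate a tab-separated sales file one line at a time; only the
    per-seller totals (the small aggregate result) are kept in memory."""
    totals = defaultdict(float)
    header = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if header is None:
            # the first row holds the field names, like @t in esProc
            header = {name: i for i, name in enumerate(fields)}
            continue
        totals[fields[header["SellerId"]]] += float(fields[header["Amount"]])
    return dict(totals)
```

Passing an open file object as `lines` streams the file without loading it; the external-sort case that groupx handles would need considerably more code.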
Removing duplicate rows
This scenario removes the duplicate rows in the Sales.csv file; that is, when two rows share the same OrderID, only the first one is kept. Below is a selection of the source file:
Explanation: Group the imported data by OrderID and get the first record of each group, without aggregation. The group function groups data without performing any aggregate; @1 means taking the first row of each group and returning the results as a new two-dimensional table.
Here’s the result:
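A plain-Python counterpart of group@1 keeps the first row seen for each OrderID. A minimal sketch with rows as dicts; the function name is illustrative:

```python
def first_per_key(rows, key):
    """Keep only the first row for each distinct key value,
    like A1.group@1(OrderID) in esProc."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept
```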
1. In practice it is more common to retain records according to a certain condition. To keep the record with the earliest date in each group, use the following code: A1.group(OrderID).(~.maxp(-OrderDate)).
Here’s the result:
To find the first record of each group when sorted by OrderDate in ascending order and by Amount in descending order, we find the records with the earliest OrderDate and then take the one with the greatest Amount. The code is A1.group(OrderID).(~.maxp([-OrderDate,Amount])).
If multiple records satisfy the condition and we want them all, use A1.group(OrderID).conj(~.maxp@a([-OrderDate,Amount])).
To get the top two of the eligible records, use A1.group(OrderID).conj(~.top(2,OrderDate,-Amount)).
2. Sometimes it is the duplicate records that we want; get them with A1.group(OrderID).select(~.len()>1).conj().
Here’s the result:
3. Sometimes we simply want the distinct values of a certain field. To get the client list, for instance, use A1.id(Client). The result:
The id function performs the distinct operation, so it is the equivalent of A1.group@1(Client).(Client).
4. Other times we want the rows that are distinct across all field values, rather than filtering records by a single ID. In those cases we use A1.group@1(OrderID,Client,SellerId,Amount,OrderDate).
If no further structured-data computation is intended, we can import the file as a set of rows and perform the distinct operation with file("D:\\sales.csv").import@tsi().id(). This is what we get:
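The variants above can be sketched the same way in plain Python (rows as dicts; all names are illustrative, and ties keep the first row seen, matching maxp's behavior):

```python
from collections import defaultdict

def earliest_per_group(rows):
    """For each OrderID keep the row with the smallest OrderDate,
    mirroring group(OrderID).(~.maxp(-OrderDate))."""
    best = {}
    for row in rows:
        k = row["OrderID"]
        if k not in best or row["OrderDate"] < best[k]["OrderDate"]:
            best[k] = row
    return list(best.values())

def duplicated_rows(rows):
    """Only the rows whose OrderID occurs more than once,
    like group(OrderID).select(~.len()>1).conj()."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["OrderID"]].append(row)
    return [r for grp in groups.values() if len(grp) > 1 for r in grp]

def distinct(values):
    """Distinct values in first-seen order - the id() operation."""
    return list(dict.fromkeys(values))
```

Each esProc one-liner turns into its own loop-and-dictionary routine here, which is the support gap the introduction refers to.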
Concatenating strings by groups
The sales.csv file stores sales orders, from which we want a table of client lists with three fields: year, month and a comma-separated client list. Below is the source file:
Explanation: Group the imported data by the year and month parts of OrderDate, get the set of clients in each group and join each set into a comma-separated string. Here's the result:
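The same string concatenation by groups can be sketched in plain Python. Assumptions: dates are 'yyyy-mm-dd' strings, and each client is listed once per month (matching the "set of clients" above).

```python
from collections import defaultdict

def client_list_by_month(rows):
    """rows: (OrderDate, Client) pairs. Returns {(year, month): 'C1,C2,...'}
    with each client listed once per month, in first-seen order."""
    groups = defaultdict(list)
    for date, client in rows:
        key = (date[:4], date[5:7])
        if client not in groups[key]:
            groups[key].append(client)
    return {k: ",".join(v) for k, v in groups.items()}
```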
The empContact.csv file has six columns. The first four contain employee information, with duplicate values; the fifth column holds the way of contact and the last column the specific contact information. Each employee has at least one way of contact. Now we want to merge the records into a new table whose fifth, sixth and seventh columns are work phone, cell phone and work email. Below is the source file:
A2: =A1.group(#1,#2,#3,#4;~.align(["work phone","cell phone","work email"],#5):g)
Explanation: Group the imported records by the first four columns and align the records in each group to the order work phone, cell phone, work email (duplicate values are automatically filtered away); then build a new two-dimensional table by retrieving the records in sequence. Here's the result:
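The align step amounts to pivoting each employee's contact rows into fixed slots. A minimal sketch, assuming rows come in as (employee-key, kind, value) tuples; missing kinds stay None:

```python
CONTACT_KINDS = ["work phone", "cell phone", "work email"]

def merge_contacts(rows):
    """rows: (emp_key, kind, value) triples. Returns one record per employee:
    the three contact values aligned to CONTACT_KINDS, like align() above."""
    merged = {}
    for key, kind, value in rows:
        slots = merged.setdefault(key, [None] * len(CONTACT_KINDS))
        if kind in CONTACT_KINDS:
            slots[CONTACT_KINDS.index(kind)] = value
    return {k: tuple(v) for k, v in merged.items()}
```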
Grouping by sequence numbers
In the following file, every three lines correspond to one record. For example, the first record is 26\TAS\2009-08-05. We want to rearrange it into a two-dimensional table. Here’s the source data:
Explanation: Import the file as a sequence, where @s means fields won't be extracted; group the sequence every three lines, where "#" represents the sequence number of each line and "\" represents integer division; finally create a table sequence from the groups, where ~(1) is the first member of the current group.
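The grouping-by-sequence-number trick corresponds to chunking a flat line list into fixed-size records. A plain-Python sketch of the simple three-lines-per-record case:

```python
def lines_to_records(lines, n=3):
    """Group a flat sequence of lines into records of n consecutive lines,
    like grouping by the integer division of each line's sequence number."""
    stripped = [line.rstrip("\n") for line in lines]
    return [tuple(stripped[i:i + n]) for i in range(0, len(stripped), n)]
```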
Extended information: the file may have a more complicated format. For example, suppose the second line of each group contains three fields:
In this case, we just need to modify A3 into: =A2.new(~(1):OrderID, (line=~(2).array("\t"))(1):Client, line(2):SellerId, line(3):Amount, ~(3):OrderDate).
Intra-group order-related calculation
postTurn.txt records the schedule of rotating shifts for a post. The file is grouped by dates, and some people work consecutive shifts. We want to find out how many times each pair of neighboring names works successive shifts. The result fields are first, next and times. Below is the source file:
Explanation: Group the imported data according to whether a row contains "#"; each group is one day's shift schedule. For each group, put every two neighboring names into a subset and concatenate all the subsets; then group each concatenated set every two members, take the first name and the next name from each pair, and count how many times they appear together. Here's the result:
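A plain-Python sketch of the same pair counting. The exact file layout is an assumption based on the description: a line containing "#" starts a new day, and every other line holds one name.

```python
from collections import Counter

def successive_pairs(lines):
    """Count, per ordered (first, next) pair, how often the two names
    work successive shifts within the same day."""
    counts = Counter()
    day = []
    for line in lines:
        if "#" in line:
            day = []  # a new day's schedule begins; pairs don't cross days
        else:
            if day:
                counts[(day[-1], line)] += 1
            day.append(line)
    return dict(counts)
```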
Inserting data between groups
In the sales.txt file, the value of SellerId changes every few rows. We want to export the OrderID field, adding the string "Begin" each time a group of records with a new value begins and the string "End" after the group ends. Below is the source file:
Explanation: Write the code directly as the requirement states. @o enables grouping by comparing adjacent records. To retrieve a certain field from a group of records, use ~.(OrderID). The operator "|" concatenates sets or members of a set. The conj function handles each subset in turn and then concatenates the results.
Here’s the result:
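In plain Python, itertools.groupby gives the same adjacent-records grouping as @o (it likewise requires the data to be ordered by the key). A minimal sketch:

```python
from itertools import groupby

def mark_groups(rows):
    """rows: (SellerId, OrderID) pairs already ordered by SellerId.
    Emits 'Begin', the group's OrderIDs, then 'End' for every run of
    identical SellerId values."""
    out = []
    for _seller, grp in groupby(rows, key=lambda r: r[0]):
        out.append("Begin")
        out.extend(order_id for _, order_id in grp)
        out.append("End")
    return out
```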
1. If the data is unordered, stop using @o; performance will decrease a little.
2. To add an aggregate value, for example the number of records in each group, write A2 as A1.group@o(SellerId).conj(("Begin: "+string(~.count()))|~.(OrderID)|"End").
Here’s the result:
3. To add strings to both ends of a group of records and to append them to a new file in order, use the following code:
Sampling by groups
The sales.txt file stores ordering data, and we want to retrieve one record from each seller's records. Here's the rule: find the records whose OrderDate is fewer than 60 days later than the previous record's, and retrieve one of these sampled records at random; if there are not enough records satisfying the condition, just take the record with the smallest OrderDate. In addition, it is specified that the second record of each group is always included. Below is the source file:
Explanation: Group the imported records by SellerId and loop through the groups. In each loop, first sort the group by OrderDate and then take samples as required. If enough records are sampled, retrieve one at random and append it to B1; otherwise take the first record of the group. OrderDate[-1] refers to the previous record relative to the current one. Below are the results of three samplings:
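The sampling rule can be sketched in plain Python. This simplification assumes each group is already sorted by date and represents dates as day numbers; the "second record always included" detail is folded into the fewer-than-60-days test rather than handled separately.

```python
import random

def sample_per_seller(groups):
    """groups: {SellerId: rows sorted by date}, each row starting with a day
    number. Per seller, pick at random a row dated fewer than 60 days after
    the previous row; if none qualifies, fall back to the earliest row."""
    picked = {}
    for seller, rows in groups.items():
        eligible = [rows[i] for i in range(1, len(rows))
                    if rows[i][0] - rows[i - 1][0] < 60]
        picked[seller] = random.choice(eligible) if eligible else rows[0]
    return picked
```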
Data grouping, splitting and export
The sales.txt file stores a large number of sales orders, which we want to divide into multiple files according to both the year and the month. The format of the file name will be “year-month.txt”. Below is the source file:
Explanation: Group the imported file by year and month, then loop through the groups, writing each one to its own file. For instance, here are the contents of the 2009-01.txt file:
Extended information: When the size of the source data exceeds the memory capacity, we should import the file with the cursor function. If each group of data is still too big to be accommodated by the memory, we can use the groupx function to group the cursor file. The coding structures for both cases are the same.
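A plain-Python sketch of the split. To keep it self-contained it returns a mapping of file name to lines instead of writing files; writing each value under its key-named file completes the export step.

```python
from collections import defaultdict

def split_by_month(rows):
    """rows: (OrderDate 'yyyy-mm-dd', record line) pairs.
    Returns {'yyyy-mm.txt': [lines...]}, one entry per year-month."""
    files = defaultdict(list)
    for date, line in rows:
        files[f"{date[:4]}-{date[5:7]}.txt"].append(line)
    return dict(files)
```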
Handling the big file by groups
The sales.txt file is too big to fit in memory, and we need to select the sellers whose total order amount is at least 10,000 and write their records to another file. The source data is already ordered by SellerId, as shown below:
A2: for A1;SellerId
B2: if A2.sum(Amount)>=10000
C2: =file("e:\\result.txt").export@at(A2)
Explanation: Loop through the cursor by SellerId, importing one group of data into memory at a time; sum the Amount values of the current group and, if the total satisfies the condition, write the group's records to the new file. Here's the result:
If the final result set is relatively small, we can append it to a cell (B1, for instance). In that case C2 should be B1=B1|A2.
If the source data isn’t ordered, we can first sort it using the sortx function, which returns a cursor, too.
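A plain-Python sketch of the same streaming filter. It assumes the rows arrive ordered by SellerId, as in the esProc version, so one seller's group can be buffered at a time:

```python
def filter_seller_groups(rows, threshold=10000):
    """rows: (SellerId, Amount, record line) ordered by SellerId. Buffers one
    seller's rows at a time (the 'for A1;SellerId' idea) and keeps a group
    only when its Amount total reaches the threshold."""
    out, group, total, current = [], [], 0.0, object()
    for seller, amount, line in rows:
        if seller != current:
            if total >= threshold:
                out.extend(group)
            group, total, current = [], 0.0, seller
        group.append(line)
        total += amount
    if total >= threshold:  # flush the last group
        out.extend(group)
    return out
```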
Calculating daily inventory by groups
The Stock.txt file keeps records of goods moving in and out of the warehouse. The same kind of goods may move in and out several times a day, or see no movement for days in a row. The initial inventory of each kind of goods is 0. In represents goods shipped in and Out goods shipped out. We want to calculate the daily inventory of every kind of goods. Below is the source file:
Explanation: A2 calculates the daily net in-and-out quantity for each kind of goods; B3 stores the complete list of dates between the earliest and the latest date; then group the records by item and loop through the groups, aligning each group with B3 and computing the daily inventory in order. Here's the result:
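The align-and-accumulate step can be sketched in plain Python. Assumptions: the per-day net movements (In minus Out) are already aggregated into a dict, and the complete date list has been built:

```python
def daily_inventory(moves, dates):
    """moves: {(item, date): net quantity for that day};
    dates: the complete ordered date list. Returns {(item, date): inventory
    at the end of that day}, carrying the balance across idle days."""
    items = {item for item, _ in moves}
    inventory = {}
    for item in items:
        balance = 0
        for day in dates:
            balance += moves.get((item, day), 0)  # 0 on days with no movement
            inventory[(item, day)] = balance
    return inventory
```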