Structured text files are a common file format, and tasks for handling them are equally common. One obvious approach is to import such a file into a database for manipulation, but that is time-consuming and puts a heavy load on the database; in some cases there is not even a database available. It is convenient to process the file directly, but unfortunately high-level languages generally don't provide class libraries for structured data handling, so hardcoded computations become complicated and cumbersome, and the code is difficult to maintain.
esProc is designed for computing structured data. It encapsulates a rich set of functions and supports set operations, association-based computations and order-related computations, which makes it easy to manipulate structured text files when it serves as a class library. It also offers a JDBC interface for invocation from Java applications (see How to Use esProc as the Class Library for Java).
Below are typical examples of handling structured text files and their esProc solutions.
Insert a record at the position of the second row of sales.txt. Below is the original file:
The insert function adds one or more records. Its first parameter is the position where the record is inserted; if the parameter is 0, the record is appended.
To append a record to a file, there’s no need to import the file. Just use the export@a function as follows:
The create function creates a new two-dimensional table. The record function appends a record to the table.
Use insert@r to insert a batch of records, as the following code does:
To save space, the code for exporting the file will be omitted in the following sections.
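Outside esProc, the same row insertion and appending can be sketched in plain Python by splicing rows into the parsed table. The field names and sample rows below are hypothetical stand-ins for sales.txt:

```python
# A tab-separated table held as a list of rows: header first, then data.
# Sample data is hypothetical, standing in for sales.txt.
rows = [
    ["OrderID", "Client", "SellerId", "Amount", "OrderDate"],
    ["1", "DSG", "10", "2000", "2009-01-01"],
    ["2", "EGU", "3", "1400", "2009-01-02"],
]

# Insert a new record as the second data row (index 2 counting the header),
# mirroring insert with a positional first parameter.
new_rec = ["3", "JFS", "2", "2000", "2009-01-03"]
rows.insert(2, new_rec)

# Appending needs no positional work -- rows.append(...) plays the role of
# export@a, which appends to the file without re-importing it.
text = "\n".join("\t".join(r) for r in rows)
```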
Delete the second record from sales.txt. The code is as follows:
You can use the delete function to delete records in batches. For example, to delete records 2, 3, 5, 6 and 7, use A1.delete([2,3]|to(5,7)).
Or you can perform conditional deletion. For example, to delete records whose Amount is less than 1000, use A1.delete(A1.select(Amount<1000)).
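Both styles of deletion, positional and conditional, have direct counterparts in a general-purpose language. A minimal Python sketch over hypothetical in-memory records:

```python
# Hypothetical in-memory stand-in for the imported sales.txt.
records = [
    {"OrderID": 1, "Amount": 1200},
    {"OrderID": 2, "Amount": 800},
    {"OrderID": 3, "Amount": 3000},
    {"OrderID": 4, "Amount": 500},
]

# Positional deletion: remove the 2nd record (1-based), like A1.delete(2).
del records[1]

# Conditional deletion, like A1.delete(A1.select(Amount<1000)):
records = [r for r in records if not r["Amount"] < 1000]
```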
Change the SellerId of the second record of sales.txt to 100, and its Amount to 1000. The code is as follows:
You can also perform a batch modification. For example, to add 10 to the Amount of each of the first 10 records, use A1.modify(1:10,Amount+10:Amount).
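Single-record and batch modification amount to in-place field updates; a Python sketch with hypothetical records:

```python
# Hypothetical records standing in for the imported sales.txt.
records = [{"SellerId": 5, "Amount": 900}, {"SellerId": 7, "Amount": 1100}]

# Modify the 2nd record's fields, like the esProc modify call above.
records[1].update(SellerId=100, Amount=1000)

# Batch modification: add 10 to the Amount of each of the first 10 records.
for r in records[:10]:
    r["Amount"] += 10
```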
Add a computed column year to sales.txt. The code is as follows:
Physically removing columns is inefficient. Usually the operation is achieved through the equivalent action of retrieving the desired columns. Suppose you want to retain only the OrderID, Amount and OrderDate fields of sales.txt:
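Adding a computed column and "removing" columns by projection look like this in a Python sketch (field names hypothetical):

```python
# Hypothetical record standing in for a sales.txt row.
records = [{"OrderID": 1, "Amount": 1200, "OrderDate": "2009-01-05", "Client": "DSG"}]

# Derive a computed column "year" from OrderDate, like a derive/new step.
for r in records:
    r["year"] = int(r["OrderDate"][:4])

# "Remove" columns by projecting only the ones to keep.
kept = [{k: r[k] for k in ("OrderID", "Amount", "OrderDate")} for r in records]
```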
Increase each value of the Amount column of sales.txt by 10%. The code is as follows:
Alternative methods include the modify and new functions.
Query sales.txt using parameters that define a time period. The code is as follows:
A2=A1.select(OrderDate>=startDate && OrderDate<=endDate)
startDate and endDate are input parameters, such as 2010-01-01 and 2010-12-31 that define a certain time period. Here’s the result:
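The parameterized range filter corresponds to a simple comparison chain; a Python sketch with hypothetical orders:

```python
from datetime import date

# Hypothetical orders standing in for sales.txt.
orders = [
    {"OrderID": 1, "OrderDate": date(2009, 11, 20)},
    {"OrderID": 2, "OrderDate": date(2010, 3, 2)},
    {"OrderID": 3, "OrderDate": date(2011, 1, 1)},
]

# startDate/endDate play the role of the esProc input parameters.
start_date, end_date = date(2010, 1, 1), date(2010, 12, 31)
result = [o for o in orders if start_date <= o["OrderDate"] <= end_date]
```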
Sort sales.txt by client code in descending order, and by year and month in ascending order.
Use the sign "-" to sort in descending order. The year and month need to be computed from the order date.
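A mixed ascending/descending multi-key sort can be sketched in Python with two stable sort passes, applying the minor (ascending) key first (sample rows hypothetical):

```python
from datetime import date

# Hypothetical rows: Client code plus an order date.
orders = [
    {"Client": "B", "OrderDate": date(2009, 3, 1)},
    {"Client": "A", "OrderDate": date(2009, 2, 1)},
    {"Client": "B", "OrderDate": date(2009, 1, 15)},
]

# Pass 1: ascending by (year, month).
orders.sort(key=lambda o: (o["OrderDate"].year, o["OrderDate"].month))
# Pass 2: descending by Client; the sort is stable, so the date
# order within each client is preserved.
orders.sort(key=lambda o: o["Client"], reverse=True)
```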
Grouping and aggregation
Calculate the annual sales amount and the annual number of orders for each salesperson.
The groups function can group data and perform aggregation at the same time. ~ stands for each group or the current group. count(~) is equal to count(OrderID).
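Grouping with simultaneous aggregation, as groups does, can be sketched with a dictionary of accumulators (sample orders hypothetical):

```python
from collections import defaultdict

# Hypothetical orders standing in for sales.txt.
orders = [
    {"SellerId": 1, "OrderID": 101, "Amount": 500, "OrderDate": "2009-03-01"},
    {"SellerId": 1, "OrderID": 102, "Amount": 700, "OrderDate": "2009-06-01"},
    {"SellerId": 2, "OrderID": 103, "Amount": 900, "OrderDate": "2010-01-10"},
]

# Group by (salesperson, year) and aggregate sum(Amount) and count in one pass.
agg = defaultdict(lambda: {"amount": 0, "count": 0})
for o in orders:
    key = (o["SellerId"], int(o["OrderDate"][:4]))
    agg[key]["amount"] += o["Amount"]
    agg[key]["count"] += 1
```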
Getting distinct values
Display the clients listed by sales.txt.
Removing duplicate values
Find out the first record of each client and of each salesperson from sales.txt.
The group function groups data (without having to aggregate each group). The @1 option gets the first record of each group.
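Getting distinct values and keeping only the first record per group can both be sketched in Python (sample rows hypothetical):

```python
# Hypothetical orders standing in for sales.txt.
orders = [
    {"Client": "DSG", "SellerId": 1, "OrderID": 1},
    {"Client": "EGU", "SellerId": 1, "OrderID": 2},
    {"Client": "DSG", "SellerId": 2, "OrderID": 3},
]

# Distinct clients, preserving first-seen order.
clients = list(dict.fromkeys(o["Client"] for o in orders))

# First record per (Client, SellerId) group, like group@1:
# setdefault keeps only the first record seen for each key.
first = {}
for o in orders:
    first.setdefault((o["Client"], o["SellerId"]), o)
```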
Finding Top N
Find the top 3 orders for each salesperson in terms of amount.
The top function filters out the top N in each group. The sign “-” means performing sorting in a reverse direction. The conj function concatenates records.
To find the order with the biggest amount, you can use the maxp function.
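The group-then-top-N-then-concatenate pattern, and the single-maximum case, can be sketched with heapq (sample orders hypothetical):

```python
import heapq
from collections import defaultdict

# Hypothetical orders standing in for sales.txt.
orders = [
    {"SellerId": 1, "OrderID": 1, "Amount": 100},
    {"SellerId": 1, "OrderID": 2, "Amount": 300},
    {"SellerId": 1, "OrderID": 3, "Amount": 200},
    {"SellerId": 1, "OrderID": 4, "Amount": 50},
    {"SellerId": 2, "OrderID": 5, "Amount": 400},
]

groups = defaultdict(list)
for o in orders:
    groups[o["SellerId"]].append(o)

# Top 3 orders by Amount within each group, then concatenate (like conj).
top3 = []
for sid in sorted(groups):
    top3.extend(heapq.nlargest(3, groups[sid], key=lambda o: o["Amount"]))

# The single biggest order overall, playing the role of maxp.
biggest = max(orders, key=lambda o: o["Amount"])
```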
Set operations (for file comparison)
There are two files, f1.txt and f2.txt, in which the first rows are the column names. The requirement is to find the intersection of the two files' Name fields. Below are selections from the files:
The f1.txt file:
The f2.txt file:
The isect function performs intersect operation on sets. A1.(Name) retrieves A1’s Name field to form a set. B1.(Name) retrieves B1’s Name field to form a set. The final result is as follows:
Likewise, the union function gets the union of sets, the diff function finds the difference between sets, and the conj function calculates the concatenation of sets (equal to union all). You can also use operators to replace the functions to produce cleaner and simpler code. Below are counterparts of the isect, union, diff and conj functions:
A1.(Name) ^ B1.(Name)
A1.(Name) & B1.(Name)
A1.(Name) \ B1.(Name)
A1.(Name) | B1.(Name)
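The four operations map directly onto Python's set operators, with concatenation (union all) done on lists; the Name columns below are hypothetical:

```python
# Hypothetical Name columns extracted from f1.txt and f2.txt.
a = ["Tom", "Ann", "Bob", "Tom"]
b = ["Bob", "Eve", "Tom"]

sa, sb = set(a), set(b)
isect = sorted(sa & sb)   # intersection  (^)
union = sorted(sa | sb)   # union         (&)
diff  = sorted(sa - sb)   # difference    (\)
conj  = a + b             # concatenation (|), i.e. union all, keeps duplicates
```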
The above is the set operations on columns. To perform set operations on rows, just retrieve the file by rows as shown below:
Table1.txt and Table2.txt have been sorted by their shared logical primary key (sort them first if they are not ordered). The requirement is to update Table1 with Table2 by comparing primary key values: when the key values are identical, Table2's record replaces Table1's; when they are different, Table2's record is inserted into Table1.
Below are the source tables:
The expected result of updating Table1 with Table2 is as follows:
Retrieve Table1.txt and Table2.txt as cursors, and update A1 with B1 according to their logical primary keys. The merge function combines the data and keeps the result ordered. @x means cursor manipulation, and @u means getting the union. Then write the result set to a new file.
With cursors, you can handle files of any size, including extremely large ones.
If the files are originally not ordered, they need to be sorted before being merged. In that case, A2 can be written as follows:
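The ordered merge-update itself is a single two-pointer pass over the sorted inputs; a Python sketch with hypothetical keys and values:

```python
# Both tables sorted by the primary key "ID"; rows are hypothetical.
table1 = [{"ID": 1, "Val": "a"}, {"ID": 2, "Val": "b"}, {"ID": 4, "Val": "d"}]
table2 = [{"ID": 2, "Val": "B"}, {"ID": 3, "Val": "C"}]

# Merge-update: where keys match, Table2's record wins (update);
# where they differ, both records are kept (insert); output stays ordered.
result, i, j = [], 0, 0
while i < len(table1) and j < len(table2):
    if table1[i]["ID"] < table2[j]["ID"]:
        result.append(table1[i]); i += 1
    elif table1[i]["ID"] > table2[j]["ID"]:
        result.append(table2[j]); j += 1
    else:
        result.append(table2[j]); i += 1; j += 1
result += table1[i:] + table2[j:]
```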
Order-related set operations
The f1.txt file and f2.txt file have been sorted by Name and Dept, and you are required to find their intersection. Below are the source files:
The f1.txt file:
The f2.txt file:
If files are already sorted, you can use the merge algorithm to achieve the same result as the set operations, but with better performance. Here's the code:
The merge function combines cursors together, with @i used for getting intersection, @u for union and @d for difference.
If the files to be merged are not ordered, they need to be sorted with the sort function first. Note, however, that sorting and merging small files is slower than handling them through set operations, so it is only reasonable to use the merge algorithm for large files.
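An ordered intersection over whole rows (here keyed on the pair Name, Dept) is the same two-pointer walk, advancing whichever side is smaller; sample rows hypothetical:

```python
# Both row lists sorted by (Name, Dept); rows are hypothetical.
t1 = [("Ann", "HR"), ("Bob", "IT"), ("Eve", "QA")]
t2 = [("Bob", "IT"), ("Eve", "Dev")]

# Merge-style intersection: only rows present in both survive.
inter, i, j = [], 0, 0
while i < len(t1) and j < len(t2):
    if t1[i] < t2[j]:
        i += 1
    elif t1[i] > t2[j]:
        j += 1
    else:
        inter.append(t1[i]); i += 1; j += 1
```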
emp.txt is a tab-separated text file whose EId field corresponds to sales.txt's SellerId field. The requirement is to create sOrder.txt by joining the two files, adding emp.txt's Name, Dept and Gender fields to the matching sales records.
Below is the source data:
A4=A3.new(s.OrderID, s.Client, s.SellerId, s.Amount, s.OrderDate, e.Name, e.Dept, e.Gender)
In the above code, the join function performs the table join and renames the tables s and e. It performs an inner join by default; use @1 for a left join and @f for a full join. After the join is done, get the needed fields from the join result to form a new two-dimensional table. Here's the final result:
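The inner join plus projection can be sketched in Python by indexing the employee table by its key (all rows hypothetical):

```python
# Hypothetical rows standing in for sales.txt (s) and emp.txt (e),
# with emp indexed by EId for O(1) lookup.
sales = [{"OrderID": 1, "SellerId": 10, "Amount": 500},
         {"OrderID": 2, "SellerId": 11, "Amount": 800}]
emp = {10: {"Name": "Tom", "Dept": "Sales", "Gender": "M"},
       11: {"Name": "Ann", "Dept": "HR", "Gender": "F"}}

# Inner join on SellerId == EId, then merge the wanted fields per record.
joined = [{**s, **emp[s["SellerId"]]}
          for s in sales if s["SellerId"] in emp]
```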
The source data includes 5 files, in which Orders is the fact table and Customer, Product, Region and Supplier are dimension tables. Now find out the orders whose clients and suppliers are in the same region, group them by city and calculate the number of orders and order amount in each city.
Their relationships are as follows:
Import every file and establish the relationships between the fact table and each of the dimension tables; filter out the desired orders according to these relationships, then group and aggregate them. The switch function establishes the foreign-key relationships.
Finding modified records from the CSV file
old.csv is the original file, and new.csv is the file that results from adding, deleting and modifying certain records of the old file. Both use userName and date as their logical primary key. The requirement is to find which records in old.csv have been modified.
Below is a selection of the source file:
It's clear that in new.csv the second and third rows are newly added and the fourth row is a modification of an old record, and that old.csv's third row has been deleted.
Import each file with the comma as the separator and sort it by the key. Then find the newly added records (those whose primary key values appear only in the new file) and store them in the variable new; next, find all records that differ between the two files (those in which any field differs) and store them in the variable diff; finally, take the difference between diff and new, which yields the records that have been modified (same primary key, different non-key fields). Here's the result:
Extended explanation: To find the deleted record, use [A2,B2].merge@d(userName,date).
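The added/deleted/modified classification reduces to key-set comparisons once each file is keyed on (userName, date); a Python sketch with hypothetical records:

```python
# Both files keyed on (userName, date); rows are hypothetical.
old = {("tom", "2014-01-01"): {"saleValue": 100},
       ("ann", "2014-01-02"): {"saleValue": 200}}
new = {("tom", "2014-01-01"): {"saleValue": 150},   # modified
       ("bob", "2014-01-03"): {"saleValue": 300}}   # added

added    = [k for k in new if k not in old]                      # new keys
deleted  = [k for k in old if k not in new]                      # vanished keys
modified = [k for k in new if k in old and new[k] != old[k]]     # same key, new values
```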
Comparison of files with different structures
Data.txt is a tab-separated file with 6 fields, in which the here field contains semicolon-separated strings. List.txt has only a single column. Now compare the two files: if any of the values obtained by splitting a record's here field matches a row of List.txt, export that record to result.txt.
Below is the source data:
In the above code, the select function does the querying, the array function splits a string, the operator "^" means intersection, and [] stands for an empty set.
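The split-then-intersect filter is a one-line comprehension in Python; the field values and list entries below are hypothetical:

```python
# Hypothetical records from Data.txt (only the relevant fields shown)
# and the single-column rows of List.txt as a set.
data = [{"id": 1, "here": "NY;LA"}, {"id": 2, "here": "SF"}]
targets = {"LA", "Chicago"}

# Keep records whose split "here" values intersect the target list;
# a non-empty set is truthy, matching the "result != empty set" test.
matched = [r for r in data if set(r["here"].split(";")) & targets]
```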
Retrieving data from files in a multilevel directory
The directory “D:\files” contains multilevel subdirectories, each of which includes many files of text format. The requirement is to retrieve a specified row (say, the second row) from each file and write it to a new file result.txt.
The initial value of the parameter path should be "D:\files", meaning the retrieval starts from this directory. The script (readfile.dfx) then calls itself recursively with a different path value each time.
The directory function gets the list of files under the directory given by the path parameter. The @p option returns file names with full paths, and the @d option gets directory names instead. ~.skip(1) skips a row; ~.fetch@x(1) fetches the row at the current position (i.e. the second row) and closes the cursor.
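The recursive walk plus "skip one row, fetch the next" step can be sketched with os.walk; the directory tree built below is a hypothetical stand-in for D:\files:

```python
import os
import tempfile

# Build a small hypothetical directory tree with a subdirectory,
# then pull the 2nd row of every file found by a recursive walk.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub"))
for rel, lines in [("a.txt", ["r1", "r2"]),
                   (os.path.join("sub", "b.txt"), ["x1", "x2"])]:
    with open(os.path.join(root, rel), "w") as f:
        f.write("\n".join(lines))

second_rows = []
for dirpath, _dirs, files in os.walk(root):      # recurses into subdirectories
    for name in sorted(files):
        with open(os.path.join(dirpath, name)) as f:
            f.readline()                          # skip the first row
            second_rows.append(f.readline().strip())  # fetch the second row
```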
Exporting the file by group
Divide sales.txt, which holds a large amount of order data, into multiple smaller files according to year and month, with the file names in the format "year-month.txt".
Group the file by the year and month and parse the date, then loop through each group to write it into a file. Below is the exported file of 2009-01.txt:
If the source data exceeds the memory capacity, import it with the cursor function. If the data in a single group still can't be held in memory, use the groupx function to perform the grouping. Neither case changes the structure of the code.
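The group-by-month step that precedes the per-group export can be sketched in Python; the order rows are hypothetical, and the actual file writing per group is indicated in a comment:

```python
from collections import defaultdict

# Hypothetical (OrderDate, record) pairs standing in for sales.txt rows.
orders = [("2009-01-05", "rec1"), ("2009-01-09", "rec2"), ("2009-02-01", "rec3")]

# Bucket records by the "YYYY-MM" prefix of the date.
groups = defaultdict(list)
for d, rec in orders:
    groups[d[:7]].append(rec)

# Each group would then be written to a file named f"{key}.txt",
# e.g. 2009-01.txt -- omitted here to keep the sketch self-contained.
```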
Calculating the inventory
The Stock.txt file holds warehouse-management records. On a given day, the stock of a product may be added to or removed from multiple times, and there may be days on which no product moves in or out at all. The initial quantity of every product is 0; In increases it and Out decreases it. Now find the daily inventory of every product. Below is the source data:
Explanation: A2 calculates the daily In and Out quantities for every product. Then the complete list of dates, from the smallest date to the largest, is built and stored in B3. Next, group the data by product, loop through each group, and align it with B3 to compute the inventory for each day. Here's the result:
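The same steps, daily net change per product, a complete date list, then a running balance aligned with that list, can be sketched in Python over hypothetical stock movements:

```python
from datetime import date, timedelta
from collections import defaultdict

# Hypothetical movements: (product, date, quantity in, quantity out).
moves = [
    ("A", date(2014, 1, 1), 10, 0),
    ("A", date(2014, 1, 3), 0, 4),
]

# Step 1: daily net change per (product, day); missing days default to 0.
net = defaultdict(int)
for p, d, qin, qout in moves:
    net[(p, d)] += qin - qout

# Step 2: the complete date range, including days with no movement.
days = [date(2014, 1, 1) + timedelta(n) for n in range(3)]

# Step 3: running balance per product, aligned with the full date list.
stock, bal = {}, defaultdict(int)
products = {m[0] for m in moves}
for d in days:
    for p in products:
        bal[p] += net[(p, d)]
        stock[(p, d)] = bal[p]
```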