Because of their complex formats and unstandardized data, many text files cannot be computed on directly. When used as a data source, they need to be preprocessed into structured data or database tables before further querying or statistics. Though we can perform this conversion in high-level languages like Java or scripting languages like Python, the code becomes very complicated for complex problems because these languages offer inadequate support for structured-data computing.
With its various class libraries for structured-data computing, esProc accomplishes both average and more complex conversions with ease, whether run independently or invoked from the command line or a Java program. More information can be found in the article How esProc Implements Text Processing.
Below are some typical scenarios involving the conversion of text files to structured data, along with their esProc solutions.
Specified separator and partial retrieval
The sales.txt file stores ordering data, with columns separated by tabs. Now we need to export the data to the empty table Tsales with the same structure for processing. Here’s the source file:
Explanation: the import function reads all the fields into memory, using the tab as the default separator, and generates a two-dimensional table. The update function exports the two-dimensional table to the database; all fields are retrieved by default.
If we retrieve only some of the fields and use a comma as the separator, the code should be import@t(OrderID,Client; ",").
If the file is too big to be loaded into memory entirely, we can open it as a cursor, then fetch data and export it to the database in batches. Fetching, say, 10,000 rows each time, we can use the following code:
We can also write the converted data into a file with the code file("e:\\result.txt").export@t(A1).
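For readers more familiar with general-purpose languages, the batched import-and-export routine can be sketched in Python. This is a hypothetical analogue, not the esProc implementation: the table name Tsales and the 10,000-row batch size come from the text, while the function name load_in_batches is invented for illustration.

```python
import csv
import sqlite3

def load_in_batches(src_path, conn, batch_size=10000):
    """Read a tab-separated file and insert its rows into Tsales in batches."""
    cur = conn.cursor()
    with open(src_path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        header = next(reader)                       # first line holds the field names
        cur.execute("CREATE TABLE IF NOT EXISTS Tsales (%s)" % ",".join(header))
        insert = "INSERT INTO Tsales VALUES (%s)" % ",".join("?" * len(header))
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == batch_size:            # flush a full batch
                cur.executemany(insert, batch)
                batch = []
        if batch:                                   # flush the remaining tail
            cur.executemany(insert, batch)
    conn.commit()
```

The batching keeps memory use bounded for large files, which is the same motivation as the esProc cursor.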
Note: Generally, a file that has been converted to structured data is exported to a database or written into a file. This routine step is omitted in the following cases.
Log.txt is a Java console log file, from which we want to filter out the records falling between 2013-08-26 and 2013-09-01 and structure them in three fields: time (only the date part is wanted), level (with three possible values: ERROR\INFO\DEBUG) and content (all the information after level). Below is the source data:
|2||=A1.(~.array(" ")).new(date(~(1),"[yyyy-MM-dd HH:mm:ss,SSS]"):time, ~(2):level, ~.to(3,).string(" "):content)|
|3||=A2.select(time>=date("2013-08-26") && time<=date("2013-09-01"))|
Explanation: The code first rearranges the file into structured data and then makes the query. The array function splits a string into a set by a specified separator; ~(1) represents the first member of the set, and ~.to(3,) the members from the third to the last. The string function concatenates the members of a set into a string with a specified separator. Here's the result:
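The same split-then-filter idea can be sketched in Python. This is only an illustrative analogue; the exact line layout is an assumption (here: a bracketed timestamp such as "[2013-08-27 10:12:05,123]", then the level, then the message), and the function names are invented.

```python
from datetime import date, datetime

def parse_log(lines):
    """Structure raw log lines into time/level/content records."""
    records = []
    for line in lines:
        parts = line.split(" ")
        # parts[0] and parts[1] together form "[date time]"; strip the brackets
        stamp = (parts[0] + " " + parts[1]).strip("[]")
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f").date()
        records.append({"time": t, "level": parts[2],
                        "content": " ".join(parts[3:])})
    return records

def filter_period(records, start, end):
    """Keep only the records whose date falls in [start, end]."""
    return [r for r in records if start <= r["time"] <= end]
```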
The file data.csv contains quotation marks, some of which interfere with data handling. So we need to remove these quotation marks before exporting the data to Java. Below is the source file:
Explanation: Import the file and then remove those quotation marks. \ is the escape character. Here’s the result:
If the number of fields is indefinite, we can use esProc's dynamic functions to loop through each field and remove the quotation marks. The code can be written as A1.fno().(A1.field(~,A1.field(~).(replace(~,"\"","")))).
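The dynamic variant is easy to mirror in Python, since a per-line split naturally handles any number of columns. A minimal sketch, with the function name invented and a comma separator assumed:

```python
def strip_quotes(text):
    """Remove embedded double quotes from every field, whatever the column count."""
    return [[field.replace('"', "") for field in line.split(",")]
            for line in text.splitlines()]
```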
One scenario requires that all formulas in the text file be parsed as expressions and evaluated before the data is exported. Here’s a data source:
The eval function dynamically parses a string into an expression and evaluates it.
Here’s the result:
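Python offers the same dynamic-evaluation facility through its built-in eval(). A minimal sketch, assuming the file holds simple arithmetic formulas such as "3*8-6" (the sample expressions are invented):

```python
def evaluate_formulas(formulas):
    """Parse each formula string and evaluate it."""
    # eval() is acceptable for trusted input; an empty builtins dict is a
    # light safeguard against accidental name lookups
    return [eval(expr, {"__builtins__": {}}) for expr in formulas]
```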
Evaluation.txt holds the performance evaluations of employees. Each employee has three evaluations, which are string-typed and case-insensitive. We want to structure the file as a two-dimensional table with the fields employee, evaluation 1, evaluation 2 and evaluation 3, transforming the string-typed evaluations into scores. The correspondence is: "improvable":60, "normal":70, "good":80, "very good":90, "excellent":100. Below is the source file:
Explanation: Import the source file into two fields split by the colon; convert the second field to lower case, split the strings into sets by commas and convert each member to a number. Then get the three evaluation results by retrieving the members of each set by their sequence numbers. The employee names constitute the first field.
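The score conversion can be sketched in Python as well. The mapping table comes from the article; the line layout "name: eval1,eval2,eval3" and the function name are assumptions for illustration.

```python
# correspondence between textual evaluations and scores, from the article
SCORES = {"improvable": 60, "normal": 70, "good": 80,
          "very good": 90, "excellent": 100}

def to_scores(lines):
    """Turn 'name: eval1,eval2,eval3' lines into a list of score records."""
    table = []
    for line in lines:
        name, evals = line.split(":")
        # lower-case lookup, since the evaluations are case-insensitive
        s = [SCORES[e.strip().lower()] for e in evals.split(",")]
        table.append({"employee": name.strip(),
                      "eval1": s[0], "eval2": s[1], "eval3": s[2]})
    return table
```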
mail.csv stores email sending records: the first column is the sender, which corresponds to multiple recipients in the columns from the 2nd to the (N-1)th. We want to rearrange the file into a structured two-dimensional table where one sender corresponds to one recipient per row. Below is the source file:
Explanation: Convert the imported lines (except the first line) to a series of sets by commas; for every set, combine the first member with each member from the 2nd to the (N-1)th into a record. Here's what we get:
Dynamic one-to-many conversion
Budget.csv stores the budget of every department in every year. It is nonstandard because the field names contain specific departments. We need to restructure the file into a standardized two-dimensional table with the fields year\dept\manager\empCount\budget. Note that the names and the number of fields in the source file are not fixed, so a dynamic two-dimensional table is needed. Below is the source data:
Explanation: Split the original field names into two parts – the department names (R&D\sales) and other elements (manager\empCount\budget); create an empty two-dimensional table according to the second part and fill it up with data items from the original file.
Here’s the result:
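A dynamic unpivot of this kind can be sketched in Python too. The column-name convention "<dept> <attribute>" (e.g. "R&D manager") is an assumption based on the description above, and the function name is invented:

```python
def unpivot(lines):
    """Restructure 'year, <dept> <attr>, ...' columns into year/dept/attr rows."""
    header = lines[0].split(",")
    depts, attrs = [], []
    for name in header[1:]:
        dept, attr = name.split(" ", 1)     # "R&D manager" -> ("R&D", "manager")
        if dept not in depts:
            depts.append(dept)
        if attr not in attrs:
            attrs.append(attr)
    rows = []
    for line in lines[1:]:
        vals = line.split(",")
        year = vals[0]
        for d, dept in enumerate(depts):    # one output record per department
            rec = {"year": year, "dept": dept}
            for a, attr in enumerate(attrs):
                rec[attr] = vals[1 + d * len(attrs) + a]
            rows.append(rec)
    return rows
```

Because the department and attribute lists are discovered from the header, the code keeps working when fields are added or renamed.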
File splitting, grouping and recombination
ClientSales.txt records the relationships between clients and sellers. The values of the Client field are enclosed in brackets, there's a space between the client and the seller, and a blank line between every two records. Now we need to rearrange the file into a two-dimensional table and store it in the database. The table will have two fields: Client (the clients) and SalesList (all the comma-separated sellers the client is in contact with). Below is the source file:
Explanation: Read in the file as a set of lines, remove the blank lines and separate each of the remaining lines into two parts – Client and Sales – by spaces, stripping the brackets from Client; then group the records by Client and join the Sales values in each group with commas under the new name SalesList. Here's the structured file:
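The clean-then-group pipeline maps naturally onto a Python dictionary. A hedged sketch of the same steps (function name invented; "[client] seller" line layout taken from the description above):

```python
from collections import OrderedDict

def group_sales(lines):
    """Group sellers per client, joining each client's sellers with commas."""
    groups = OrderedDict()
    for line in lines:
        if not line.strip():            # drop blank lines between records
            continue
        client, seller = line.split()
        client = client.strip("[]")     # remove the enclosing brackets
        groups.setdefault(client, []).append(seller)
    return [{"Client": c, "SalesList": ",".join(s)} for c, s in groups.items()]
```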
The sales.txt file stores a list of sellers and the customer.txt stores the list of customers each seller has. The two files correspond with each other through line numbers. What we want is to combine the two files into a structured two-dimensional table having two fields – sales and customer. Below are the source files:
Explanation: Split each line of the customer file into a set according to the commas, remove the numbers and parentheses and retain only the customer names; align the two files according to the line numbers; and finally retrieve the desired fields. Here’s the result:
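Aligning two files by line number is a simple zip in Python. A sketch under stated assumptions: the recipient format "1(jim),2(kate)" (number plus parenthesized name) is inferred from the description of removing numbers and parentheses, and the function name is invented.

```python
import re

def align_by_line(sales_lines, customer_lines):
    """Join the two files row-by-row, keeping only names inside parentheses."""
    result = []
    for seller, cust_line in zip(sales_lines, customer_lines):
        # "1(jim),2(kate)" -> ["jim", "kate"]
        names = re.findall(r"\(([^)]*)\)", cust_line)
        result.append({"sales": seller, "customer": ",".join(names)})
    return result
```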
In the following file, every three lines correspond to one record. For example, the first record is 26\TAS\1\2142.4\2009-08-05. We need to transform it into a standardized two-dimensional table.
|3||=A2.new(~(1):OrderID, (line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )|
First import the file as a sequence, with @s meaning the fields are not split; then group the sequence every three lines, where "#" represents the line number and "\" represents integer division; finally create a table sequence from each group, where ~(1) represents the first member of the current group. The array function splits a string into a sequence. Here's the result:
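The three-lines-per-record reshaping can be sketched in Python as follows. It assumes, based on the cell code above, that line 1 of each record holds OrderID, line 2 holds Client/SellerId/Amount separated by tabs, and line 3 holds OrderDate; the function name is invented.

```python
def reshape(lines):
    """Group every three lines into one five-field record."""
    records = []
    for i in range(0, len(lines), 3):           # step through the file in threes
        order_id, middle, order_date = lines[i:i + 3]
        client, seller_id, amount = middle.split("\t")
        records.append({"OrderID": order_id, "Client": client,
                        "SellerId": seller_id, "Amount": amount,
                        "OrderDate": order_date})
    return records
```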
If the file is too big to be loaded into memory, we can open it as a cursor, handle the data and append the results to a new file or the database in batches. Here's the code for handling this scenario:
|3||=A2.new(~(1):OrderID, (line=~(2).array("\t"))(1):Client,line(2):SellerId,line(3):Amount,~(3):OrderDate )|
Grouping a file by certain data item before structuring it
In the data.txt file, each record includes multiple lines, but the number of lines isn't necessarily the same. Yet every record throughout the file has the same fixed fields: "Object Type:", "left:", "top:" and "Line Color:". The first record, for instance, is Symbol1, 14, 11, RGB( 1 0 0 ). We want to structure it into a two-dimensional table. Here's the original file:
|2||=A1.array("Object Type: ").to(2,)|
|3||=A2.new(~.array("\r\n")(1):OType,mid(~,s=pos(~,"left: ")+len("left: "),pos(~,"\r\n",s)-s):L,mid(~,s=pos(~,"top: ")+len("top: "),pos(~,"\r\n",s)-s):T,mid(~,s=pos(~,"Line Color: ")+len("Line Color: "),if(r=pos(~,"\r\n",s),r,len(~))-s+1):LColor)|
Explanation: Split the file by "Object Type" and delete the first blank line of each part; create a table sequence and extract the desired fields using the string functions array, pos, len and mid. Note that there may be no carriage return after the last line, so we need to check for one with an if statement. Here's the final result:
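A regex-based Python sketch of the same extraction is shown below. The field labels ("Object Type: ", "left: ", "top: ", "Line Color: ") follow the article; the function name and the sample layout are assumptions.

```python
import re

def parse_objects(text):
    """Split the text on 'Object Type: ' and pull out the labeled fields."""
    records = []
    for block in text.split("Object Type: ")[1:]:   # skip anything before record 1
        lines = block.splitlines()
        rec = {"OType": lines[0]}                   # first line of the block
        rest = "\n".join(lines[1:])
        rec["L"] = re.search(r"left: (.*)", rest).group(1)
        rec["T"] = re.search(r"top: (.*)", rest).group(1)
        # works whether or not a newline follows the last line
        rec["LColor"] = re.search(r"Line Color: (.*)", rest).group(1)
        records.append(rec)
    return records
```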
Grouping a big file by certain data item
Order.txt stores ordering data grouped by Client, where each client has multiple records and each record has three fields, each taking up one line. The fields are OrderID\Amount\OrderDate. The requirement is to rearrange the file into semicolon-separated structured data with the fields Client\OrderID\Amount\OrderDate. Below is the source file:
|2||for A1;#1=="Client"||if A2.#1=="Client"||=A2.#2|
Explanation: Open the file as a cursor and divide the data into 2N groups (N being the number of clients) by checking whether the first column is "Client"; retrieve Client from the single-line record in each odd group and compose M/3 records from the M lines of each even group; then append the data items under the same Client to result.txt. Here's the result:
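A streaming Python sketch of the same idea follows. The assumptions, inferred from the description above, are that a group starts with a line "Client<TAB>name" followed by records of three lines each (OrderID, Amount, OrderDate), and that output rows are semicolon-separated; the function name is invented.

```python
def restructure(in_path, out_path):
    """Stream the grouped file, emitting one semicolon-separated row per record."""
    with open(in_path) as src, open(out_path, "w") as dst:
        client, buf = None, []
        for line in src:
            line = line.rstrip("\n")
            if line.startswith("Client"):
                client = line.split("\t")[1]        # new group: remember the client
                buf = []
            else:
                buf.append(line)
                if len(buf) == 3:                   # one full record collected
                    dst.write(";".join([client] + buf) + "\n")
                    buf = []
```

Since only three lines are buffered at a time, memory use stays constant no matter how big the file is.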
Grouping data by certain condition
The data.txt file stores multiple groups of data. The first line of each group is the customer code, and the rest of the lines record when the customer purchases different beverages. There are altogether seven kinds of beverages: "liquors,Vodka,Rum,Liqueur,liquors,Whisky,Cordials". We need to create a two-dimensional table based on this file for further data handling. The table will have eight fields: Client and the seven beverages. Below is the source data:
|5||=A2.record(A4(1) | B4.(#1))|
Explanation: A2 creates an empty 8-field two-dimensional table from the parameter list; the source data is grouped by checking whether the first character is a letter, with each group corresponding to one record in the final result; loop through each group to compose a record and append it to A2. Here's the final result:
Retrieving records based on fields
In the following file, column.txt, every field from the 11th on is a record, such as "XNo1\99" shown below. Now we want to retrieve these records.
Explanation: Import the file as a set of lines; from the 11th column on, extract each column as a separate field to form a 5-field two-dimensional table; concatenate these columns into a set of five rows; finally split each row into two fields by the separator to generate a two-dimensional table. Here's the result:
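In Python the same extraction can be done with plain slicing. A hedged sketch: tab separation between columns and a backslash inside each record-like cell are assumptions based on the "XNo1\99" sample, and the function name is invented.

```python
def extract_records(lines):
    """From the 11th column on, split each cell into a (name, value) record."""
    records = []
    for line in lines:
        for cell in line.split("\t")[10:]:      # 11th column onward (0-based 10)
            name, value = cell.split("\\")      # "XNo1\99" -> ("XNo1", "99")
            records.append((name, value))
    return records
```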
The above algorithm assumes that the file has a fixed five columns. If the number of columns isn't fixed, we need to write dynamic code for A2, as shown below:
Transposition and aggregation
The space-separated file maxVolumn.txt records the most popular product in each chain store at every random check. Now we want to find out how many times each product was selected as a store's most popular, and export the data as a 3-field text file with the fields store\product\times. Below is the source file:
Explanation: Group each column (a store) and aggregate to count how many times each of the most popular products was selected; add a store column to each group's aggregate result and concatenate the data of all stores.
The fno function returns the number of fields a two-dimensional table has. The field function retrieves a column by its sequence number. And the new function constructs a new two-dimensional table based on an existing one. Besides, ~ represents the current member and # represents its sequence number.
Here’s a selection of the resulting table:
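The transpose-and-count step can be sketched in Python with collections.Counter. This is an illustrative analogue under stated assumptions: the header line carries the store names, every later line holds one check result per store, and the function name is invented.

```python
from collections import Counter

def count_picks(lines):
    """Count, per store (column), how often each product was the most popular."""
    stores = lines[0].split(" ")
    counters = [Counter() for _ in stores]      # one counter per column/store
    for line in lines[1:]:
        for i, product in enumerate(line.split(" ")):
            counters[i][product] += 1
    return [(store, product, times)
            for store, c in zip(stores, counters)
            for product, times in c.items()]
```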
Sorting and rearranging based on column grouping
The tab-separated file check.txt stores the random checking results of various beverages. dateN represents the date of the Nth checking, and rateN represents the qualification rate of the Nth checking. We want to sort the records by qualification rates in descending order, while maintaining the original data structure. Below is the source file:
It’s hard to implement this order-related algorithm in SQL, but it’s easier to do it in esProc:
Explanation: Starting from the second field, group the fields in pairs and sort each group by qualification rate; join each sorted result with the first field to form a complete record.
The array function converts the field values of a record into an ordered set, in which t(1) is the first member and t.to(2,) represents the members from the second to the last. group((#-1)\2) groups every two members together by sequence number. The sort function sorts in ascending order by default; we use "-" to sort in descending order. Here's the result:
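The pair-wise reordering can be sketched in Python for a single record. It assumes, per the description above, that the fields after the first come in (date, rate) pairs; the function name is invented.

```python
def sort_checks(record):
    """Re-sort the (date, rate) pairs of a record by rate, descending."""
    head, rest = record[0], record[1:]
    pairs = [rest[i:i + 2] for i in range(0, len(rest), 2)]   # group every two fields
    pairs.sort(key=lambda p: float(p[1]), reverse=True)       # by rate, descending
    return [head] + [f for pair in pairs for f in pair]       # re-attach the first field
```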