Code Examples of Handling Large Text Files in esProc

To deal with a large text file that can’t be wholly held in the memory, you need to import it in segments and process each segment separately. It’s knotty. Sometimes even multithreaded parallel processing is needed so as to increase performance. But since most of the programming languages don’t support basic class libraries, it’s extremely complicated to realize those implementations.

Yet being a professional structured data handling tool, esProc can serve as the class library with its support for cursor functions, as well as the ability of ensuring the integrity of each imported segment of file, simplifying the parallel algorithm and handling large files in an easy way. The Java main program can access the esProc result through JDBC. Find more details from How to Use esProc as the Class Library for Java.

Below are some typical examples of handling large text files in esProc.

Querying large files

The size of sales.txt exceeds the maximum memory capacity. The task is to query this file according to parameters. Below is the source file:

OrderID	Client	SellerId	Amount	OrderDate
26	TAS	1	2142.4	2009-08-05
33	DSGC	1	613.2	2009-08-14
84	GC	1	88.5	2009-10-16
133	HU	1	1419.8	2010-12-12
32	JFS	3	468	2009-08-13
39	NR	3	3016	2010-08-21
43	KT	3	2169	2009-08-27

The code:

	A
1	=file(“D: \\sales.txt”).cursor@t()
2	=A1.select(OrderDate>=startDate && OrderDate<=endDate)

Open the file whose size is larger than the available memory size in the form of the cursor; then use the select function to query data and return result set as a cursor. The Java main program can access the result set in the JDBC stream style.

startDate and endDate are the input parameters, for example, a given time period from 2010-01-01 to 2010-12-31. Here’s the result:

esProc_text_bigtextcode_1

To speed up the retrieval, you can use the multithreading technique by adding @m option, that is, =file(“D: \\sOrder.txt”).cursor@tm(). But the multithreaded parallel retrieval can’t ensure the original data order.

Parallel query processing

The sales.txt file stores a large number of ordering records. The requirement is to perform a high-performance conditional query. Condition: sellerid=1 and client=TAS and orderdate>=2013.

The code:

	A	B
1	4
2	fork to(A1)	=file(“/tools/data/sales.txt”).cursor@t(;A2:A1)
3		=B2.select(sellerid==1 && client==”TAS” && year(date(orderdate))>=2013).fetch()
4	=A2.conj()

Explanation:

Set 4 parallel threads that each retrieve one block of the file, perform the query and return result to the main program that is responsible for concatenating the returned result sets. Here the select function performs the query on cursor and returns cursor too. Usually the query result set is small enough that can be retrieved into the memory with fetch function.

The cursor function divides the file by bytes into 4 blocks, which have approximately the same sizes, and retrieves only one block. By giving up the head line and complementing the tail line, esProc is capable of ensuring that each block is retrieved with complete rows.

According to test, it takes only 24 seconds to perform a query on a file of 3.4G size using the above program. For more details, see Performance Test of esProc File Traversal Algorithm.

Data sorting

Based on the sales.txt file, perform sorting by client code in descending order and by the year and the month in ascending order.
The code:

	A
1	=file(“D:\\sales.txt”).import@t()
2	=A1.sortx(-Client,year(OrderDate),month(OrderDate))

Use “-” to represent sorting in descending order. The year and the month need to be obtained through computation.
Result:

esProc_text_bigtextcode_2

Grouping and aggregation

Calculate the annual sales amount and the annual number of orders for each salesperson.

The code:

	A
1	=file(“D:\\sales.txt”).cursor@t()
2	=A1.groups(SellerId,year(OrderDate);sum(Amount),count(~))

The groups function can perform grouping and aggregation simultaneously. ~ stands for each group or the current group. count(~) is equal to count(OrderID).

Result:

esProc_text_bigtextcode_3

This example assumes that the source file is large but the aggregate result set is relatively small. If the result set is too big to be held by the available memory, you can use groupx function to perform the group and aggregate operations and return result as a cursor, using the following code: A1.groupx(SellerId,year(OrderDate);sum(Amount),count(~))

Data filtering by group

Here’s a large file sales.txt that is ordered by SellerId. Suppose you want to find out the records of the salespeople whose sales amounts are greater than 10000.

The code:

	A	B	C
1	=file(“d:\\sales.txt”).cursor@t()
2	for A1 ;SellerId
3		if A2.sum(Amount)>10000	=file(“e:\\result.txt”).export@a(A2)

Run a loop to retrieve a set of data with the same SellerId from A1’s cursor each time and perform data handling to calculate the sales amount for each salesperson (SellerId) and write the corresponding record into the file result.txt if the calculation result is greater than 10000.

The statement for cs;x means fetching a set of records with the same x field from cursor cs at a time. This requires that data be already sorted by x. In this example the file has been sorted. But if it hasn’t, it needs to be first sorted with sortx function.

The result.txt file stores the result. Here’s a selection of it:

2	UFS	13	1863.4	7/5/2009
24	KT	13	48	8/1/2010
27	HP	13	538.6	8/6/2010
45	TAS	13	1296	8/28/2010
50	CHOP	13	1420	9/4/2009
64	HP	13	20000	1/2/2009
74	DY	13	144	10/3/2009
78	JKD	13	1497	10/9/2009
111	GC	13	452	11/20/2009
7	OLF	16	625.2	7/11/2009
35	OLF	16	155.4	8/15/2010
60	PWQ	16	3430	1/5/2009
63	SJCH	16	5880	1/2/2009

Parallel grouping and aggregation

Based on the large file sales.txt, group records of the year 2013 by client and sellerid and calculate the number of orders and sales amount for each group.

The code:

	A	B
1	4
2	fork to(A1)	=file(“/files/data/sales.txt”).cursor@t(;A2:A1)
3		=B2.select(year(date(ORDERDATE))==2013)
4		=B3.groups(CLIENT:client,SELLERID:sellerid;count(ORDERID):corderid,sum(AMOUNT):samount)
5		result B4
6	=A2.conj()
7	=A6.groups(client,sellerid;sum(corderid):corderid,sum(samount):samount)

Explanation:

Set the number of parallel threads as 4. Each thread retrieves one segment of the large file to filter out the records of the year 2013, perform grouping and aggregation and return result to the main program, which concatenates all the returned result sets and performs a second grouping and aggregation.

Parallel file comparison

The f1.txt file is large, but f2.txt can be all imported into the memory. You need to find out the common rows of the two files. Below are the source files:

f1.txt:

Name	Dept
Alexis	Sales
Ashley	R&D
Matthew	Sales
Megan	Marketing
Rachel	Sales

f2.txt:

Name	Dept
Alexis	Sales
Ashley	R&D
Emily	HR
Matthew	Sales
Megan	Marketing

The code:

	A	B
1	=4.(file(“d:\\f1.txt”).cursor@ts(;~:4))	=file(“d:\\f2.txt”).import@ts().primary@i(#1)
2	=A1.(~.select(B1.find(~.#1)))
3	=A2.conj@xm()
4	=file(“e:\\result.txt”).export(A3)

Split the first file into 4 segments and handle the 4 generated cursors concurrently. The returned cursor results will be combined later. The conj function combines the returned result sets; @x option means concatenating the cursors and @m option means parallel processing.

According to tests performed on a large file of 2.77G size and a small file of 39.93M size under equal hardware environments, it takes an average of 85 seconds to complete the comparison task with serial computation but only an average of 47 seconds to get it done with parallel computation. The performance is nearly doubled.

Parallel data matching

The file1.txt file stores a great number of strings, from which you need to find lines of data ending with “.txt” and write them into result.txt. Below is a selection of the source file:

C:\.rnd

C:\alipay

C:\bdkv_install.log

C:\file21.txt

C:\Users

C:\Windows

C:\$360Section\360.03BC8E12E084F08DD6215122CE760C71.q3q

C:\$360Section\360.224DF209E0B9F34F4648BE555641F1D7.q3q

The code:

	A
1	=4
2	=A1.(file(“e:\\file1.txt”).cursor(;~:A1))
3	=A2.(~.select(like@c(_1,”*.txt”)))
4	=A3.conj@xm()
5	=file(“e:\\result.txt”).export(A4)

Issue 4 parallel threads that each access a part of the file with cursor and search each cursor in A1 for lines ending with “.txt”. The select function performs the query and the like function finds matching strings. @c option indicates that the matching is case-insensitive, and _1 stands for the first field.

Apart from the like function, you can also use a regular expression to perform the string matching. That is, A2.(~.regex@c(“^c:\\\\windows.*(?<!\\\\(.txt)$)”)).

Getting distinct values in parallel

Obtain a client list according to sales.txt.

The code:

	A
1	=file(“D: \\sales.txt”).cursor@mt(Client)
2	result A1.(Client).groups(~)

Open the file in the form of parallel cursors and retrieve only the Client field which will be grouped in the next step. @m means parallel processing. When performing group operation on grouping fields, the groups function will return a set of values of the grouping field.

Result:

esProc_text_bigtextcode_4

Extended explanation: This example assumes that the result set is relatively small. If the result set is sufficiently large, you can use groupx function to export the cursor data to another file, as shown below:

	A
1	=file(“D: \\sales.txt”).cursor@mt(Client)
2	=file(“D:\\res.txt”).export@t(A1.(Client).groupx(~))

Order-related set operations

Two large files f1.txt and f2.txt are already sorted by Name and Dept. You need to find out the intersection of the two files. Below are the source files:

f1.txt:

Name	Dept
Alexis	Sales
Ashley	R&D
Matthew	Sales
Megan	Marketing
Rachel	Sales

f2.txt:

Name	Dept
Alexis	Sales
Ashley	R&D
Emily	HR
Matthew	Sales
Megan	Marketing

If files are already sorted, use merge algorithm to achieve the result that set operations can get and a better performance. Here’s the code:

	A	B
1	=file(“D:\\f1.txt”).cursor@t()	=file(“D:\\f2.txt”). cursor @t()
2	=[B1,A1].merge@xi(Name,Dept)
3	=file(“D:\\result.txt”).export@t(A2)

The merge function combines data together, in which @x means cursors are being handled, and @i is used for getting intersection, @u for union and @d for difference.

Result:

esProc_text_bigtextcode_5

Related computing

emp.txt is a tab-separated text file whose EId field corresponds to sales.txt’s SellerId field. You need to add and align emp.txt’s Name, Dept and Gender fields to sales.txt.

emp.txt:

EId	State	Dept	Name	Gender	Salary	Birthday
2	New York	Finance	Ashley	F	11000	1980/07/19
3	New Mexico	Sales	Rachel	F	9000	1970/12/17
4	Texas	HR	Emily	F	7000	1985/03/07
5	Texas	R&D	Ashley	F	16000	1975/05/13
6	California	Sales	Matthew	M	11000	1984/07/07
7	Illinois	Sales	Alexis	F	9000	1972/08/16
8	California	Marketing	Megan	F	11000	1979/04/19

The code:

	A
1	=sOrder=file(“D:\\sales.txt”).cursor@t()
2	=emp=file(“D:\\emp.txt”).cursor@t(EId,Name,Dept,Gender)
3	=join@x1(sOrder.sortx(SellerId):s,SellerId;emp.sortx(EId):e,EId)
4	=A3.new(s.OrderID, s.Client, s.SellerId, s.Amount, s.OrderDate,e.Name, e.Dept, e.Gender)

The join function performs join operations and changes the table names to s and e. It performs inner join by default but uses @1 to do a left join and @f to get a full join. It can deal with cursors by adding @x option. Since joining cursors requires that data be sorted, here sortx function is used to perform a sort. After the join is done, get the desired fields from the joining result to form a new two-dimensional structured table. Here’s the result:

esProc_text_bigtextcode_6

Multilevel association

There are 5 source files, in which Orders is a large fact table and Customer, Product, Region and Supplier are small dimension tables. You need to find out the orders whose clients and suppliers are in the same region, group them by city and calculate the number of orders and order amount in each city.

Their relationships are as follows:

esProc_text_bigtextcode_7

The code:

	A	B	C
1	=file(“D:/files/customer.txt”).import@t()
2	=file(“D:/files/product. txt”).import@t()
3	=file(“D:/files/supplier. txt”).import@t()
4	=file(“D:/files/region. txt”).import@t()
5	=A1.switch(city,A4:city)	=A3.switch(city,A4:city)	=A2.switch(sid,A3:sid)
6	4
7	fork to(A6)	=file(“D:/files/orders.txt”).cursor@t(;A7:A6)
8		=B7.switch(pid,A2:pid)
9		=B8.switch(cid,A1:cid)
10		=B9.select(pid.sid.city.region==cid.city.region)
11		=B10.groups(cid.cid:cid;count(oid):count,sum(price*quantity):amount)
12	=A7.conj()
13	=A12.groups(cid;sum(count):count,sum(amount):amount)

Import every file and establish the relationships between the fact table and each of the dimension table; launch 4 parallel threads to filter out the desired orders according to the relationships and group and aggregate them; and finally perform a second grouping and aggregation on the concatenated result sets. The switch function establishes a foreign key relationship, and the conj function combines the result sets returned from the threads.

Comparison between a large file and a small file

Here are a large file f1.txt and another file f2.txt that can be entirely imported into the memory. You need to find out the common rows of both files.

The code:

	A	B
1	=file(“d:\\f1.txt”).cursor@ts()	=file(“d:\\f2.txt”).import@ts()
2		>B1.primary@i(#1)
3	=A1.select(B1.find(~.#1))
4	=file(“E:\\result.txt”).export(A3)

You can import the small file into the memory, create a hash table and perform the set operation. This can increase the efficiency greatly. The primary function is used to set a primary key, with @i used to create the hash index. The find function queries the common data of cursor A1 and cursor B1, which is equal to getting the intersection. Here’s the result:

esProc_text_bigtextcode_8

Extended explanation:

1.To get the difference, A3’s code can be =A1.select(!B1.find(~._1)) .

2.To get the union of file1 and file2, you can first find out the difference and then combine it with file 2.

	A	B
1	=file(“d:\\f1.txt”).cursor@ts()	=file(“d:\\f2.txt”).import@ts()
2		>B1.primary@i(#1)
3	=A1.select(!B1.find(~.#1))
4	=file(“E:\\result.txt”).export([A3,B1.cursor()].conj@x())

Querying large files

Parallel query processing

Data sorting

Grouping and aggregation

Data filtering by group

Parallel grouping and aggregation

Parallel file comparison

Parallel data matching

Getting distinct values in parallel

Order-related set operations

Related computing

Multilevel association

Comparison between a large file and a small file

Hi，You need to fill in the Username and Email！