Performing File Comparisons in Java: Cases and Solutions

It’s hard to develop code for performing file comparisons – including finding common values or modified records, comparison of big files or multiple fields or files with different structures, and other scenarios, because generally they involve set operations, structured-data handling and multithreaded parallel programming.

You can handle simple file comparisons with the console command, Java, Python and Perl. But all of them are not good at performing set operations and structured computations, and will produce complicated code for multi-threaded processing. This makes the process of comparing multiple fields, big files and the files with different structures cumbersome.

As esProc supports set operations, cursor operations and structured computations, and has simpler code for expressing the multithreaded parallel algorithm, it’s convenient to use it to realize the algorithms for file comparison. It is easy to use and supports independent operation, console execution, invocation by reporting tools and Java applications. More details can be found in How esProc Implements Text Processing.

The following explains how esProc handles file comparisons through examples.

Find common field values of two files

Both the f1.txt file and the f2.txt file have Name and Dept fields and use the first row as the column names. You need to find the common Name values of the two files. Below are selections from the files:

f1.txt:

Name Dept

Rachel Sales

Ashley R&D

Matthew Sales

Alexis Sales

Megan Marketing

f2.txt:

Name Dept

Emily HR

Ashley R&D

Matthew Sales

Alexis Sales

Megan Marketing

esProc script:

	A	B
1	=file(“E:\\f1.txt”).import@t()	=file(“E:\\f2.txt”).import@t()
2	=[A1.(Name),B1.(Name)].isect()

Import the files into the memory using import function and perform the intersection operation using isect function. Both files use the default separator tab. @t imports the first row as column names, enabling the direct use of Name and Dept to reference the corresponding fields in the subsequent computations. If the first row is not column names, use the default column names, such as _1 and _2, to reference fields.

The isect function is equal to the intersection operator “^”, so the code can be written as A1.(Name) ^ B1.(Name)

Here’s the result:

esProc_text_comparison_4

Tips:

1. You can use an intersection to find common field values. Similarly you can use a union, difference and concatenation to get the corresponding results.

To get all the names in f1.txt and f2.txt (during which duplicate values will be combined into one), use union function or the union operator “&”. Here’s the result:

esProc_text_comparison_5

To get the different field values of f1.txt and f2.txt, use diff function or the difference operator “\”. Here’s the result:

esProc_text_comparison_6

To concatenate the values of Name fields from f1.txt and f2.txt and retain the duplicate values, you can use conj function or the concatenation operator “|”. Here’s the result:

esProc_text_comparison_7

2. To compare complete records, you can use @ts options to import each file as a set of strings, as shown below:

	A	B
1	=file(“E:\\f1.txt”).import@ts()	=file(“E:\\f2.txt”).import@ts()
2	=[A1,B1].isect()

Find the modified records in the CSV file

old.csv is the original file, and new.csv is the file resulted from adding, deleting and modifying the data of old.csv. Both files have the same logical primary key – username and date. Now you need to find the records in old.csv which have been modified.

Below is a selection of the data:

Old.csv

New.csv

userName,date,saleValue,saleCount

Rachel,2015-03-01,4500,9

Rachel,2015-03-03,8700,4

Tom,2015-03-02,3000,8

Tom,2015-03-03,5000,7

Tom,2015-03-04,6000,12

John,2015-03-02,4000,3

John,2015-03-02,4300,9

John,2015-03-04,4800,4

userName,date,saleValue,saleCount

Rachel,2015-03-01,4500,9

Rachel,2015-03-02,5000,5

Ashley,2015-03-01,6000,5

Rachel,2015-03-03,11700,4

Tom,2015-03-03,5000,7

Tom,2015-03-04,6000,12

John,2015-03-02,4000,3

John,2015-03-02,4300,9

John,2015-03-04,4800,4

You can see that in new.csv the third and the fourth rows are the newly added and the fifth row is the modified. The third row in old.csv is deleted.

esProc script:

	A	B
1	=file(“d:\\old.csv”).import@t(;”,”)	=file(“d:\\new.csv”).import@t(;”,”)
2	=A1.sort(userName,date)	=B1.sort(userName,date)
3	=new=[B2,A2].merge@d(userName,date)
4	=diff=[B2,A2].merge@d(userName,date,saleValue,saleCount)
5	=update=[diff,new].merge@d(userName,date)

Import each file with the comma as the separator and sort data by the primary key. After that, get the new records first (because they have different primary keys) and store them in the variable new; then find the different records (which may have different values in any field) and store them in the variable diff; finally find the difference between diff and new, that is, the modified records (which have the same primary key but different other fields). Here’s the result:

esProc_text_comparison_11

Tips: you can find the deleted records with the code [A2,B2].merge@d(userName,date)

Find common rows of two big files

Both file1.txt and file2.txt store a great deal of strings. You need to find their common rows (intersection). Below are selections of data:

file1.txt

file2.txt

C:\Windows\System32\0409C:\Windows\System32\9999

C:\Windows\System32\2hps.ico

C:\Windows\System32\aaclient.dll

……

C:\Windows\System32\zh-TW\msimsg.dll.mui

C:\Windows\System32\zh-TW\msprivs.dll.mui

C:\Windows\System32\zh-TW\WMPhoto.dll.mui

C:\Windows\System32\zh-TW\WMPhoto.dll1.mui

C:\Windows\System32\0409C:\Windows\System32\2052

C:\Windows\System32\2hps.ico

C:\Windows\System32\aaclient.dll

……

C:\Windows\System32\zh-TW\msimsg.dll.mui

C:\Windows\System32\zh-TW\msprivs.dll.mui

C:\Windows\System32\zh-TW\WMPhoto.dll.mui

C:\Windows\System32\zh-TW\WMPhoto.dll2.mui

esProc script:

	A	B
1	=file(“E:\\file1.txt”).cursor()	=file(“E:\\file2.txt”).cursor()
2	=[A1.sortx(_1),B1.sortx(_1)].merge@xi(_1)
3	=file(“E:\\result.txt”).export(A2)

The cursor function opens each file in the form of cursor, without importing data all at once. By default tab is used as the column separator in importing all fields, which will be named _1、_2、_3…_n automatically. The sortx function can sort a cursor; the merge function performs a merge on the ordered data; @x means merging the cursors and @i specifies that the merge result be the intersection. The export function can write cursor data into a file, as shown below:

C:\Windows\System32\0409C:\Windows\System32\2052\VSJitDebuggerUI.dllC:\Windows\System32\2hps.icoC:\Windows\System32\ACCTRES.dll

C:\Windows\System32\ARP.EXE

…

Tips: The merge function uses @i to get the intersection; uses @u to get the union; and uses @d to get the difference. By default it performs the concatenation, which combines the two files according to the sequence of the strings and allows duplicates.

Compare a big file and a small file

file1.txt contains big data, but file2.txt can be loaded into the memory. You need to find the common rows of the two files.

esProc script:

	A	B
1	=file(“e:\\file1.txt”).cursor()	=file(“e:\\file2.txt”).import()
2		>B1.primary(_1).index()
3	=A1.select(B1.find(~._1))
4	=file(“E:\\result.txt”).export(A3)

Import the smaller file into the memory and create a hash index to increase the efficiency in performing the set operations. The primary function can be used to set the primary key and index can be used to create the hash index. The find function finds the common data of cursor A1 and cursor B1, i.e. the intersection.

Tips:
1.To find the difference, A3’s code can be written as =A1.select(!B1.find(~._1))

2. To get the union, you can first find the difference between file1 and file2 and then union the result with file2.

	A	B
1	=file(“e:\\file1.txt”).cursor()	=file(“e:\\file2.txt”).import()
2		>B1.primary(_1).index()
3	=A1.select(!B1.find(~._1))
4	=file(“E:\\result.txt”).export([A3,B1.cursor()].conj@x())

Improving performance of file comparisons using parallel computing

All the above examples use serial computation, yet performance can be increased by using parallel computation. You can use several threads to retrieve a file. Each thread accesses a part of the file with the cursor and performs the set operation. Finally the results of processing the cursors will be merged.

Let’s run a test with a 2.77G big file and a 39.93 small file under the same hardware environment. It takes an average of 85 seconds to complete the task using serial computation, and an average of 47 seconds using the parallel computation. The speed has almost doubled. Since the set operation is not complicated, most of the performance improvement happens during the file retrieval from the hard disk. In fact the more complicated the operation is, the more greatly the performance improves.

esProc script for parallel processing:

	A	B
1	=4.(file(“e:\\file1.txt”).cursor@z(;, ~:4))	=file(“e:\\file2.txt”).import().primary(_1).index()
2	=A1.(~.select(B1.find(~._1)))
3	=A2.conj@xm()
4	=file(“e:\\result.txt”).export(A3)

Divide a file into 4 segments and generate 4 cursors to perform the computations, and finally merge the resulting cursors.

It would be good not to divide the file into too many segments. Generally the number is equal to the number of threads the system defines; otherwise some segments will have to wait in queue to be processed. The number of threads can be set in the configuration files. The cursor function uses @z option to divide a file into segments and retrieves a segment using the cursor. A file will be divided roughly because being divided exactly will result in broken rows. To get each row in its completeness, esProc will automatically skip the head row and make up the tail row.

Comparing files with different structures

The Data.txt, a text file separated with tabs, has 6 fields, in which here field contains strings which are delimited by semicolons each. list is a one-column file. Compare the two files, and if a subset of the split here field in a record matches any row of the List.txt, then export it to result.txt.

Below is the source data:

Data.txt

List.txt

field1	field2	field3	here	field5	etc
A	B	2	Gee;Whiz;Hello	13	12
A	B	2	Gee;Whizz;Hi	56	32
E		4	Btm;Lol	16	2
T		3	Whizz	13	3

Gee

Whiz

Lol

esProc script:

	A
1	=file(“d:\\Data.txt”).import@t()
2	>=file(“d:\\List.txt”).read@n()
3	=A1.select(here.array(“;”)^A2!=[])
4	=file(“d:\\result.txt”).export@t(A3)

In the above script, the select function query the desired data; the array function splits a string; “^” means getting the intersection; and “[]” represents a null set.

Here’s the result:

field1	field2	field3	here	field5	etc
A	B	2	Gee;Whiz;Hello	13	12
A	B	2	Gee;Whizz;Hi	56	32
E		4	Btm;Lol	16	2

Find strings that aren’t included in a HTML file

Find which strings of the list tagList aren’t included in the a.html file. TagList=”Marketing”,”sales”,”human resource”,”Finance”,”R&D”,”commerce”, and a.html is as follows:

table.example2 {background-color:transparent;border-collapse:collapse;width:100%;}

table.example2 th, table.example2 td {text-align:center;border:1px solid black;padding:5px;}

table.example2 th {background-color:AntiqueWhite;}

table.example2 td:first-child {width:20%;}

</style>

<tr><th>DeptID</th><th>DeptName</th></tr>

<tr><td>D0121</td><td>sales</td></tr>

<tr><td>D0123</td><td>human resource</td></tr>

</table>

esProc script:

	A
1	=tagLIst=[“Marketing”,”sales”,”human resource”,”Finance”,”R&D”,”commerce”]
2	=file(“e:\\afile.html”).read()
3	=A1.select(pos(A2,~))
4	=A1\A3

Use read function to load in the file as a string; and then use select function to perform a conditional query. The condition is that if A2 (the large string) contains “~” (A1’s current member); finally calculate the difference. Here’s the result:

esProc_text_comparison_23

Hi，You need to fill in the Username and Email！