Performance Test of Text File Processing with esProc


This article tests esProc's performance in processing text files. The test case is a simple data query and filter, and the same task is implemented in Java and Perl for comparison.

The test data is orders.txt, a file of order records. Imported into esProc, it looks like this:

ORDERID  CLIENT  SELLERID  AMOUNT  ORDERDATE   NOTE
1        287     47        5825    2013-05-31  gafcaghafdgie f ci…
2        89      22        8681    2013-05-04  gafcaghafdgie f ci…
3        47      67        7702    2009-11-22  gafcaghafdgie f ci…
4        76      85        8717    2011-12-13  gafcaghafdgie f ci…
5        307     81        8003    2008-06-01  gafcaghafdgie f ci…
6        366     39        6948    2009-09-25  gafcaghafdgie f ci…
7        295     8         1419    2013-11-11  gafcaghafdgie f ci…
8        496     35        6018    2011-02-18  gafcaghafdgie f ci…
9        273     37        9255    2011-05-04  gafcaghafdgie f ci…
10       212     0         2155    2009-03-22  gafcaghafdgie f ci…

NOTE is a string field that only pads each record to a greater length; it has no practical meaning.

Query and filter criteria: CLIENT is 191 and ORDERDATE falls between 2013-09-01 and 2013-11-01. The code below uses strict comparisons, so both boundary dates are excluded.

Data size: 28 GB

Hardware configuration of the test machine (an ordinary PC):

CPU: Intel Core i5-3450, 4 cores, 4 threads

Memory: 16 GB

Storage: SSD

 

esProc script select.dfx for data filtering:

    A            B                                  C
1   4            =date("2013-09-01")                =date("2013-11-01")
2   fork to(A1)  =file("/ssd/data/orders.txt").cursor@tz(orderid:string,client:string,sellerid:string,amount:float,orderdate:date,note:string;,A2:A1)
3                =B2.select(client=="191" && orderdate>B1 && orderdate<C1).fetch()
4                result B3
5   =A2.conj()

Java program for data filtering:

package files;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigFilter {

    public static void myBigFilter(Date start, Date end, String client) throws Exception {
        String path = "/ssd/data/";
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        File file = new File(path + "orders.txt");
        long len = 0;
        List<Map<String, Object>> resultList = new ArrayList<Map<String, Object>>();

        // try-with-resources closes the reader even if parsing fails
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(file)))) {
            String line;
            // Skip the header line
            if (br.readLine() == null) return;

            while ((line = br.readLine()) != null) {
                len++;

                // Split the fields with indexOf()/substring() rather than String.split()
                int index = line.indexOf("\t");
                String orderid1 = line.substring(0, index);
                line = line.substring(index + 1);

                index = line.indexOf("\t");
                String client1 = line.substring(0, index);
                line = line.substring(index + 1);

                index = line.indexOf("\t");
                String sellerid1 = line.substring(0, index);
                line = line.substring(index + 1);

                index = line.indexOf("\t");
                float amount1 = Float.parseFloat(line.substring(0, index));
                line = line.substring(index + 1);

                index = line.indexOf("\t");
                Date orderdate1 = sdf.parse(line.substring(0, index));
                line = line.substring(index + 1);

                String note1 = line;

                // Keep the record only if it matches the client and lies strictly inside the date range
                if (client1.equals(client) && orderdate1.after(start) && orderdate1.before(end)) {
                    Map<String, Object> emp = new HashMap<String, Object>();
                    emp.put("orderid", orderid1);
                    emp.put("client", client1);
                    emp.put("sellerid", sellerid1);
                    emp.put("amount", amount1);
                    emp.put("orderdate", orderdate1);
                    emp.put("note", note1);
                    resultList.add(emp);
                }
            }
        }
        System.out.println("len=" + len);
    }

    public static void main(String[] args) throws Exception {
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date begin = new Date();
        System.out.println("begin:" + df.format(begin));

        Date start = new SimpleDateFormat("yyyy-MM-dd").parse("2013-09-01");
        Date end = new SimpleDateFormat("yyyy-MM-dd").parse("2013-11-01");
        String client = "191";

        myBigFilter(start, end, client);

        // Elapsed time in seconds
        long diff = new Date().getTime() - begin.getTime();
        System.out.println("::end::time=" + diff / 1000);
    }
}

Perl program for data filtering:

#!/usr/bin/perl
use strict;
use warnings;

my ($sec, $min, $hour, $mday, $mon, $year) = localtime;
$year += 1900;
$mon += 1;
my $begin = time();
printf("begin:%d-%02d-%02d %02d:%02d:%02d\n", $year, $mon, $mday, $hour, $min, $sec);

# Dates are compared as yyyymmdd integers instead of being converted to date values
my $start = 20130901;
my $end = 20131101;
my @inputFileArray;

open(FILE_IN, "/ssd/data/orders.txt") or die "Can't open txt, $!";

# Skip the header line
my $header = <FILE_IN>;

while (defined(my $perIns = <FILE_IN>))
{
    chomp($perIns);
    my @row = split(/\t/, $perIns);
    my ($orderid, $client, $sellerid, $amount) = @row[0 .. 3];

    # Drop any time part, then turn yyyy-mm-dd into the integer yyyymmdd
    my @row1 = split(" ", $row[4]);
    my @row2 = split(/-/, $row1[0]);
    my $orderdate = $row2[0] * 10000 + $row2[1] * 100 + $row2[2];

    if ($client == 191 && $orderdate > $start && $orderdate < $end)
    {
        push(@inputFileArray, [@row]);
    }
}
close(FILE_IN);

my $t = time() - $begin;
print "time:$t\n";

 

Test result:

                esProc (single thread)  Java         Perl         esProc (4 parallel threads)
Execution time  534 seconds             394 seconds  604 seconds  159 seconds
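To make the table easier to read, the figures can be turned into ratios. The following is only a small arithmetic sketch: the class and method names are made up for illustration, and the four constants are the execution times from the table above. It works out to roughly 1.36 for Java against single-threaded esProc, 1.13 for single-threaded esProc against Perl, and 3.36 for 4-thread esProc against single-threaded esProc.

```java
import java.util.Locale;

public class SpeedupRatios {
    // Execution times in seconds, copied from the test result table.
    static final double ESPROC_SINGLE = 534;
    static final double JAVA = 394;
    static final double PERL = 604;
    static final double ESPROC_PARALLEL = 159;

    // How many times faster the quicker run is than the slower one.
    static double ratio(double slowSeconds, double fastSeconds) {
        return slowSeconds / fastSeconds;
    }

    public static void main(String[] args) {
        System.out.printf(Locale.ROOT, "Java vs esProc (single thread): %.2f%n",
                ratio(ESPROC_SINGLE, JAVA));
        System.out.printf(Locale.ROOT, "esProc (single thread) vs Perl: %.2f%n",
                ratio(PERL, ESPROC_SINGLE));
        System.out.printf(Locale.ROOT, "esProc 4 threads vs 1 thread:   %.2f%n",
                ratio(ESPROC_SINGLE, ESPROC_PARALLEL));
    }
}
```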

 

Conclusion:

esProc is an interpreted language that is itself implemented in Java, yet in single-threaded mode it loses only a modest amount of performance against hand-written Java on a task like data filtering: hard-coded Java is less than twice as fast (394 seconds versus 534 seconds). esProc and Perl are both interpreted languages, and the Java-based esProc outperforms the C-based Perl (534 seconds versus 604 seconds).

With multithreaded processing, esProc's performance improves significantly (159 seconds with 4 parallel threads), and the esProc code stays simple. Java's performance also improves considerably with multiple threads, but the code becomes complex. Perl offers no advantage here: its multithreading code is complex and its single-threaded performance is mediocre.
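To give a feel for what "complex code" means on the Java side, here is a minimal sketch of a segmented, multithreaded filter. It is not the code used in the test, and the class name ParallelFilter and its methods are hypothetical. It assumes tab-delimited rows ending in '\n', a single header line, and yyyy-MM-dd dates, which compare correctly as plain strings. The file is cut into byte ranges, and each worker handles the lines that begin inside its range, in the same spirit as the A2:A1 segment parameter in the esProc script. String.split() is used here only for brevity.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFilter {

    // Reads one '\n'-terminated line and adds the bytes consumed to pos[0]; returns null at end of file.
    static String readLine(InputStream in, long[] pos) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        boolean sawByte = false;
        int b;
        while ((b = in.read()) != -1) {
            sawByte = true;
            pos[0]++;
            if (b == '\n') break;
            buf.write(b);
        }
        return sawByte ? buf.toString("UTF-8") : null;
    }

    // Filters the lines whose first byte lies in [start, end).
    static List<String> filterSegment(Path file, long start, long end,
                                      String client, String from, String to) throws IOException {
        List<String> hits = new ArrayList<>();
        try (FileInputStream fis = new FileInputStream(file.toFile())) {
            long[] pos = { Math.max(0L, start - 1) };
            fis.getChannel().position(pos[0]);
            InputStream in = new BufferedInputStream(fis, 1 << 16);
            // Segment 0 discards the header here; every other segment discards the
            // line containing byte start-1, which belongs to the previous segment.
            readLine(in, pos);
            String line;
            while (pos[0] < end && (line = readLine(in, pos)) != null) {
                String[] f = line.split("\t", 6);
                if (f.length < 6) continue; // skip blank or malformed lines
                if (f[1].equals(client) && f[4].compareTo(from) > 0 && f[4].compareTo(to) < 0) {
                    hits.add(line);
                }
            }
        }
        return hits;
    }

    // Splits the file into one byte range per thread and concatenates the results in file order.
    static List<String> filter(Path file, int threads, String client,
                               String from, String to) throws Exception {
        long size = Files.size(file);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> parts = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                long start = size * i / threads;
                long end = size * (i + 1) / threads;
                parts.add(pool.submit(() -> filterSegment(file, start, end, client, from, to)));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> part : parts) {
                all.addAll(part.get());
            }
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: java ParallelFilter <orders.txt>");
            return;
        }
        List<String> rows = filter(Paths.get(args[0]), 4, "191", "2013-09-01", "2013-11-01");
        System.out.println("matched rows: " + rows.size());
    }
}
```

Even in this reduced form, the segment boundary handling and thread pool bookkeeping take up most of the listing, while the esProc script expresses the same idea with fork and a segment parameter.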

On the whole, esProc comes out best, combining good performance with simple code.

In addition, the test revealed that the Perl functions for converting strings to date/time/datetime values (the Date::Calc package's Mktime and the Time::Piece package's timegm) perform extremely poorly. With them, Perl takes 103 seconds to filter a 1 GB text file, compared with 18 seconds for esProc and 7 seconds for Java. For this reason, the Perl code above compares dates numerically instead of calling conversion functions, which brings Perl's time for the 1 GB file down to 20 seconds. This approach cannot handle more complicated date and time operations, however, such as finding the day of the week on which the last day of a given month falls. Even if a more efficient Perl package for date and time handling exists, it would have little impact on the result, so no further testing will be carried out.

Java's String.split() function also performs poorly, so the Java code above uses indexOf() and substring() to split each line.
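The numerical comparison trick can be sketched in Java as well. The class DateKey and its methods are hypothetical names chosen for illustration; they are not part of the tested programs. toKey() turns a yyyy-MM-dd string into the integer yyyyMMdd, which is enough for range checks. lastDayOfMonth() shows the kind of calendar question that such a key cannot answer and that needs a real date type, here java.time.YearMonth.

```java
import java.time.DayOfWeek;
import java.time.YearMonth;

public class DateKey {

    // Converts "yyyy-MM-dd" into the integer yyyyMMdd without building a date object.
    static int toKey(String s) {
        int year = Integer.parseInt(s.substring(0, 4));
        int month = Integer.parseInt(s.substring(5, 7));
        int day = Integer.parseInt(s.substring(8, 10));
        return year * 10000 + month * 100 + day;
    }

    // True when the date lies strictly between the two keys.
    static boolean inRange(String orderDate, int startKey, int endKey) {
        int key = toKey(orderDate);
        return key > startKey && key < endKey;
    }

    // Calendar logic that an integer key cannot express: the weekday of a month's last day.
    static DayOfWeek lastDayOfMonth(int year, int month) {
        return YearMonth.of(year, month).atEndOfMonth().getDayOfWeek();
    }

    public static void main(String[] args) {
        System.out.println(toKey("2013-10-15"));                        // prints 20131015
        System.out.println(inRange("2013-10-15", 20130901, 20131101));  // prints true
        System.out.println(inRange("2013-11-01", 20130901, 20131101));  // prints false
        System.out.println(lastDayOfMonth(2013, 2));                    // prints THURSDAY
    }
}
```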
