Performance Test of Processing Text Files with esProc


This article tests the performance of esProc in processing text files. It uses a data query and filtering task as the example and compares esProc with Java and Perl programs that do the same processing.

The test data is a set of order records stored in the file orders.txt. The first rows of the imported data look like this:

ORDERID  CLIENT  SELLERID  AMOUNT  ORDERDATE   NOTE
1        287     47        5825    2013-05-31  gafcaghafdgie f ci…
2        89      22        8681    2013-05-04  gafcaghafdgie f ci…
3        47      67        7702    2009-11-22  gafcaghafdgie f ci…
4        76      85        8717    2011-12-13  gafcaghafdgie f ci…
5        307     81        8003    2008-06-01  gafcaghafdgie f ci…
6        366     39        6948    2009-09-25  gafcaghafdgie f ci…
7        295     8         1419    2013-11-11  gafcaghafdgie f ci…
8        496     35        6018    2011-02-18  gafcaghafdgie f ci…
9        273     37        9255    2011-05-04  gafcaghafdgie f ci…
10       212     0         2155    2009-03-22  gafcaghafdgie f ci…

The NOTE field is a string field that only serves to increase the length of each record; it has no practical meaning.

Criteria for data query and filtering: client is 191 and orderdate is between 2013-09-01 and 2013-11-01 (the scripts use strict comparisons, so both end dates are excluded).

Data volume: 28 GB

Hardware configuration of the test machine: ordinary PC

CPU: Intel Core i5-3450, four cores, four threads

Memory capacity: 16 GB

Storage: SSD

esProc script select.dfx for data filtering:

    A               B                                                                    C
1   4               =date("2013-09-01")                                                  =date("2013-11-01")
2   fork to(A1)     =file("/ssd/data/orders.txt").cursor@tz(orderid:string,client:string,sellerid:string,amount:float,orderdate:date,note:string;,A2:A1)
3                   =B2.select(client=="191" && orderdate>B1 && orderdate<C1).fetch()
4                   result B3
5   =A2.conj()
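
The A2:A1 argument in B2 asks the cursor for segment A2 out of A1 segments, so each of the four parallel tasks started by fork reads its own part of the file. As a rough illustration of the idea, and not a description of how esProc does it internally, the Java sketch below divides a file's byte length into near-equal ranges. The names SegmentRanges and range, and the 1000-byte length, are hypothetical. A real reader would also have to move each start forward to the next line break so that no record is cut in two.

```java
public class SegmentRanges {

    /**
     * Returns {startInclusive, endExclusive} in bytes for the 1-based
     * segment k out of n segments of a file with the given length.
     */
    static long[] range(long fileLength, int k, int n) {
        long start = fileLength * (k - 1) / n;
        long end = fileLength * k / n;
        return new long[] {start, end};
    }

    public static void main(String[] args) {
        long fileLength = 1000L; // hypothetical file size in bytes
        int segments = 4;        // same count as A1 in the script
        for (int k = 1; k <= segments; k++) {
            long[] r = range(fileLength, k, segments);
            System.out.println("segment " + k + ": [" + r[0] + ", " + r[1] + ")");
        }
    }
}
```

Because each range ends exactly where the next one begins, every byte of the file belongs to one and only one segment.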

Java program for data filtering:

package files;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BigFilter {

    public static void myBigFilter(Date start, Date end, String client) throws Exception {
        String path = "/ssd/data/";
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
        File file = new File(path + "orders.txt");
        long len = 0;
        List<Map<String, Object>> resultList = new ArrayList<Map<String, Object>>();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
            String line;
            // The first line holds the column names; skip it
            if ((line = br.readLine()) == null) return;
            while ((line = br.readLine()) != null) {
                len++;
                // Cut the fields off one at a time with indexOf() and substring()
                int index = line.indexOf("\t");
                String orderid1 = line.substring(0, index);
                line = line.substring(index + 1);
                index = line.indexOf("\t");
                String client1 = line.substring(0, index);
                line = line.substring(index + 1);
                index = line.indexOf("\t");
                String sellerid1 = line.substring(0, index);
                line = line.substring(index + 1);
                index = line.indexOf("\t");
                float amount1 = Float.parseFloat(line.substring(0, index));
                line = line.substring(index + 1);
                index = line.indexOf("\t");
                Date orderdate1 = sdf.parse(line.substring(0, index));
                line = line.substring(index + 1);
                String note1 = line;
                // Keep the record if the client matches and the date lies strictly inside the range
                if (client1.equals(client) && orderdate1.after(start) && orderdate1.before(end)) {
                    Map<String, Object> emp = new HashMap<String, Object>();
                    emp.put("orderid", orderid1);
                    emp.put("client", client1);
                    emp.put("sellerid", sellerid1);
                    emp.put("amount", amount1);
                    emp.put("orderdate", orderdate1);
                    emp.put("note", note1);
                    resultList.add(emp);
                }
            }
        }
        System.out.println("len=" + len);
    }

    public static void main(String[] args) throws Exception {
        SimpleDateFormat df = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        Date begin = new Date();
        System.out.println("begin:" + df.format(begin));
        Date start = new SimpleDateFormat("yyyy-MM-dd").parse("2013-09-01");
        Date end = new SimpleDateFormat("yyyy-MM-dd").parse("2013-11-01");
        String client = "191";
        myBigFilter(start, end, client);
        // Elapsed time in seconds
        long diff = (new Date()).getTime() - begin.getTime();
        System.out.println("::end::time=" + diff / 1000);
    }
}

        

Perl program for data filtering:

#!/usr/bin/perl
use strict;
use warnings;

my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = localtime;
$year += 1900;
$mon += 1;
my $begin = time();
my $datetime = sprintf("begin:%d-%02d-%02d %02d:%02d:%02d", $year,$mon,$mday,$hour,$min,$sec);
print $datetime."\n";

# Dates are compared as yyyymmdd integers instead of being converted to date values
my $start = 20130901;
my $end   = 20131101;
my @inputFileArray;

open(FILE_IN, "/ssd/data/orders.txt") or die "Can't open txt, $!";
my $header = <FILE_IN>;    # skip the line of column names
while (defined(my $perIns = <FILE_IN>))
{
        my @row = split("\t", $perIns);
        my $orderid  = $row[0];
        my $client   = $row[1];
        my $sellerid = $row[2];
        my $amount   = $row[3];
        # Drop any time part, then turn yyyy-mm-dd into the integer yyyymmdd
        my @row1 = split(" ", $row[4]);
        my @row2 = split("-", $row1[0]);
        my $orderdate = $row2[0]*10000 + $row2[1]*100 + $row2[2];
        if ($client == 191 && $orderdate > $start && $orderdate < $end)
        {
                push(@inputFileArray, [@row]);
        }
}
close(FILE_IN);

my $t = time() - $begin;
print "time:$t"."\n";
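
The Perl program avoids date conversion by turning each yyyy-mm-dd string into the integer yyyymmdd and comparing numbers. For readers more at home in Java, the same trick might look like the sketch below. The class DateKey and its methods are hypothetical names used only for this illustration; they are not part of the tested programs.

```java
public class DateKey {

    /** Converts text starting with "yyyy-MM-dd" into the integer yyyyMMdd. */
    static int toKey(String text) {
        int year = Integer.parseInt(text.substring(0, 4));
        int month = Integer.parseInt(text.substring(5, 7));
        int day = Integer.parseInt(text.substring(8, 10));
        return year * 10000 + month * 100 + day;
    }

    /** True if the date lies strictly between the two keys, as in the Perl test. */
    static boolean inRange(String orderDate, int startExclusive, int endExclusive) {
        int key = toKey(orderDate);
        return key > startExclusive && key < endExclusive;
    }

    public static void main(String[] args) {
        System.out.println(toKey("2013-09-15"));                       // 20130915
        System.out.println(inRange("2013-09-15", 20130901, 20131101)); // true
        System.out.println(inRange("2013-11-01", 20130901, 20131101)); // false
    }
}
```

The integer keeps the chronological order of the dates, which is all a range filter needs, but it carries no calendar knowledge.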

Test result:

                 esProc (single thread)   Java          Perl          esProc (4 threads)
Execution time   534 seconds              394 seconds   604 seconds   159 seconds
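
To make the comparison easier to read, the measured times can be turned into ratios. The short Java sketch below does only that arithmetic on the four figures from the table; the class and method names are hypothetical.

```java
import java.util.Locale;

public class Speedups {

    /** How many times longer the slower run took than the faster one. */
    static double ratio(double slowerSeconds, double fasterSeconds) {
        return slowerSeconds / fasterSeconds;
    }

    public static void main(String[] args) {
        double esprocSingle = 534, java = 394, perl = 604, esprocFour = 159;
        // Locale.ROOT keeps the decimal point independent of system settings
        System.out.println(String.format(Locale.ROOT,
                "esProc (single thread) vs Java: %.2f", ratio(esprocSingle, java)));        // 1.36
        System.out.println(String.format(Locale.ROOT,
                "Perl vs esProc (single thread): %.2f", ratio(perl, esprocSingle)));        // 1.13
        System.out.println(String.format(Locale.ROOT,
                "esProc 1 thread vs 4 threads:   %.2f", ratio(esprocSingle, esprocFour)));  // 3.36
    }
}
```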

Conclusion:

esProc is an interpreted language implemented in Java. Running in a single thread on a computational task like data filtering, it is only moderately slower than hand-written Java: 534 seconds versus 394 seconds, so the hardcoded Java program is well under twice as fast. Of the two interpreted languages, the Java-based esProc outperforms the C-based Perl (534 seconds versus 604 seconds).

With multithreaded processing, esProc's performance improves significantly (159 seconds with 4 threads, compared with 534 seconds with one), and the esProc code stays simple. Java's performance would also improve considerably with multiple threads, but at the cost of more complex code. Perl has no advantage here: its multithreading code is complex and its single-threaded performance is only adequate.
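
To give a sense of the extra code a multithreaded Java version needs, here is a minimal sketch that splits the file into byte ranges and filters each range in its own thread. It was not part of the test and is not tuned for speed: RandomAccessFile.readLine() is unbuffered and treats each byte as one character, which is only acceptable for ASCII data. The class ParallelFilter, its methods, and the sample rows in main are hypothetical. Dates are compared as text, which works because yyyy-MM-dd strings sort chronologically.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFilter {

    /** True if the client matches and from < orderdate < to. */
    static boolean matches(String line, String client, String from, String to) {
        int[] tab = new int[5];
        int pos = -1;
        for (int i = 0; i < 5; i++) {
            pos = line.indexOf('\t', pos + 1);
            if (pos < 0) return false; // fewer than six fields: skip the line
            tab[i] = pos;
        }
        String clientField = line.substring(tab[0] + 1, tab[1]);
        String dateField = line.substring(tab[3] + 1, tab[4]);
        // Ignore a time part, if any, so only yyyy-MM-dd is compared
        String day = dateField.length() > 10 ? dateField.substring(0, 10) : dateField;
        return clientField.equals(client) && day.compareTo(from) > 0 && day.compareTo(to) < 0;
    }

    /** Filters the lines whose first byte lies in [start, end). */
    static List<String> filterSegment(Path file, long start, long end,
                                      String client, String from, String to) throws IOException {
        List<String> hits = new ArrayList<>();
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            // For start == 0 this discards the header line; otherwise it discards
            // the tail of a line that began in the previous segment.
            raf.seek(Math.max(0, start - 1));
            raf.readLine();
            String line;
            while (raf.getFilePointer() < end && (line = raf.readLine()) != null) {
                if (matches(line, client, from, to)) hits.add(line);
            }
        }
        return hits;
    }

    /** Runs one task per segment and concatenates the results in file order. */
    static List<String> filter(Path file, int threads, String client, String from, String to)
            throws Exception {
        long length = Files.size(file);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> parts = new ArrayList<>();
            for (int k = 1; k <= threads; k++) {
                final long start = length * (k - 1) / threads;
                final long end = length * k / threads;
                Callable<List<String>> task =
                        () -> filterSegment(file, start, end, client, from, to);
                parts.add(pool.submit(task));
            }
            List<String> all = new ArrayList<>();
            for (Future<List<String>> part : parts) all.addAll(part.get());
            return all;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Tiny made-up sample so the sketch runs without the 28 GB test file
        Path file = Files.createTempFile("orders", ".txt");
        String sample = "ORDERID\tCLIENT\tSELLERID\tAMOUNT\tORDERDATE\tNOTE\n"
                + "1\t191\t5\t10.0\t2013-09-15\tx\n"
                + "2\t12\t5\t10.0\t2013-09-15\tx\n";
        Files.write(file, sample.getBytes(StandardCharsets.US_ASCII));
        System.out.println(filter(file, 4, "191", "2013-09-01", "2013-11-01").size());
        Files.delete(file);
    }
}
```

Most of the code deals with segment boundaries and thread management rather than with the filter itself, which is the kind of complexity the comparison above refers to.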

On the whole, esProc is the most practical of the three, combining satisfactory performance with sufficiently simple code.

One more point to note: the test showed that the Perl functions for converting strings to date/time values (Mktime in the Date::Calc package and timegm in the Time::Piece package) perform very poorly. With them, Perl takes 103 seconds to filter a 1 GB text file, compared with 18 seconds for esProc and 7 seconds for Java. For this reason, the Perl code above compares dates as numerical values instead of calling the conversion functions, which brings Perl's time for the 1 GB file down to 20 seconds. The drawback is that this approach cannot handle more complicated date and time computations, such as finding the day of the week on which the last day of a given month falls. There may be a more efficient Perl package for handling date/time data, but it would have little impact on the result, so it is not discussed here.

Java's String.split() function also performs poorly, so the Java code above uses indexOf() and substring() to split each line.
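
The indexOf() and substring() technique can also be written as a small reusable helper. The sketch below is one possible form, with hypothetical names (TabSplit, fields); it illustrates the technique only and makes no claim about how much faster it is than String.split().

```java
import java.util.ArrayList;
import java.util.List;

public class TabSplit {

    /** Splits a tab-separated line into its fields, keeping empty fields. */
    static List<String> fields(String line) {
        List<String> out = new ArrayList<>();
        int from = 0;
        int tab;
        // Each pass copies the text between the previous tab and the next one
        while ((tab = line.indexOf('\t', from)) >= 0) {
            out.add(line.substring(from, tab));
            from = tab + 1;
        }
        out.add(line.substring(from)); // whatever follows the last tab
        return out;
    }

    public static void main(String[] args) {
        // Made-up record in the layout of orders.txt
        List<String> row = fields("1\t287\t47\t5825\t2013-05-31\tnote text");
        System.out.println(row.get(1)); // 287
        System.out.println(row.get(4)); // 2013-05-31
    }
}
```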
