esProc Handles Duplicated Records in CSV files

Uncategorized 900 0

Problem source:http://stackoverflow.com/questions/28976554/removing-duplicates-from-csv-and-rearranging-the-data-into-columns-with-java . 

JAVA doesn’t have the class library for grouping data from a text file or getting distinct values, which results rather complicated coding. In dealing with the duplicated records in a CSV file, esProc can work with JAVA to make it easier. The following example will tell you how esProc works.

dup.csv contains 8 columns, in which there are some duplicated values, as shown below: 

21458952,a1234,Doe,John,technology,support staff,work phone,555-555-5555

21458952,a1234,Doe,John,technology,support staff,work email,johndoe@whatever.net

21458952,a1234,Doe,John,technology,support staff,work pager,555-555-5555

99946133,b9854,Paul,Jane,technology,administration,work phone,444-444-4444

99946133,b9854,Paul,Jane,technology,administration,work email,janepaul@whatever.net

99946133,b9854,Paul,Jane,technology,administration,work pager,444-444-4444

99946133,b9854,Paul,Jane,technology,administration,cell phone,444-444-4444

We need to filter away the duplicated records, get the first 6 columns and rearrange the 7th and the 8th column, according to the rule that work phone will be made the 7th column and work email the 8th column for the new file, and if there is more than one work phone or work email for the same person, the first one will be used.

esProc approach:

  A
1 =file(“D:\\dup.csv”).import(;”,”)
2 =A1.group(_1,_2,_3,_4,_5,_6;~.select@1(_7==”work phone”)._8,~.select@1(_7==”work email”)._8)
3 =file(“D:\\result.csv”).export(A2;”,”)

A1: Import the file separated by commas.

A2: Filter records to remove the duplicated ones and rearrange them. group function is used to group them. ~ represents each group of records; _1,_2…_8 are default column names; @1 indicates getting the first record from the query result. The result is as follows:

 

A3: Export the result to a new CSV file. Or we can use exportxls function to export data as the Excel format.

esProc_text_duplicate_csv

Having done all data processing, the esProc script will then be integrated with JAVA program via JDBC using the following JAVA code:

         // establish a connection via esProc JDBC

         Class.forName(“com.esproc.jdbc.InternalDriver”);

         con= DriverManager.getConnection(“jdbc:esproc:local://”);

         // call esProc script, whose name is test and which receives parameters

         st =(com.esproc.jdbc.InternalCStatement)con.prepareCall(“call test()”);

         st.execute();//execute esProc stored procedure

         ResultSet set = st.getResultSet();//get the result

FAVOR (0)
Leave a Reply
Cancel
Icon

Hi,You need to fill in the Username and Email!

  • Username (*)
  • Email (*)
  • Website