esProc Handles Duplicated Records in CSV files

Problem source：http://stackoverflow.com/questions/28976554/removing-duplicates-from-csv-and-rearranging-the-data-into-columns-with-java .

JAVA doesn’t have the class library for grouping data from a text file or getting distinct values, which results rather complicated coding. In dealing with the duplicated records in a CSV file, esProc can work with JAVA to make it easier. The following example will tell you how esProc works.

dup.csv contains 8 columns, in which there are some duplicated values, as shown below:

21458952,a1234,Doe,John,technology,support staff,work phone,555-555-5555

21458952,a1234,Doe,John,technology,support staff,work email,johndoe@whatever.net

21458952,a1234,Doe,John,technology,support staff,work pager,555-555-5555

99946133,b9854,Paul,Jane,technology,administration,work phone,444-444-4444

99946133,b9854,Paul,Jane,technology,administration,work email,janepaul@whatever.net

99946133,b9854,Paul,Jane,technology,administration,work pager,444-444-4444

99946133,b9854,Paul,Jane,technology,administration,cell phone,444-444-4444

We need to filter away the duplicated records, get the first 6 columns and rearrange the 7^th and the 8^th column, according to the rule that work phone will be made the 7^th column and work email the 8^th column for the new file, and if there is more than one work phone or work email for the same person, the first one will be used.

esProc approach：

	A
1	=file(“D:\\dup.csv”).import(;”,”)
2	=A1.group(_1,_2,_3,_4,_5,_6;~.select@1(_7==”work phone”)._8,~.select@1(_7==”work email”)._8)
3	=file(“D:\\result.csv”).export(A2;”,”)

A1: Import the file separated by commas.

A2: Filter records to remove the duplicated ones and rearrange them. group function is used to group them. ~ represents each group of records; _1,_2…_8 are default column names; @1 indicates getting the first record from the query result. The result is as follows:

A3: Export the result to a new CSV file. Or we can use exportxls function to export data as the Excel format.

esProc_text_duplicate_csv

Having done all data processing, the esProc script will then be integrated with JAVA program via JDBC using the following JAVA code:

// establish a connection via esProc JDBC

Class.forName(“com.esproc.jdbc.InternalDriver”);

con= DriverManager.getConnection(“jdbc:esproc:local://”);

// call esProc script, whose name is test and which receives parameters

st =(com.esproc.jdbc.InternalCStatement)con.prepareCall(“call test()”);

st.execute();//execute esProc stored procedure

ResultSet set = st.getResultSet();//get the result

Hi，You need to fill in the Username and Email！