esProc’s Hadoop data sources are mainly HDFS files and Hive databases. Accessing a Hive database works much like accessing any other database, so we won’t discuss it here. For HDFS files, esProc has a built-in access method that resembles the one used for files on an ordinary file system. The following example shows how to write such an esProc script.
The HDFS text file employee.txt holds employee data. The task is to import the data, find the female employees born on or after January 1, 1981, and export the result to a file in a compressed format.
The data in employee.txt are as follows:
EID NAME SURNAME GENDER STATE BIRTHDAY HIREDATE DEPT SALARY
1 Rebecca Moore F California 1974-11-20 2005-03-11 R&D 7000
2 Ashley Wilson F New York 1980-07-19 2008-03-16 Finance 11000
3 Rachel Johnson F New Mexico 1970-12-17 2010-12-01 Sales 9000
4 Emily Smith F Texas 1985-03-07 2006-08-15 HR 7000
5 Ashley Smith F Texas 1975-05-13 2004-07-30 R&D 16000
6 Matthew Johnson M California 1984-07-07 2005-07-07 Sales 11000
7 Alexis Smith F Illinois 1972-08-16 2002-08-16 Sales 9000
8 Megan Wilson F California 1979-04-19 1984-04-19 Marketing 11000
9 Victoria Davis F Texas 1983-12-07 2009-12-07 HR 3000
10 Ryan Johnson M Pennsylvania 1976-03-12 2006-03-12 R&D 13000
11 Jacob Moore M Texas 1974-12-16 2004-12-16 Sales 12000
12 Jessica Davis F New York 1980-09-11 2008-09-11 Sales 7000
13 Daniel Davis M Florida 1982-05-14 2010-05-14 Finance 10000
…
To develop and debug a program in esProc’s Integrated Development Environment (IDE), first copy Hadoop’s core and configuration packages, such as commons-configuration-1.6.jar, commons-lang-2.4.jar, and hadoop-core-1.2.1.jar (for Hadoop 1.2.1), into the “esProc installation directory\esProc\lib” directory.
esProc code for processing the file:
 | A |
1 | =hdfsfile("hdfs://192.168.1.11:9000/user/employee.txt","UTF-8") |
2 | =A1.cursor@t() |
3 | =A2.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F") |
4 | =hdfsfile("hdfs://192.168.1.11:9000/user/emp-result.gz").export@t(A3) |
5 | =hdfsfile("hdfs://192.168.1.11:9000/user/emp-result.gz").import@t() |
A1: Define an HDFS file object. “UTF-8” is the charset used to encode the file; if it is omitted, the JVM’s default charset is used. An HDFS file object is defined in much the same way as a file object on an ordinary file system, and subsequent operations, such as creating a cursor or importing data, work in the same way too.
A2: Create a cursor on A1’s file object. The first row is read as the column names, and tab is the default separator. A cursor lets us process big files by reading the data in segments rather than all at once.
A3: Filter the cursor’s records by the specified condition: born on or after January 1, 1981, and female.
A4: Export the filtered result to emp-result.gz. Because of the .gz extension, the file is compressed automatically in gzip format in Hadoop. Hadoop also supports other compression formats, such as lzo and lz4 (see the Hadoop documentation for details).
A5: Import all data from emp-result.gz. Because of the .gz extension, the file is decompressed automatically from gzip format and read as ordinary text. The import function loads the entire data set into memory.
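For readers who do not know esProc syntax, the logic of cells A2 to A5 can be sketched in plain Python. This is only an illustration under stated assumptions, not what esProc runs internally: it works on an in-memory buffer instead of HDFS, it uses only a few of the rows listed earlier, and the names SAMPLE, read_rows, select_rows, export_gzip, import_gzip and run_pipeline are hypothetical helpers made up for this example.

```python
import csv
import gzip
import io
from datetime import date

# A few rows of employee.txt from the listing above, assumed tab-separated.
SAMPLE = (
    "EID\tNAME\tSURNAME\tGENDER\tSTATE\tBIRTHDAY\tHIREDATE\tDEPT\tSALARY\n"
    "1\tRebecca\tMoore\tF\tCalifornia\t1974-11-20\t2005-03-11\tR&D\t7000\n"
    "4\tEmily\tSmith\tF\tTexas\t1985-03-07\t2006-08-15\tHR\t7000\n"
    "6\tMatthew\tJohnson\tM\tCalifornia\t1984-07-07\t2005-07-07\tSales\t11000\n"
    "9\tVictoria\tDavis\tF\tTexas\t1983-12-07\t2009-12-07\tHR\t3000\n"
)

CUTOFF = date(1981, 1, 1)


def read_rows(text_stream):
    """Like A2: read records one at a time; the first row gives the column names."""
    return csv.DictReader(text_stream, delimiter="\t")


def select_rows(rows):
    """Like A3: keep female employees born on or after the cutoff date."""
    for row in rows:
        if row["GENDER"] == "F" and date.fromisoformat(row["BIRTHDAY"]) >= CUTOFF:
            yield row


def export_gzip(rows, fieldnames, binary_sink):
    """Like A4: write a header row plus the records as gzip-compressed text."""
    with gzip.open(binary_sink, "wt", encoding="utf-8", newline="") as out:
        writer = csv.DictWriter(
            out, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writeheader()
        writer.writerows(rows)


def import_gzip(binary_source):
    """Like A5: decompress the data and load every record into memory."""
    with gzip.open(binary_source, "rt", encoding="utf-8", newline="") as src:
        return list(csv.DictReader(src, delimiter="\t"))


def run_pipeline(text):
    """Chain the four steps, using an in-memory buffer in place of HDFS."""
    buffer = io.BytesIO()
    reader = read_rows(io.StringIO(text))
    export_gzip(select_rows(reader), reader.fieldnames, buffer)
    buffer.seek(0)
    return import_gzip(buffer)


if __name__ == "__main__":
    for record in run_pipeline(SAMPLE):
        print(record["EID"], record["NAME"], record["BIRTHDAY"])
```

As with the cursor in A2, the filter here is a generator, so records pass through one at a time while the compressed output is written. Only the last step holds the whole result in memory, just as import does in A5.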
The final result is as follows: