esProc Heterogeneous Datasource – Hadoop


esProc’s Hadoop data sources are mainly HDFS files and Hive databases. Hive is accessed in much the same way as any other database, so it is not discussed here. esProc has a built-in method for accessing HDFS files that closely resembles the way it accesses files in an ordinary file system. The following example shows how to write such an esProc script.

The HDFS text file employee.txt holds employee data. The task is to import the data, find the female employees born on or after January 1, 1981, and export the result to a file in a compressed format.

The data of employee.txt are as follows:

EID  NAME      SURNAME  GENDER  STATE         BIRTHDAY    HIREDATE    DEPT       SALARY
1    Rebecca   Moore    F       California    1974-11-20  2005-03-11  R&D        7000
2    Ashley    Wilson   F       New York      1980-07-19  2008-03-16  Finance    11000
3    Rachel    Johnson  F       New Mexico    1970-12-17  2010-12-01  Sales      9000
4    Emily     Smith    F       Texas         1985-03-07  2006-08-15  HR         7000
5    Ashley    Smith    F       Texas         1975-05-13  2004-07-30  R&D        16000
6    Matthew   Johnson  M       California    1984-07-07  2005-07-07  Sales      11000
7    Alexis    Smith    F       Illinois      1972-08-16  2002-08-16  Sales      9000
8    Megan     Wilson   F       California    1979-04-19  1984-04-19  Marketing  11000
9    Victoria  Davis    F       Texas         1983-12-07  2009-12-07  HR         3000
10   Ryan      Johnson  M       Pennsylvania  1976-03-12  2006-03-12  R&D        13000
11   Jacob     Moore    M       Texas         1974-12-16  2004-12-16  Sales      12000
12   Jessica   Davis    F       New York      1980-09-11  2008-09-11  Sales      7000
13   Daniel    Davis    M       Florida       1982-05-14  2010-05-14  Finance    10000
…

To develop and debug a program in esProc’s Integrated Development Environment (IDE), first copy Hadoop’s core package and configuration packages, such as commons-configuration-1.6.jar, commons-lang-2.4.jar, and hadoop-core-1.2.1.jar (Hadoop 1.2.1), into the “esProc installation directory\esProc\lib” directory.

esProc code for processing the file:

  A
1 =hdfsfile("hdfs://192.168.1.11:9000/user/employee.txt","UTF-8")
2 =A1.cursor@t()
3 =A2.select(BIRTHDAY>=date(1981,1,1) && GENDER=="F")
4 =hdfsfile("hdfs://192.168.1.11:9000/user/emp-result.gz").export@t(A3)
5 =hdfsfile("hdfs://192.168.1.11:9000/user/emp-result.gz").import@t()

A1: Define an HDFS file object. “UTF-8” is the charset used to encode the file; if it is omitted, the JVM’s default charset is used. An HDFS file object is defined in much the same way as a file object in an ordinary file system, and subsequent operations, such as declaring a cursor and importing data, work in the same way too.

A2: Define a cursor on A1’s file object, with the first row imported as the column names (the @t option) and tab as the default separator. A cursor lets us process big files by importing the data in segments.
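The idea behind a cursor, reading a large file a segment at a time instead of all at once, can be illustrated outside esProc. The following is only a minimal Python sketch of that idea, not esProc’s implementation: the helper name fetch_batches, the batch size, and the shortened in-memory sample are assumptions made for illustration.

```python
import csv
import io
from itertools import islice

# Shortened, in-memory stand-in for the tab-separated file (assumption:
# only four of the original columns are kept to keep the example small).
SAMPLE = (
    "EID\tNAME\tSURNAME\tGENDER\n"
    "1\tRebecca\tMoore\tF\n"
    "2\tAshley\tWilson\tF\n"
    "3\tRachel\tJohnson\tF\n"
)

def fetch_batches(stream, batch_size):
    """Yield lists of at most batch_size records, reading the stream lazily."""
    # The header row supplies the column names; tab is the separator.
    reader = csv.DictReader(stream, delimiter="\t")
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch

batches = list(fetch_batches(io.StringIO(SAMPLE), batch_size=2))
print([len(b) for b in batches])
```

Only one batch is held in memory at a time while iterating, which is why this style of access suits files that are too large to load in full.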

A3: Filter cursor data according to the specified condition.
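To make the condition concrete, here is a small Python sketch that applies the same test (birthday on or after 1981-01-01 and gender F) to a few of the sample rows listed earlier. It is an illustration of the logic only; the names ROWS, CUTOFF, and is_match are hypothetical and do not belong to esProc.

```python
from datetime import date

# (EID, NAME, GENDER, BIRTHDAY) copied from the sample rows of employee.txt.
ROWS = [
    (1, "Rebecca", "F", date(1974, 11, 20)),
    (4, "Emily", "F", date(1985, 3, 7)),
    (6, "Matthew", "M", date(1984, 7, 7)),
    (9, "Victoria", "F", date(1983, 12, 7)),
    (13, "Daniel", "M", date(1982, 5, 14)),
]

CUTOFF = date(1981, 1, 1)

def is_match(row):
    """Same test as the A3 expression: BIRTHDAY >= cutoff and GENDER is F."""
    _, _, gender, birthday = row
    return birthday >= CUTOFF and gender == "F"

selected = [row[0] for row in ROWS if is_match(row)]
print(selected)  # [4, 9]
```

Among the thirteen rows shown in the sample, only employees 4 (Emily Smith) and 9 (Victoria Davis) satisfy both parts of the condition.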

A4: Export the filtered result to emp-result.gz. Because of the .gz extension, the file is compressed automatically in gzip format in Hadoop. Hadoop also supports other compression formats, such as lzo and lz4 (see the Hadoop documentation for details).

A5: Import all data from emp-result.gz. Because of the .gz extension, the file is decompressed automatically from the gzip format into ordinary text. The import function loads the whole data set into memory.
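The export and import round trip through a gzip file can be tried locally with Python’s standard library. This is a rough analogy under stated assumptions rather than the HDFS mechanism itself: a temporary local directory stands in for HDFS, the sample text is abbreviated, and Python requires gzip to be requested explicitly, whereas in the esProc script the .gz extension triggers it.

```python
import gzip
import os
import tempfile

# Abbreviated stand-in for the filtered result (assumption: three columns only).
text = "EID\tNAME\tGENDER\n4\tEmily\tF\n9\tVictoria\tF\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "emp-result.gz")  # local stand-in for the HDFS path

    # Write the text through a gzip stream (the "export" step).
    with gzip.open(path, "wt", encoding="utf-8") as f:
        f.write(text)

    # Every gzip file starts with the two magic bytes 0x1f 0x8b.
    with open(path, "rb") as f:
        magic = f.read(2)

    # Read it back as plain text (the "import" step).
    with gzip.open(path, "rt", encoding="utf-8") as f:
        restored = f.read()

print(magic == b"\x1f\x8b", restored == text)
```

The check on the first two bytes confirms that the file on disk really is gzip data, and the comparison confirms that decompression returns the original text unchanged.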

The final result is as follows:

[Figure: esProc_datasource_hadoop]

 
