Hadoop Is Not a Low-Cost Solution


Hadoop is a popular parallel computing system for big data, best known for being cheap and for its ability to scale out. However, "cheap" applies only to its hardware and software licensing, not to the total cost. In terms of learning, development and management costs, Hadoop has no advantage, and in some cases even its hardware cost is not an advantage.

High learning costs. Hadoop has a huge ecosystem comprising dozens of related products, such as MapReduce, HDFS, Hive, HBase, YARN, ZooKeeper, Avro, Jaql, Pig, Solr, Mesos, Shark, Storm and other stream-processing tools. Learning how to deploy each of these products, what features it offers and how to develop against it is expensive.

It is not easy to pick the right version of Hadoop for your project. Hadoop versioning is chaotic: there are two generations, each with more than a dozen releases, and every release introduces different new features. The relatively stable versions are 0.20.x, 0.21.x, 0.22.x, 1.0.x, 0.23.x and 2.x.

[Figure: the Hadoop ecosystem]

This is just the beginning.

Because Hadoop is not mature enough and not fully commercialized, users cannot stop learning once they enter the development stage; only continued study lets them identify buggy or defective products and versions in time. Commercial distributions such as Cloudera or Hortonworks are more mature, but they are not yet stable enough either. In addition, before design and development begin, don't forget to learn the Hadoop development framework itself: its structure is very complex, and getting answers often requires diving deep into the source code.

High development costs. Hadoop is programmed through MapReduce, which lacks basic computational functions and is not good at computation; for structured data in particular, it offers none at all. Programmers have to implement algorithms such as filtering, aggregation, distinct values, intersection, sorting, ranking, comparison with the previous period, comparison with the same period of the prior year, relative position computation and interval computation themselves. Each of these takes dozens or hundreds of lines of code, is hard to reuse elsewhere, and has to be rewritten for every new task.
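
To make this concrete, here is a minimal sketch, not taken from the article or from Hadoop's documentation, of what a single grouped aggregation, roughly the equivalent of SQL's "SELECT region, SUM(amount) ... GROUP BY region", looks like in raw MapReduce. The CSV layout, field positions and class names are assumptions made for the example.

// Minimal sketch: grouped sum over CSV lines of the form "region,amount".
// Input layout and class names are assumptions for illustration only.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GroupSum {

  // Mapper: parse each CSV line and emit (region, amount).
  public static class GroupMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      if (fields.length < 2) return;                  // skip malformed rows
      context.write(new Text(fields[0]),
                    new DoubleWritable(Double.parseDouble(fields[1])));
    }
  }

  // Reducer (also used as combiner): add up the amounts seen for one region.
  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      for (DoubleWritable v : values) sum += v.get();
      context.write(key, new DoubleWritable(sum));
    }
  }

  // Driver: wire the job together and point it at the input/output paths.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "group sum");
    job.setJarByClass(GroupSum.class);
    job.setMapperClass(GroupMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Even this, one of the simplest operations in the list above, takes roughly fifty lines of boilerplate; ranking, interval or period-over-period computations need custom keys, secondary sorts or multiple chained jobs, and grow accordingly.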

To offset this weakness in structured data computation, Hive and other tools appeared to add SQL-like capabilities to Hadoop. However, this SQL-like functionality is quite limited and still a long way from window functions or stored procedures. Users are frequently forced to fall back on MapReduce for things they cannot express in HiveQL.
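
As a hedged illustration of where that line falls: a period-over-period comparison is a one-line window function on a database, and on Hive releases that support windowing, but on a Hive build without it the query below is simply rejected and the same logic has to be hand-written as a MapReduce job. The connection URL, table name and column names here are hypothetical.

// Hypothetical example: month-over-month delta via Hive's JDBC interface.
// On a Hive version without window-function support, LAG() is rejected.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MonthOverMonth {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");        // HiveServer2 JDBC driver
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", ""); // host/port are assumptions
         Statement stmt = conn.createStatement()) {
      // LAG() compares each month's sales with the previous month's.
      ResultSet rs = stmt.executeQuery(
          "SELECT month, sales, sales - LAG(sales, 1) OVER (ORDER BY month) AS delta "
        + "FROM monthly_sales");                              // hypothetical table
      while (rs.next()) {
        System.out.printf("%s %.2f %.2f%n",
            rs.getString(1), rs.getDouble(2), rs.getDouble(3));
      }
    }
  }
}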

In order to implement ordinary business logic, Hadoop users need to hire more experienced programmers and build an expensive development team, at a high cost in both time and labor. Even so, many complex business logics remain difficult to implement with Hadoop.

High hardware costs for small clusters. If the cluster is small, say fewer than ten machines, Hadoop's hardware costs are sometimes higher than a database's. The reason is that Hadoop's strong fault-tolerance design tends to split a task into very small pieces and distribute them across nodes, with intermediate results written to the file system. This produces heavy scheduling overhead and disk I/O, which lowers performance.

More servers are therefore needed to reach the desired performance. Databases, by contrast, mainly perform batch computation in memory and deliver much higher performance: with the same configuration, a single database server can match several Hadoop nodes, sometimes at lower cost.

[Figure: hardware clusters]

High management costs. As noted, Hadoop's hardware cost is high on small clusters; only when the cluster grows large enough does the unit hardware cost come down. This is the core strength of Hadoop: cheap scale-out. But while large clusters have a low unit hardware cost, high management costs tend to offset the advantage. Electricity, for example, is usually ignored in small-cluster scenarios but becomes very important here: a large cluster means many nodes and huge energy consumption, which is why data centers are often located near power plants.

Meanwhile, the additional cost of data center space cannot be ignored either. Large clusters need more racks, more floor space and more complex redundant power systems, all of which translate into construction costs or rent. Finally, large clusters require more maintenance and operations staff, which is also expensive.

Overall, Hadoop is not a low-cost solution. Users should not be misled by its lower hardware costs; they should select the solution with the lowest total cost for their specific requirements. Databases and other open source or free software are all alternatives worth considering.

Where the costs of electricity, space and labor are high, users can consider a database. In addition to the savings on data center construction, the learning and development costs of a database are much lower than those of Hadoop.

For real-time analysis of large data streams, the open source project Hydra is a better choice. Its underlying architecture is designed for stream data such as log files, and its performance is higher than Hadoop's.

For small and mid-sized clusters, the free software esProc is a better choice. It is low cost, well suited to business computing with complex business rules, and several times faster than Hadoop.

In short, we should bear in mind what Hortonworks’ CTO Eric Baldeschwieler said about costs: hardware costs account for only 20% of total costs for a Hadoop data center.
