How Do We Loop-The-Data-Loop With Hadoop?

Pervasive Software has this month released Data Integrator v10 – Hadoop Edition to the Apache Hadoop user community, with a view to easing the flow of data both to and from Hadoop-based big data stores.

The argument here is that developers are now tasked with integrating big data sets into and out of environments where devices (and the applications running on them) may be working with what can, by comparison, only be called “little data”.

Pervasive CTO Mike Hoskins argues that programmers will need the agility to combine and process data from all their operations within Hadoop’s new, highly scalable data stores.

“The combination of our high performance HDFS and HBase connectors and Pervasive Data Integrator visual ETL tooling eradicates the need for custom MapReduce code for executing data import-export operations,” said Hoskins.

NOTE: Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them across compute nodes throughout a cluster to enable reliable, extremely rapid computations. HBase is the Hadoop database: a distributed, column-oriented store that runs on top of HDFS.
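
To make the note above a little more concrete, here is a minimal sketch of writing a file into HDFS through Hadoop’s Java FileSystem API. The NameNode address and file path are illustrative assumptions only, and nothing here is specific to Pervasive’s tooling.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the cluster's NameNode (address is illustrative)
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits it into blocks and replicates them
        // across the cluster's DataNodes per the configured replication factor
        Path target = new Path("/data/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(target)) {
            out.writeUTF("Hello, HDFS");
        }

        fs.close();
    }
}
```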

“I’m particularly jazzed about our high-performance HBase loading. For the first time, users can (with a single click) move data from traditional data stores including DB2, MySQL, Netezza, PostgreSQL, SQL Server, Oracle, Teradata, and Vertica directly into HBase, the dominant NoSQL database provided free with all Hadoop distributions,” Hoskins added.
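
For contrast with the one-click loading Hoskins describes, the sketch below shows the kind of hand-written HBase client code such tooling is intended to spare developers: a single row inserted through the standard HBase Java API. The table name, column family and ZooKeeper address are hypothetical, and the snippet assumes a recent HBase client library.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseLoadExample {
    public static void main(String[] args) throws Exception {
        // Standard HBase client setup; the ZooKeeper quorum address is illustrative
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk.example.com");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customers"))) {

            // One row keyed by a customer id, with two columns in family "info"
            Put put = new Put(Bytes.toBytes("cust-0001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Ada Lovelace"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("country"),
                          Bytes.toBytes("UK"));

            table.put(put);
        }
    }
}
```

Multiply that by every source table in DB2, Oracle or Teradata and the appeal of generating such loads from visual ETL tooling becomes clear.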

Industry comments suggest that Hadoop may now need visual data integration tooling of this kind, especially given the need to execute increasingly complex workloads against massive amounts of data at high speed. Demand for powerful big data analytics platforms appears to be growing quickly.

Pervasive says it is helping to bring non-Hadoop data into Hadoop more easily, with no MapReduce code required, and this is the loop-the-loop balancing act that now needs to be pulled off.