Thursday, April 25, 2013

Big Data Project dictionary

Here is a list of interesting projects in the big data ecosystem.

Ambari: The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs. [Apache]

Avro: A data serialization system for distributed/big data applications. [Apache]
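
For illustration, here is a minimal sketch of Avro serialization from Python, assuming the third-party fastavro package (the official avro package works similarly); the schema and field names are made up for the example.

```python
# Hedged sketch: write and read an Avro container file with fastavro
# (assumed to be installed); schema and records are hypothetical.
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "namespace": "example",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})

records = [{"name": "Alice", "age": 30}, {"name": "Bob", "age": 25}]

# Serialize the records to a file, then deserialize them back.
with open("users.avro", "wb") as out:
    writer(out, schema, records)

with open("users.avro", "rb") as inp:
    for record in reader(inp):
        print(record)
```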

BigTable: Google's distributed, non-relational storage system, spanning terabytes of memory and petabytes of storage and able to handle millions of reads or writes per second. [Google]

Chubby: Google's lock service for loosely coupled distributed systems. [Google]

Chukwa: Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data. [Apache]

Colossus: The successor to the Google File System (GFS), which enables real-time processing. [Google]

D3: D3 (Data-Driven Documents) is a JavaScript document visualization library that greatly expands how powerfully and creatively information can be visualized.

Dremel: Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. [Google]

Drill: Apache Drill provides low-latency ad-hoc queries to many different data sources, including nested data. Inspired by Google's Dremel, Drill is designed to scale to 10,000 servers and query petabytes of data in seconds. [Apache]
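
As a rough sketch of the "ad-hoc queries" idea, the snippet below submits SQL to a local Drill instance over its REST interface using the requests package; the port, endpoint, and sample table are assumptions based on a default local installation.

```python
# Hedged sketch: POST a SQL query to Drill's REST API (assumed to be at
# localhost:8047); the classpath sample table is an assumption.
import requests

resp = requests.post(
    "http://localhost:8047/query.json",
    json={"queryType": "SQL",
          "query": "SELECT * FROM cp.`employee.json` LIMIT 5"},
)
resp.raise_for_status()
for row in resp.json().get("rows", []):
    print(row)
```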

Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. [Apache]

GFS: Google's distributed file system, the inspiration behind Apache Hadoop's HDFS. [Google]

Giraph: Apache Giraph is an iterative graph processing system built for high scalability. For example, it is currently used at Facebook to analyze the social graph formed by users and their connections. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google. [Apache]

Gizzard: Twitter's flexible sharding framework for creating eventually-consistent distributed datastores. [Twitter]

Gora: The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support. [Apache]

Gremlin: A graph traversal programming language for multi-relational property graphs, where vertices and edges carry key/value attributes.

HBase: Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows by millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable. [Apache] [Inspired by Google's BigTable]
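
A minimal sketch of the random read/write pattern from Python, assuming the third-party happybase client and HBase's Thrift gateway running locally; the table and column family names are hypothetical and must already exist.

```python
# Hedged sketch: put a cell into HBase and read the row back by key,
# using the happybase client (assumed) via the Thrift server.
import happybase

connection = happybase.Connection("localhost")  # Thrift gateway, default port
table = connection.table("users")               # hypothetical, pre-created table

# Write one row, then fetch it back by its row key.
table.put(b"row-1", {b"info:name": b"Alice", b"info:age": b"30"})
print(table.row(b"row-1"))
```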

HDInsight: Microsoft's service that deploys and provisions Apache Hadoop clusters in the cloud (Azure), providing a software framework designed to manage, analyze, and report on big data. [Microsoft]

Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. [Apache]
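
To give a feel for HiveQL, here is a hedged sketch that runs a query from Python, assuming the third-party PyHive package and a HiveServer2 instance on localhost; the table name is hypothetical.

```python
# Hedged sketch: execute a HiveQL query through PyHive (assumed);
# the weblogs table is hypothetical.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# HiveQL looks like SQL; Hive translates it into jobs over data in HDFS.
cursor.execute("SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page")
for page, hits in cursor.fetchall():
    print(page, hits)
```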

Hue: A web UI for Hadoop; an open source project started by Cloudera. [Cloudera]

Impala: Open source, distributed SQL query engine for real-time querying on Apache Hadoop. [Initiated by Cloudera]

Kafka: Kafka is a distributed, partitioned, replicated commit log service. It provides the functionality of a messaging system, but with a unique design. [Apache]
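
A minimal sketch of the publish/subscribe usage, assuming the third-party kafka-python client and a broker on localhost; the topic name and message are made up.

```python
# Hedged sketch: produce and consume messages with kafka-python (assumed);
# broker address and topic are hypothetical.
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b"user=42 page=/home")
producer.flush()

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # read from the beginning of the log
    consumer_timeout_ms=5000,       # stop iterating after 5s of silence
)
for message in consumer:
    print(message.value)
```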

Lucene: A high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform. [Apache]

Mahout: A library of scalable machine learning algorithms, implemented for distributed execution on the Hadoop platform. [Apache]

Oozie: A workflow scheduler engine for Hadoop that runs workflows based on time and data triggers. [Apache]

Pig: A high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. [Apache]
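
As a sketch of what a Pig Latin program looks like, the snippet below writes a small word-count script to disk and hands it to the pig command-line tool; the input and output paths are hypothetical, and a configured Pig installation is assumed.

```python
# Hedged sketch: generate a Pig Latin word-count script and run it with
# the `pig` CLI (assumed to be installed and configured for the cluster).
import subprocess

script = """
lines  = LOAD 'hdfs:///data/input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'hdfs:///data/wordcount';
"""

with open("wordcount.pig", "w") as f:
    f.write(script)

subprocess.run(["pig", "-f", "wordcount.pig"], check=True)
```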

Pregel: Google's large scale graph computing system. [Google]

Prism: Facebook's big data management framework.

RHadoop: An open source project spearheaded by Revolution Analytics to grant data scientists access to Hadoop’s scalability from the R language. [Revolution Analytics]

SAP HANA: SAP AG’s implementation of in-memory database technology. [SAP]

Spanner: Google's globally distributed relational database management system (RDBMS), the successor to BigTable. [Google]

Spark: An open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. [UC Berkeley AMPLab]
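
A minimal PySpark word-count sketch to show the "fast to write" side; it assumes a local Spark installation and a hypothetical input file.

```python
# Hedged sketch: classic word count on a local Spark context;
# "input.txt" is a hypothetical file.
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sketch")
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
sc.stop()
```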



Sqoop: A tool for bulk transfer of data between relational databases and Hadoop clusters, in both directions. [Apache]
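
For illustration, a hedged sketch of invoking the Sqoop command line from Python to pull a MySQL table into HDFS; the JDBC URL, credentials, table, and target directory are all hypothetical.

```python
# Hedged sketch: call the sqoop CLI (assumed to be on PATH) to import a
# relational table into HDFS; connection details are hypothetical.
import subprocess

subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",
    "--username", "etl_user", "--password", "secret",
    "--table", "orders",
    "--target-dir", "/user/etl/orders",
], check=True)
```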

Storm: A distributed, fault-tolerant realtime computation system. [Twitter]

ZooKeeper: A centralized service for maintaining configuration information, naming, and providing distributed synchronization and group services. [Apache]
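
A minimal sketch of reading and writing configuration znodes from Python, assuming the third-party kazoo client and a ZooKeeper server on the default local port; the paths and values are hypothetical.

```python
# Hedged sketch: store and fetch a configuration value in ZooKeeper
# with kazoo (assumed); node paths and data are hypothetical.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

zk.ensure_path("/app/config")                      # create parents if missing
if not zk.exists("/app/config/db_host"):
    zk.create("/app/config/db_host", b"dbhost.example.com")

value, stat = zk.get("/app/config/db_host")
print(value.decode())

zk.stop()
```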

