Here is a list of interesting projects associated with the big data ecosystem.
Ambari: The Apache Ambari project is aimed at making Hadoop management simpler
by developing software for provisioning, managing, and monitoring Apache
Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop
management web UI backed by its RESTful APIs. [Apache]
Avro: A data serialization system for distributed/big data applications. [Apache]
BigTable: Google's non-relational distributed storage system, which spans terabytes of memory and petabytes of storage and can handle millions of reads or writes per second. [Google]
Chubby: Google's lock service for loosely coupled distributed systems. [Google]
Chukwa: Chukwa is an open source data collection system for monitoring large distributed systems.
Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits
Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring
and analyzing results to make the best use of the collected data.
Colossus: The successor to the Google File System (GFS), designed to support real-time processing. [Google]
D3: D3 (Data-Driven Documents) is a JavaScript library for building data visualizations in the browser by binding data to document elements.
Dremel: Dremel is a scalable, interactive ad-hoc query system for
analysis of read-only nested data. By combining multi-level execution trees and
columnar data layout, it is capable of running aggregation queries over trillion-row
tables in seconds. [Google]
Drill: Apache Drill provides low-latency ad-hoc queries over many different data
sources, including nested data. Inspired by Google's Dremel, Drill is
designed to scale to 10,000 servers and query petabytes of data in
seconds. [Apache]
Flume: Flume is a distributed, reliable, and available service for
efficiently collecting, aggregating, and moving large amounts of log
data. It has a simple and flexible architecture based on streaming
data flows. It is robust and fault tolerant with tunable reliability
mechanisms and many failover and recovery mechanisms. [Apache]
GFS: Google's distributed file system, the inspiration behind Apache Hadoop's HDFS. [Google]
Giraph: Apache Giraph is an iterative graph processing system built for
high scalability. For example, it is currently used at Facebook to
analyze the social graph formed by users and their connections. Giraph
originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google. [Apache]
Gizzard: Twitter's flexible sharding framework for creating eventually-consistent distributed datastores. [Twitter]
Gora: The Apache Gora open source framework provides an in-memory data model and persistence
for big data. Gora supports persisting to column stores, key value stores, document
stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support. [Apache]
Gremlin: Gremlin is a graph traversal language for querying property graphs, that is, multi-relational graphs whose vertices and edges carry key-value pairs.
HBase: Use Apache HBase when you need random, realtime read/write access to
your Big Data.
This project's goal is the hosting of very large tables -- billions
of rows X millions of columns -- atop clusters of commodity hardware.
Apache HBase is an open-source, distributed, versioned, column-oriented
store modeled after Google's Bigtable. [Apache] [Inspired by Google BigTable]
HDInsight: Microsoft's service that deploys and provisions Apache Hadoop clusters in the
cloud (Azure), providing a software framework designed to manage, analyze and
report on big data. [Microsoft]
Hive: Hive is a data warehouse system for Hadoop that facilitates
easy data summarization, ad-hoc queries, and the analysis of large
datasets stored in Hadoop compatible file systems. Hive provides a
mechanism to project structure onto this data and query the data using a
SQL-like language called HiveQL.
Hue: Hue is an open source web UI for Hadoop, started by Cloudera.
Impala: Open source, distributed SQL query engine for real-time querying on Apache Hadoop. [Initiated by Cloudera]
Kafka: Kafka is a distributed, partitioned, replicated commit log service. It
provides the functionality of a messaging system, but with a unique
design. [Apache]
Lucene: A high-performance, full-featured text search engine library written
entirely in Java. It is a technology suitable for nearly any application
that requires full-text search, especially cross-platform. [Apache]
Mahout: A library of scalable, distributed machine learning algorithms for the Hadoop platform. [Apache]
Oozie: Engine for running workflows based on time and data triggers.
Pig: A high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
Pregel: Google's large scale graph computing system. [Google]
Prism: Facebook's big data management framework.
RHadoop: RHadoop is an open source project spearheaded by Revolution Analytics to
give data scientists access to Hadoop’s scalability from the R language. [Revolution Analytics]
SAP HANA: SAP AG’s implementation of in-memory database technology. [SAP]
Spanner: Google's globally distributed relational database management system (RDBMS), often described as a successor to BigTable. [Google]
Spark: Spark is an open source cluster computing system that aims to make data analytics fast, both to run and to write.
Sqoop: A tool for importing data from relational databases into Hadoop clusters and exporting it back.
Storm: A distributed and fault-tolerant realtime computation system. [Twitter]
ZooKeeper: A centralized service for maintaining configuration information, naming, distributed synchronization, and group services.
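
As a rough illustration of what the Kafka entry above means by a "partitioned commit log", here is a minimal in-memory sketch in Python. It is not Kafka's API or implementation: the class name PartitionedLog, its methods, and the CRC32-based partition choice are all assumptions made for this example, and replication is left out entirely. It only shows the two core ideas: records are appended to one of several ordered logs, and readers address records by partition and offset without consuming them.

```python
import zlib


class PartitionedLog:
    """Toy in-memory stand-in for a partitioned, append-only commit log.

    Hypothetical illustration only; this is not Kafka's client API.
    """

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        # One append-only list of (key, value) records per partition.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        """Append a record and return (partition, offset).

        Records with the same key always land in the same partition,
        so their relative order is preserved.
        """
        partition = zlib.crc32(key.encode("utf-8")) % self.num_partitions
        log = self.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1

    def read(self, partition, offset, max_records=10):
        """Return up to max_records records starting at offset.

        Reading does not remove anything; a reader tracks its own offset.
        """
        return self.partitions[partition][offset:offset + max_records]


log = PartitionedLog(num_partitions=3)
p1, o1 = log.append("user-42", "login")
p2, o2 = log.append("user-42", "click")
records = log.read(p1, o1)
```

Because reads never delete records, several independent readers can replay the same partition from different offsets, which is the main difference from a traditional message queue.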