Getting Started

Introduction - What is BigDime?

Stands for "Big Data Ingestion Made Easy" and it really means it. BigDime provides a programming and configuration model for building data adapters which can then be used to ingest large volume data sets from various data sources into Big Data (HDFS, Hive & HBase) and complements an existing Big Data ecosystems and other ingestions tools such as Flume, etc. BigDime leverages Apache Flume to build the framework's core component. BigDime is now released with few data adaptors (File & Kafka data adaptors; more adaptors in progress) to start with and ready to run right out of the box, and can be extended to add more customization for data cleansing and data standardization to be executed while data being ingested. BigDime does validate the data upon ingesting the data into Big Data and more business validation can be added. Also, comes with Metadata Management and Management Console to keep up with changes in Source data set and to alert when encountering any critical event during ingestion.

BigDime is:

  • Not an analytical tool.
  • Not a data wrangler or an aggregator.
  • Not a replacement for the big data ecosystem.
  • Not an enterprise data bus.

    Inspiration

    Data from a variety of sources needs to be ingested into big data. The data ingestion system must be reliable, easy to use, highly available, configurable and extensible. These requirements are common to most data engineering and analytics-oriented applications. There are solutions available that address these problems, but there is no ecosystem that deals with all of them together. BigDime has been designed to solve these problems and also allows the open source community to contribute.

    Approach

    BigDime emphasizes end-to-end support for data ingestion by providing the following capabilities:

    The framework provides out-of-the-box handlers with which data ingestion can be achieved by creating a BigDime adaptor configuration file. Existing Flume applications can be ported into BigDime.

    Let's look at what BigDime is at a very high level:

    (High-level overview diagram)

    Use Case                          | Throughput (MB/Sec) | Event Size (KB) | Total Events (per second) | Status
    File (4 files in parallel)        | 116                 | 1024            | 116                       | Stable Release Candidate
    Streaming (reading 6 partitions)  | 5                   | 1               | 6000                      | Stable Release Candidate
    SQL                               | ~1                  | NA              | NA                        | 1.0 Release Candidate


    Architecture

    How It Works?

    Understanding the components and flow is the best way to get to know BigDime.

    (High-level component diagram)

    Drilling down further to understand the flow and how BigDime works, let's look under the hood.

    (Overall flow diagram)

    What is an Adaptor?

    An Adaptor is a single unit of work that ingests data from a source into Big Data. The adaptor consists of Source, Channel and Sink modules. The Source and Sink modules are composed of handlers arranged in sequence. Each handler is designed to do a single, simple task and then hand over to the next handler. The Channel is a queue that stages the data to be consumed by the Sink, which gives the Source and Sink the freedom to work at their own pace.

    The adaptor can be configured, started, and stopped. Several adaptors can run concurrently on various schedules within the BigDime framework.
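    To make the Source, Channel and Sink composition concrete, here is a minimal, illustrative Java sketch. The Handler interface and SimpleAdaptor class below are simplified stand-ins invented for this example, not the actual BigDime APIs.

        import java.util.List;
        import java.util.concurrent.BlockingQueue;
        import java.util.concurrent.LinkedBlockingQueue;

        // Illustrative only: simplified stand-ins for BigDime's Source/Channel/Sink modules.
        interface Handler {
            byte[] process(byte[] event) throws Exception; // each handler does one simple task
        }

        class SimpleAdaptor {
            private final List<Handler> sourceHandlers; // first handler reads, the rest cleanse/transform
            private final List<Handler> sinkHandlers;   // e.g. write to HDFS, then validate
            private final BlockingQueue<byte[]> channel = new LinkedBlockingQueue<>(); // staging queue

            SimpleAdaptor(List<Handler> sourceHandlers, List<Handler> sinkHandlers) {
                this.sourceHandlers = sourceHandlers;
                this.sinkHandlers = sinkHandlers;
            }

            // Source side: run the event through each source handler, then stage it on the channel.
            void ingest(byte[] event) throws Exception {
                for (Handler h : sourceHandlers) {
                    event = h.process(event);
                }
                channel.put(event);
            }

            // Sink side: drain the channel at its own pace and run the sink handlers.
            void drainOnce() throws Exception {
                byte[] event = channel.take();
                for (Handler h : sinkHandlers) {
                    event = h.process(event);
                }
            }
        }

    Because the channel decouples the two sides, the Source can keep ingesting even when the Sink is working through a backlog.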

    Adaptor Flow

    (Adaptor flow diagram)

    Adaptor Configuration

    Rules

    Adaptor Context

    The adaptor sets up its basic context when it is instantiated. The context is loaded once when the adaptor starts and provides information such as the source, sink, handler chain, and properties to the running adaptor.

    Adaptor Runtime Information Management (RTIM)

    Runtime Information Management keeps track of adaptor execution by storing the information that ensures continuous data ingestion. RTIM helps maintain offset information for data source entities and their incremental bookmarks so that the adaptor can recover in case of failure. The information includes the last adaptor run date-time, the current incremental value, file names, recovery data, etc.
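    As a rough illustration of the kind of bookkeeping RTIM performs, the sketch below records a per-entity bookmark that a restarted adaptor could read back. The class and field names are hypothetical and chosen only for this example.

        import java.time.Instant;
        import java.util.Map;
        import java.util.concurrent.ConcurrentHashMap;

        // Hypothetical sketch of the information tracked per source entity.
        class RuntimeInfo {
            final String entityName;        // e.g. a file name, table name or Kafka partition
            final String incrementalValue;  // e.g. last offset or last modified timestamp
            final Instant lastRunTime;      // when the adaptor last processed this entity

            RuntimeInfo(String entityName, String incrementalValue, Instant lastRunTime) {
                this.entityName = entityName;
                this.incrementalValue = incrementalValue;
                this.lastRunTime = lastRunTime;
            }
        }

        class RuntimeInfoStore {
            private final Map<String, RuntimeInfo> store = new ConcurrentHashMap<>();

            void save(RuntimeInfo info) { store.put(info.entityName, info); }

            // On restart, the adaptor resumes from the last recorded bookmark instead of re-ingesting.
            RuntimeInfo latest(String entityName) { return store.get(entityName); }
        }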

    Source

    The Source's responsibility is to retrieve data from the adaptor's data source, perform certain operations on it by passing the data through a set of handlers, and then submit it to the Channel. In BigDime, the Source is built by stitching various handlers together. The first handler in the Source typically reads data from an external data source, e.g. a Kafka topic, an RDBMS or a file, and hands over to the next handler(s) to perform any cleansing, standardizing and/or transforming of the data, such as removing null values or decoding an Avro schema, before handing over to the Channel.

    Channel

    Channels are used to store the data until the Sink is able to consume it. The storage can be persistent or non-persistent depending on which Channel implementation is used.

    Type of channels
    Implementations

    The current BigDime implementation uses the Memory Channel. However, Kafka can also be used as a Channel.

    Memory Channel: The memory channel is a non-persistent channel that uses the local heap to store data.

    Data Structure for Memory Channel

    MemoryChannel in BigDime uses an ArrayList as its backing data structure in order to support replicating and multiplexing functionality.
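    Below is a minimal sketch of an ArrayList-backed memory channel, written only to illustrate the idea; it omits the sizing, eviction, replication and multiplexing details of the real implementation.

        import java.util.ArrayList;
        import java.util.List;

        // Illustrative, non-persistent channel backed by an ArrayList (data lives on the local heap).
        class SimpleMemoryChannel<E> {
            private final List<E> buffer = new ArrayList<>();

            synchronized void put(E event) {
                buffer.add(event);
            }

            // A list (rather than a queue) lets several consumers re-read the same events,
            // each tracking its own read index, which is what makes replicating/multiplexing possible.
            synchronized List<E> take(int fromIndex, int maxEvents) {
                int to = Math.min(buffer.size(), fromIndex + maxEvents);
                if (fromIndex >= to) {
                    return new ArrayList<>();
                }
                return new ArrayList<>(buffer.subList(fromIndex, to));
            }
        }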

    Channel Operations
    Known Limitations:

    Sink

    The Sink consumes data from the Channel and writes it to a data store, e.g. HDFS. The Sink consists of a chain of handlers that perform operations such as writing into Big Data and validating the data upon sinking.
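    As an illustration of the write step only, a sink-side handler might push a batch of bytes to HDFS roughly as shown below, assuming the Hadoop client libraries are on the classpath; the surrounding handler plumbing is omitted.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        // Illustrative sink-side step: write a batch of bytes to a file on HDFS.
        class HdfsWriter {
            void write(byte[] payload, String hdfsPath) throws Exception {
                Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
                try (FileSystem fs = FileSystem.get(conf);
                     FSDataOutputStream out = fs.create(new Path(hdfsPath), true)) {
                    out.write(payload);
                }
            }
        }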

    Handler

    A handler exists to handle one and only one task, be it reading data from a file, parsing a file, or translating data from one format to another. A user can implement his or her own logic as a handler. Here are a few examples:

    (Handler manager diagram)
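    For example, a cleansing handler whose only job is to drop null-valued fields might look roughly like the sketch below. The RecordHandler interface is a simplified stand-in invented for this example, not the exact BigDime handler interface.

        import java.util.Map;
        import java.util.stream.Collectors;

        // Simplified stand-in for a handler contract: one handler, one task.
        interface RecordHandler {
            Map<String, Object> process(Map<String, Object> record) throws Exception;
        }

        // Example: remove null-valued fields, then hand the record to the next handler in the chain.
        class NullValueRemovalHandler implements RecordHandler {
            @Override
            public Map<String, Object> process(Map<String, Object> record) {
                return record.entrySet().stream()
                        .filter(e -> e.getValue() != null)
                        .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
            }
        }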

    Meta Data Management

    The Metadata Management module stores the metadata of an adaptor, including but not limited to the schema of the source and target and their data types. It allows the adaptor to adapt to oncoming changes from the source at runtime.

    The Metadata API attempts to overcome this problem by storing the semantics of each metadata element of any schema.
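    As an illustration of what "semantics of each metadata element" could mean in practice, the sketch below models one attribute of a schema; the class and field names are hypothetical.

        // Hypothetical model of one schema element tracked by metadata management.
        class AttributeMetadata {
            final String entityName;    // e.g. table or topic name
            final String attributeName; // e.g. column or field name
            final String dataType;      // e.g. "STRING", "BIGINT"
            final boolean nullable;     // whether the source allows nulls for this attribute

            AttributeMetadata(String entityName, String attributeName, String dataType, boolean nullable) {
                this.entityName = entityName;
                this.attributeName = attributeName;
                this.dataType = dataType;
                this.nullable = nullable;
            }
        }

    If the source adds or changes a column, a new entry of this kind can be recorded so that the adaptor can react to the change at runtime.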

    Overview of Metadata Management

    (Metadata management overview diagram)

    High-level Metadata Concurrency Flow

    (Metadata concurrency flow diagram)

    Data Validation Service

    BigDime comes with a few out-of-the-box data validations to validate the data being ingested and to alert when validation fails.

    An adaptor can have any number of validation handlers (which need to be custom written) sequenced in both the Source and the Sink. A Validation Handler contains a collection of validators, and new custom validators can be created by extending the framework to meet your business needs. Customized validations are configured by adding the validation type value under the properties of the data-validation-handler. Data is validated between source and target by different types of validators based on the data source (file, SQL, etc.), and alerts are raised when validations fail.
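    As a sketch of what a custom validator might look like, the example below compares source and target record counts; the Validator interface shown here is a simplified stand-in, not the exact BigDime validator contract.

        // Simplified stand-in for a data validator used by a validation handler.
        interface Validator {
            boolean validate(long sourceValue, long targetValue);
        }

        // Example: fail (and let the handler raise an alert) when the ingested record count
        // does not match the count read from the source.
        class RecordCountValidator implements Validator {
            @Override
            public boolean validate(long sourceRecordCount, long targetRecordCount) {
                return sourceRecordCount == targetRecordCount;
            }
        }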

    Type of validation:

    Data Validation:
    Schema Validation:

    Management Console

    The Management Console provides a one-stop shop for all the alerting needs of BigDime. The Management Console is seamlessly integrated with the monitoring REST services, and a single interface can serve multiple environments such as production, test and dev. It gives users a quick view of any alerts that might have occurred.

    BigDime Monitoring Service

    What:

    The Monitoring Service provides a REST interface for monitoring needs. It provides REST services to fetch data from the back-end persistent store.

    Why:

    Pluggability. The REST services act as an interface between the back-end implementation and the Management Console.

    Features:

    The Monitoring Service provides services to fetch data in multiple ways. An application implementor can use a default such as the last x days, which is customizable via external properties, or the user can make a call with a specific time range for a given application.
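    Purely as an illustration, a client could fetch recent alerts for one adaptor over HTTP as sketched below; the host, path and query parameters are hypothetical and not the documented BigDime endpoints.

        import java.net.URI;
        import java.net.http.HttpClient;
        import java.net.http.HttpRequest;
        import java.net.http.HttpResponse;

        public class MonitoringClient {
            public static void main(String[] args) throws Exception {
                // Hypothetical endpoint and parameters: alerts for one adaptor over the last 7 days.
                URI uri = URI.create("http://monitoring-host:8080/alerts?application=clickstream-data&days=7");

                HttpClient client = HttpClient.newHttpClient();
                HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
                HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

                System.out.println(response.body()); // payload the Management Console would render
            }
        }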

    Sample Alert & Info Messages

    Below are sample Alert messages

    2015-08-28 08:35:27,314 priority=ERROR adaptor_name="file-data" alert_severity="BLOCKER" message_context="data reading phase" alert_code="BIG-0001" alert_name="ingestion failed" alert_cause="data validation failed" detail_message="data validation failed, topic=clickstream" filepath=/path filename= file.txt input checksum = #01234567, output checksum = #76543210

    2015-08-28 08:35:27,314 priority=ERROR adaptor_name="clickstream-data" alert_severity="BLOCKER" message_context="data reading phase" alert_code="BIG-0001" alert_name="ingestion failed" alert_cause="data validation failed" detail_message="data validation failed, topic=clickstream" partition = 0 startoffset=2180192 ,input checksum = #01234567, output checksum = #76543210

    Below is a sample info message

    2015-08-28 08:35:28,314 priority=INFO adaptor_name="clickstream-data" message_context="data reading phase" detail_message="read 1048576 bytes" topic=clickstream" partition = 0 startoffset=2180192

    Community

    Release | Status                   | Notes
    0.9     | Stable Release Candidate | Initial Committed Release
    1.0     | In Development           | Development in progress

    Get help using BigDime or contribute to the project

    BigDime currently uses GitHub to manage the downloads, documentation, issues and source code.

    Fork me on GitHub

    Documentation

    Group

    Report A Bug

    How To Contribute?

    BigDime follows the standard fork/pull model.

    We look forward to your contribution!

    Contributors

    Acknowledgements

    Download & Install

    Obtain Distribution

    Download

    The release can be downloaded from here. Download the bigdime-dist-${version}-bin.tar.gz file.

    -OR-

    Build

    Install

    Run

    Run the following command from /path/bigdime/bigdime-dist-${version}:

    java -jar -Dloader.path=./config/ -Denv.properties=application.properties bigdime-adaptor-${version}.jar

    Logs/Troubleshooting