Big Data Testing Scenarios

Let us look at the scenarios in which Big Data Testing applies to each Big Data component:-

Data Ingestion :-

This step is considered the pre-Hadoop stage, where data is generated by multiple sources and flows into HDFS. In this step the tester verifies that data is extracted properly and loaded into HDFS.

Ensure that the proper data from multiple data sources is ingested, i.e. all required data is ingested as per its defined schema and data not matching the schema is not ingested. Data that does not match the schema should be stored for stats reporting purposes. Also ensure there is no data corruption.
Compare the source data with the ingested data to validate that the correct data is pushed.
Verify that the correct data files are generated and loaded into the desired location in HDFS.
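The ingestion checks above can be sketched in a few lines of Python. This is only a minimal illustration, not a fixed method: the `SCHEMA` fields and the function names `matches_schema`, `split_by_schema` and `fingerprint` are hypothetical, and records are assumed to be already parsed into dictionaries.

```python
import hashlib

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"id": int, "name": str, "amount": float}


def matches_schema(record, schema=SCHEMA):
    """Return True if the record has exactly the schema's fields with the right types."""
    if set(record) != set(schema):
        return False
    return all(isinstance(record[field], expected) for field, expected in schema.items())


def split_by_schema(records, schema=SCHEMA):
    """Separate records into (accepted, rejected) so rejects can be kept for stats reporting."""
    accepted, rejected = [], []
    for record in records:
        (accepted if matches_schema(record, schema) else rejected).append(record)
    return accepted, rejected


def fingerprint(records):
    """Order-independent digest of a record set, for comparing source with ingested data."""
    digests = sorted(
        hashlib.sha256(repr(sorted(record.items())).encode("utf-8")).hexdigest()
        for record in records
    )
    return hashlib.sha256("".join(digests).encode("utf-8")).hexdigest()
```

A test would run `split_by_schema` on the source records, then compare `fingerprint(accepted)` with the fingerprint of the records read back from the ingestion target. Equal fingerprints suggest that nothing was lost or corrupted on the way in.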

Data Processing :-

This step validates the Map-Reduce jobs. Map-Reduce is a programming model for condensing large amounts of data into aggregated data. The ingested data is processed by executing Map-Reduce jobs, which produce the desired results. In this step the tester verifies that the ingested data is processed by the Map-Reduce jobs and validates whether the business logic is implemented correctly.

Ensure the Map-Reduce jobs run properly without any exceptions.
Ensure key-value pairs are correctly generated after the MR jobs.
Validate that the business rules are implemented on the data.
Validate that data aggregation is implemented and that the data is consolidated after the reduce operations.
Validate that the data is processed correctly by the Map-Reduce jobs by comparing the output files with the input files.
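One way to compare output files with input files is to recompute the expected result from a small input with an independent reference implementation and then diff it against the job output. The sketch below assumes a hypothetical business rule (sum the positive amounts per customer), input lines of the form `customer,amount`, and job output lines of the form `key<TAB>value`. All function names are illustrative.

```python
from collections import defaultdict


def expected_totals(input_lines):
    """Reference implementation of the hypothetical rule:
    sum the amount per customer, skipping non-positive amounts."""
    totals = defaultdict(int)
    for line in input_lines:
        customer, amount = line.strip().split(",")
        amount = int(amount)
        if amount > 0:
            totals[customer] += amount
    return dict(totals)


def parse_job_output(output_lines):
    """Parse 'key<TAB>value' lines (the output format assumed for this sketch)."""
    result = {}
    for line in output_lines:
        key, value = line.rstrip("\n").split("\t")
        if key in result:
            # After a correct reduce step each key should appear only once.
            raise ValueError(f"duplicate key after reduce: {key}")
        result[key] = int(value)
    return result


def diff_results(expected, actual):
    """Return keys that are missing, unexpected, or carry a different value."""
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "mismatched": sorted(
            key for key in expected.keys() & actual.keys() if expected[key] != actual[key]
        ),
    }
```

If all three lists returned by `diff_results` are empty, the job output agrees with the reference computation for that sample.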

Note: – For validation at the data ingestion or data processing layers, we should use a small set of sample data (in KBs or MBs). With a small sample we can easily verify that the correct data is ingested by comparing source data with output data at the ingestion layer. It also becomes easier to verify that the MR jobs run without errors, that business rules are correctly implemented on the ingested data, and that data aggregation is done correctly, by comparing the output file with the input file.

If we start testing the data ingestion or data processing layers with large data (in GBs), it becomes very difficult to check each input record against its output record and to validate whether the business rules are implemented correctly.
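When only a large data set is available, a small sample can be drawn from it in a single pass. The sketch below uses reservoir sampling with a fixed seed so that the same sample is produced on every run. The function name `sample_lines` and the default seed are assumptions made for illustration.

```python
import random


def sample_lines(lines, k, seed=42):
    """Pick k items uniformly at random from an iterable of unknown length
    (reservoir sampling). A fixed seed keeps the sample reproducible."""
    rng = random.Random(seed)
    reservoir = []
    for index, line in enumerate(lines):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(line)
        else:
            # Replace an existing item with probability k / (index + 1).
            j = rng.randint(0, index)
            if j < k:
                reservoir[j] = line
    return reservoir
```

Because the function accepts any iterable, it can be given an open file handle, so the large file never has to be loaded into memory.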

Data Storage :-

This step stores the output data in HDFS or another storage system (such as a Data Warehouse). In this step the tester verifies that the output data is correctly generated and loaded into the storage system.

Validate that the data is aggregated after the Map-Reduce jobs.
Verify that the correct data is loaded into the storage system and that any intermediate data still present is discarded.
Verify that there is no data corruption by comparing the output data with the data in HDFS (or any other storage system).
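A simple corruption check compares checksums of each output file and its stored copy. The sketch below assumes both copies are reachable as local file paths (for example after copying them out of the storage system for the test). The names `file_sha256` and `find_corrupted` are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(pairs):
    """Given (output_path, stored_path) pairs, return the pairs whose contents differ."""
    return [
        (output_path, stored_path)
        for output_path, stored_path in pairs
        if file_sha256(output_path) != file_sha256(stored_path)
    ]
```

An empty list from `find_corrupted` means every stored file is byte-for-byte identical to the output it was loaded from.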

The other types of testing scenarios a Big Data tester can cover are:-

Check whether proper alert mechanisms are implemented, such as mail on alert, sending metrics to Amazon CloudWatch, etc.
Check that exceptions or errors are displayed with an appropriate message, so that solving an error becomes easy.
Performance testing: process a random chunk of large data and monitor parameters such as the time taken to complete the Map-Reduce jobs, memory utilization, disk utilization and other metrics as required.
Integration testing: test the complete workflow end to end, from data ingestion to data storage/visualization.
Architecture testing: test that Hadoop is highly available at all times and that failover services are properly implemented, so that data is processed even when nodes fail.
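For the performance scenario, the idea of recording metrics and checking them against agreed limits can be sketched as below. Note the assumption: this measures a callable inside the local Python process only, as a stand-in for a real job. On an actual cluster the timings and resource figures would come from the cluster's own monitoring. The names `measure` and `within_budget` are hypothetical.

```python
import time
import tracemalloc


def measure(job, *args, **kwargs):
    """Run a callable and return (result, elapsed_seconds, peak_bytes).
    Peak memory covers Python allocations of this process only."""
    tracemalloc.start()
    started = time.perf_counter()
    try:
        result = job(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - started
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak


def within_budget(elapsed, peak, max_seconds, max_bytes):
    """Check the measured values against the limits agreed for the test."""
    return elapsed <= max_seconds and peak <= max_bytes
```

A performance test would call `measure` on the processing step for a chosen chunk of data and fail if `within_budget` returns False.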

Note: – For testing it is very important to generate test data that covers the various test scenarios (positive and negative). Positive test scenarios exercise the desired functionality directly, with data the system is expected to accept. Negative test scenarios fall outside the desired functionality, for example invalid or unexpected data, and check that the system handles such cases properly.
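Test data for both kinds of scenario can be generated with a small script. The sketch below assumes a hypothetical record format of `id,name,amount` with a numeric id and amount. The functions `positive_records`, `negative_records` and `is_valid` are illustrative only.

```python
import random


def positive_records(count, seed=0):
    """Generate well-formed 'id,name,amount' lines (seeded for repeatable runs)."""
    rng = random.Random(seed)
    return [f"{i},user{i},{rng.randint(1, 1000)}" for i in range(1, count + 1)]


def negative_records():
    """Hand-picked malformed lines, one per failure mode we want to cover."""
    return [
        "1,user1",               # missing field
        "abc,user2,100",         # non-numeric id
        "3,,100",                # empty name
        "4,user4,not-a-number",  # non-numeric amount
        "5,user5,100,extra",     # extra field
        "",                      # empty line
    ]


def is_valid(line):
    """Validation rule assumed for this sketch: three fields, digits-only id and amount."""
    parts = line.split(",")
    if len(parts) != 3:
        return False
    record_id, name, amount = parts
    return record_id.isdigit() and bool(name) and amount.isdigit()
```

Feeding both lists through the pipeline, the positive records should all appear in the output and the negative ones should all end up in the rejected set.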

List of a few tools used in Big Data

Data Ingestion – Kafka, ZooKeeper, Sqoop, Flume, Storm, Amazon Kinesis.
Data Processing – Hadoop (Map-Reduce), Cascading, Oozie, Hive, Pig.
Data Storage – HDFS (Hadoop Distributed File System), Amazon S3, HBase.