Teradata – New Technology RDBMS

Teradata is a popular Relational Database Management System (RDBMS) suited to large-scale data warehousing applications. It handles large volumes of data and is highly scalable.

Teradata achieves this through parallelism. It is developed by the company of the same name, Teradata.

Teradata Features:

  • Unlimited Parallelism − The Teradata database system is based on a Massively Parallel Processing (MPP) architecture, which divides the workload evenly across the entire system. Teradata splits each task among its processes and runs them in parallel so that the task completes quickly.
  • Shared Nothing Architecture − Teradata's architecture is called Shared Nothing Architecture. Teradata nodes, their Access Module Processors (AMPs), and the disks associated with the AMPs work independently and are not shared with others.
  • Linear Scalability − Teradata systems are highly scalable and can scale up to 2048 nodes. For example, you can double the capacity of the system by doubling the number of AMPs.
  • Connectivity − Teradata can connect to channel-attached systems such as mainframes, or to network-attached systems.
  • Mature Optimizer − The Teradata optimizer is one of the most mature optimizers on the market. It was designed to be parallel from the beginning and has been refined with each release.
  • SQL − Teradata supports industry-standard SQL for interacting with the data stored in tables, and adds its own extensions.
  • Robust Utilities − Teradata provides robust utilities such as FastLoad, MultiLoad, FastExport, and TPT to import data into and export data from a Teradata system.
  • Automatic Distribution − Teradata automatically distributes the data evenly across the disks without any manual intervention.

    Components of Teradata

    The key components of Teradata are as follows −

    • Node − The basic unit of a Teradata system. Each individual server in a Teradata system is referred to as a Node. A node consists of its own operating system, CPU, memory, its own copy of the Teradata RDBMS software, and disk space. A cabinet consists of one or more Nodes.
    • Parsing Engine − The Parsing Engine (PE) is responsible for receiving queries from the client and preparing an efficient execution plan. Its responsibilities are −
      • Receive the SQL query from the client
      • Parse the SQL query and check for syntax errors
      • Check whether the user has the required privileges on the objects used in the SQL query
      • Check whether the objects used in the SQL query actually exist
      • Prepare the execution plan for the SQL query and pass it to BYNET
      • Receive the results from the AMPs and send them to the client
    • Message Passing Layer − The Message Passing Layer, called BYNET, is the networking layer of a Teradata system. It allows communication between the PE and the AMPs, and also between nodes. It receives the execution plan from the Parsing Engine and sends it to the AMPs. Similarly, it receives the results from the AMPs and sends them to the Parsing Engine.
    • Access Module Processor (AMP) − AMPs are virtual processors (vprocs) that actually store and retrieve the data. AMPs receive the data and execution plan from the Parsing Engine, perform any data type conversion, aggregation, filtering, and sorting, and store the data on the disks associated with them. Records from the tables are evenly distributed among the AMPs in the system. Each AMP is associated with a set of disks on which data is stored, and only that AMP can read or write data on those disks.

    Storage Architecture

    When the client runs queries to insert records, the Parsing Engine sends the records to BYNET. BYNET receives the records and sends each row to its target AMP, which stores the row on its disks. The following diagram shows the storage architecture of Teradata.

    [Figure: Storage Architecture]
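
The storage flow described above can be sketched in a few lines of Java. This is a toy model, not Teradata's implementation: the AmpRouter class, the use of String.hashCode, and the modulo mapping are assumptions made for illustration. The only point it makes is that each row is routed to exactly one AMP and that the routing is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of row distribution: every row is routed to exactly one AMP,
// chosen by hashing its key. The class name and hash function are
// illustrative assumptions, not Teradata internals.
class AmpRouter {
    private final List<List<String>> amps = new ArrayList<>();

    AmpRouter(int ampCount) {
        for (int i = 0; i < ampCount; i++) {
            amps.add(new ArrayList<>());
        }
    }

    // Map a key to an AMP index in the range [0, ampCount).
    int ampFor(String key) {
        return Math.floorMod(key.hashCode(), amps.size());
    }

    // Parsing Engine -> BYNET -> target AMP: the row is stored on one AMP only.
    void insert(String key) {
        amps.get(ampFor(key)).add(key);
    }

    int rowsOn(int amp) {
        return amps.get(amp).size();
    }

    public static void main(String[] args) {
        AmpRouter router = new AmpRouter(4);
        for (int id = 1; id <= 1000; id++) {
            router.insert("emp-" + id); // hypothetical row keys
        }
        for (int amp = 0; amp < 4; amp++) {
            System.out.println("AMP " + amp + " holds " + router.rowsOn(amp) + " rows");
        }
    }
}
```

Running it prints how many of the 1000 sample rows each of the four simulated AMPs received; how even the split is depends on the hash function chosen.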

    Retrieval Architecture

    When the client runs queries to retrieve records, the Parsing Engine sends a request to BYNET. BYNET sends the retrieval request to the appropriate AMPs. The AMPs then search their disks in parallel, identify the required records, and send them to BYNET. BYNET in turn sends the records to the Parsing Engine, which returns them to the client. The following diagram shows the retrieval architecture of Teradata.

    [Figure: Retrieval Architecture]
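
The retrieval flow can be sketched the same way. In the hedged example below, each inner list stands for the rows owned by one AMP, a parallel stream plays the role of the AMPs scanning their own disks at the same time, and the collected list is the answer set handed back to the client. ParallelScan and the sample values are hypothetical; a JDK 9 or newer is assumed for List.of.

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy model of parallel retrieval: each AMP filters only its own rows,
// and the partial results are merged into one answer set.
class ParallelScan {
    static List<Integer> scan(List<List<Integer>> amps, int threshold) {
        return amps.parallelStream()                                     // AMPs work in parallel
                .flatMap(rows -> rows.stream().filter(v -> v > threshold)) // local scan per AMP
                .sorted()                                                // merged, ordered result
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> amps = List.of(
                List.of(5, 42, 17),
                List.of(99, 3),
                List.of(64, 8, 21));
        System.out.println(scan(amps, 20)); // prints [21, 42, 64, 99]
    }
}
```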

Primary Key

A primary key uniquely identifies a row in a table. A primary key column allows no duplicate values and cannot accept NULL values. It is a mandatory field in a table.

Foreign Key

Foreign keys are used to build a relationship between the tables. A foreign key in a child table is defined as the primary key in the parent table. A table can have more than one foreign key. It can accept duplicate values and also null values. Foreign keys are optional in a table.

Table Types

Teradata supports different types of tables.

  • Permanent Table − This is the default table type. It contains data inserted by the user and stores the data permanently.
  • Volatile Table − The data inserted into a volatile table is retained only during the user session. The table and its data are dropped at the end of the session. These tables are mainly used to hold intermediate data during data transformation.
  • Global Temporary Table − The definition of a Global Temporary table is persistent, but the data in the table is deleted at the end of the user session.
  • Derived Table − A derived table holds the intermediate results in a query. Its lifetime is limited to the query in which it is created, used, and dropped.

Set Versus Multiset

Teradata classifies the tables as SET or MULTISET tables based on how the duplicate records are handled. A table defined as SET table doesn’t store the duplicate records, whereas the MULTISET table can store duplicate records.
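
As a rough analogy only, the difference resembles that between a Set and a List in Java. The sketch below is not Teradata code; the class name and sample rows are made up for illustration, and a JDK 9 or newer is assumed for List.of.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SetVersusMultiset {
    // SET semantics: an exact duplicate row is not stored a second time.
    static Set<String> loadSet(List<String> rows) {
        return new LinkedHashSet<>(rows);
    }

    // MULTISET semantics: every row is stored, duplicates included.
    static List<String> loadMultiset(List<String> rows) {
        return new ArrayList<>(rows);
    }

    public static void main(String[] args) {
        // Hypothetical rows; the first and third are identical.
        List<String> rows = List.of("101,Mike", "102,Robert", "101,Mike");
        System.out.println("SET table rows:      " + loadSet(rows).size());
        System.out.println("MULTISET table rows: " + loadMultiset(rows).size());
    }
}
```

With the three sample rows, the SET-style collection ends up with two rows and the MULTISET-style collection with three.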



Why and How Google MapReduce Technology Works ?

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. For example, the volume of data that Facebook or YouTube must collect and manage on a daily basis falls under the category of Big Data. However, Big Data is not only about scale and volume; it also involves one or more of the following aspects − Velocity, Variety, Volume, and Complexity.

Traditional enterprise systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of a traditional enterprise system. This traditional model is not suitable for processing huge, growing volumes of data, which cannot be accommodated by standard database servers. Moreover, the centralized system becomes a bottleneck when processing multiple files simultaneously.

[Figure: Traditional Enterprise System View]

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.

How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).

The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.

The reduce task is always performed after the map job.

Let us now take a close look at each of the phases and try to understand their significance.

Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.

Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.

Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.

Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies user-defined code to aggregate the values within the small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.

Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.

Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each group. Here, the data can be aggregated, filtered, and combined in a number of ways, which can involve a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.

Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.
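
To make the phases concrete, here is a minimal single-process sketch of a word count in Java. It is not Hadoop or Google code: the class and method names are hypothetical, everything runs in one JVM, and the combiner and output formatter are left out. It only shows how data moves through Map, Shuffle and Sort, and Reduce. A JDK 9 or newer is assumed for List.of.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class WordCountPhases {
    // Map: one input record in, zero or more (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : record.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle and Sort: group the values of equal keys, ordered by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: collapse each key's list of values into a single count.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    // Input phase + Map + Shuffle and Sort + Reduce, chained together.
    static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String record : records) {
            intermediate.addAll(map(record));
        }
        return reduce(shuffle(intermediate));
    }

    public static void main(String[] args) {
        // prints {bear=2, car=3, deer=2, river=2}
        System.out.println(run(List.of("deer bear river", "car car river", "deer car bear")));
    }
}
```

In a real cluster the map calls run on many machines and the shuffle moves data across the network; the sketch keeps the same data flow but none of the distribution.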

Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is roughly 5,800 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.

Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.

Count − Generates a token counter per word.

Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
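
The four steps above can be approximated with a short Java stream pipeline. This is only a sketch under stated assumptions: the stop-word list, the sample tweets, and the TweetPipeline class are invented for the example, and nothing here reflects how Twitter actually implements the job. A JDK 9 or newer is assumed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

class TweetPipeline {
    // Hypothetical list of unwanted words for the Filter step.
    static final Set<String> STOP_WORDS = Set.of("a", "the", "is", "to", "and");

    static Map<String, Long> countWords(List<String> tweets) {
        return tweets.stream()
                .flatMap(t -> Arrays.stream(t.toLowerCase().split("\\W+"))) // Tokenize
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))        // Filter
                .collect(Collectors.groupingBy(w -> w, TreeMap::new,
                        Collectors.counting()));                            // Count and Aggregate
    }

    public static void main(String[] args) {
        List<String> tweets = List.of(
                "MapReduce is a programming model",
                "The model scales to many machines",
                "MapReduce and the shuffle step");
        countWords(tweets).forEach((word, n) -> System.out.println(word + " -> " + n));
    }
}
```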


Apache Maven : Fast Automated Build and Deployment in Development

Apache Maven : Automated Build and Deployment

Maven is a project management and comprehension tool that provides developers with a complete build lifecycle framework. A development team can automate a project's build infrastructure in almost no time, because Maven uses a standard directory layout and a default build lifecycle.

In an environment with multiple development teams, Maven can establish a standard way of working in a very short time. Because most project setups are simple and reusable, Maven makes a developer's life easier when setting up reports, checks, builds, and test automation.

Maven provides developers ways to manage the following:

  • Builds
  • Documentation
  • Reporting
  • Dependencies
  • SCMs
  • Releases
  • Distribution
  • Mailing list

To summarize, Maven simplifies and standardizes the project build process. It handles compilation, distribution, documentation, team collaboration and other tasks seamlessly. Maven increases reusability and takes care of most of build related tasks.

Maven History

Maven was originally designed to simplify the build process in the Jakarta Turbine project. There were several projects, each with slightly different Ant build files, and JARs were checked into CVS.

The Apache group then developed Maven, which can build multiple projects together, publish project information, deploy projects, share JARs across several projects, and help teams collaborate.

Maven Objective

Maven's primary goal is to provide developers with:

  • A comprehensive model for projects that is reusable, maintainable, and easier to comprehend.
  • Plugins or tools that interact with this declarative model.

The structure and contents of a Maven project are declared in an XML file, pom.xml, referred to as the Project Object Model (POM), which is the fundamental unit of the entire Maven system. Refer to the Maven POM section for more detail.

Convention over Configuration

Maven uses Convention over Configuration, which means developers are not required to create the build process themselves.

Developers do not have to specify each and every configuration detail. Maven provides sensible default behavior for projects. When a Maven project is created, Maven creates a default project structure. The developer is only required to place files accordingly and need not define any configuration in pom.xml.

As an example, the following table shows the default values for project source code files, resource files, and other configurations, assuming ${basedir} denotes the project location:

Item | Default
source code | ${basedir}/src/main/java
resources | ${basedir}/src/main/resources
Tests | ${basedir}/src/test
distributable JAR | ${basedir}/target
Compiled byte code | ${basedir}/target/classes
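
To illustrate what "convention" means here, the following sketch resolves the table's default locations against a base directory using plain Java. It does not call Maven; MavenDefaults and the "my-app" directory are hypothetical names used only for the example.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class MavenDefaults {
    // Resolve the conventional locations from the table against a project base directory.
    static Map<String, Path> layout(Path basedir) {
        Map<String, Path> dirs = new LinkedHashMap<>();
        dirs.put("source code", basedir.resolve("src/main/java"));
        dirs.put("resources", basedir.resolve("src/main/resources"));
        dirs.put("tests", basedir.resolve("src/test"));
        dirs.put("distributable JAR", basedir.resolve("target"));
        dirs.put("compiled byte code", basedir.resolve("target/classes"));
        return dirs;
    }

    public static void main(String[] args) {
        // "my-app" is a hypothetical project directory.
        layout(Paths.get("my-app"))
                .forEach((item, dir) -> System.out.println(item + " -> " + dir));
    }
}
```

Because every project follows the same layout, a tool only needs the base directory to find everything else, which is why pom.xml can stay short.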

To build the project, Maven lets developers specify life-cycle goals and project dependencies (which rely on Maven plugin capabilities and on its default conventions). Much of the project management and build-related work is handled by Maven plugins.


Configure and Execute Test Cases Using Junit Framework

Local Environment Setup

JUnit is a framework for Java, so the very first requirement is to have the JDK installed on your machine.

System Requirement

JDK | 1.5 or above
Memory | No minimum requirement
Disk Space | No minimum requirement
Operating System | No minimum requirement

Step 1: Verify the Java installation on your machine

Open a console and execute the following java command.

OS | Task | Command
Windows | Open Command Console | c:\> java -version
Linux | Open Command Terminal | $ java -version
Mac | Open Terminal | machine:~ joseph$ java -version

Let's verify the output for all the operating systems:

OS | Output
Windows | java version "1.6.0_21"
        | Java(TM) SE Runtime Environment (build 1.6.0_21-b07)
        | Java HotSpot(TM) Client VM (build 17.0-b17, mixed mode, sharing)
Linux | java version "1.6.0_21"
      | Java(TM) SE Runtime Environment (build 1.6.0_21-b07)
      | Java HotSpot(TM) Client VM (build 17.0-b17, mixed mode, sharing)
Mac | java version "1.6.0_21"
    | Java(TM) SE Runtime Environment (build 1.6.0_21-b07)
    | Java HotSpot(TM) 64-Bit Server VM (build 17.0-b17, mixed mode, sharing)

If you do not have Java installed, install the Java Software Development Kit (SDK) from http://www.oracle.com/technetwork/java/javase/downloads/index.html. This tutorial assumes Java 1.6.0_21 as the installed version.

Step 2: Set JAVA environment

Set the JAVA_HOME environment variable to point to the base directory where Java is installed on your machine. For example:

OS | Setting
Windows | Set the environment variable JAVA_HOME to C:\Program Files\Java\jdk1.6.0_21
Linux | export JAVA_HOME=/usr/local/java-current
Mac | export JAVA_HOME=/Library/Java/Home

Append the Java compiler location to the system path.

OS | Setting
Windows | Append the string ;C:\Program Files\Java\jdk1.6.0_21\bin to the end of the system variable Path.
Linux | export PATH=$PATH:$JAVA_HOME/bin/
Mac | Not required

Verify the Java installation using the java -version command explained above.

Step 3: Download Junit archive

Download the latest version of the JUnit jar file from http://www.junit.org. At the time of writing this tutorial, I downloaded Junit-4.10.jar and copied it into the C:\JUNIT folder.

OS | Archive name
Windows | junit4.10.jar
Linux | junit4.10.jar
Mac | junit4.10.jar

Step 4: Set JUnit environment

Set the JUNIT_HOME environment variable to point to the base directory where the JUnit jar is stored on your machine. Assuming we've stored junit4.10.jar in the JUNIT folder, the settings for the various operating systems are as follows.

OS | Setting
Windows | Set the environment variable JUNIT_HOME to C:\JUNIT
Linux | export JUNIT_HOME=/usr/local/JUNIT
Mac | export JUNIT_HOME=/Library/JUNIT

Step 5: Set CLASSPATH variable

Set the CLASSPATH environment variable to point to the JUnit jar location. Assuming we've stored junit4.10.jar in the JUNIT folder, the settings for the various operating systems are as follows.

OS | Setting
Windows | Set the environment variable CLASSPATH to %CLASSPATH%;%JUNIT_HOME%\junit4.10.jar;.;
Linux | export CLASSPATH=$CLASSPATH:$JUNIT_HOME/junit4.10.jar:.
Mac | export CLASSPATH=$CLASSPATH:$JUNIT_HOME/junit4.10.jar:.
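
Before writing a test, it can help to confirm that the JVM actually sees the settings from the previous steps. The small program below is one possible check, not part of JUnit: ClasspathCheck is a hypothetical helper that prints the Java version and JAVA_HOME and reports whether the org.junit.Test class can be loaded from the current CLASSPATH. It assumes a JDK newer than the 1.6 release used in this tutorial, because it relies on multi-catch.

```java
class ClasspathCheck {
    // Returns true when the named class can be loaded from the current CLASSPATH.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("JAVA_HOME    = " + System.getenv("JAVA_HOME"));
        System.out.println("JUnit found  = " + isAvailable("org.junit.Test"));
    }
}
```

If the last line reports false, the CLASSPATH value from Step 5 is the first thing to recheck.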

Step 6: Test JUnit Setup

Create a Java class file named TestJunit.java in C:\JUNIT_WORKSPACE

import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class TestJunit {
    // @Test marks the method so that the JUnit runner picks it up.
    @Test
    public void testAdd() {
        String str = "Junit is working fine";
        assertEquals("Junit is working fine", str);
    }
}

Create a Java class file named TestRunner.java in C:\JUNIT_WORKSPACE to execute the test case(s)

import org.junit.runner.JUnitCore;
import org.junit.runner.Result;
import org.junit.runner.notification.Failure;

public class TestRunner {
    public static void main(String[] args) {
        Result result = JUnitCore.runClasses(TestJunit.class);
        // Print every failure, then the overall outcome.
        for (Failure failure : result.getFailures()) {
            System.out.println(failure.toString());
        }
        System.out.println(result.wasSuccessful());
    }
}

Step 7: Verify the Result

Compile the classes using the javac compiler as follows:

C:\JUNIT_WORKSPACE>javac TestJunit.java TestRunner.java

Now run the Test Runner to see the result:

C:\JUNIT_WORKSPACE>java TestRunner
Verify the output.



Connect and Fetch Data from Cassandra Using Java Code

1 Connector Class :

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.Session;

import static java.lang.System.out;

/**
 * Class used for connecting to Cassandra database.
 */
public class CassandraConnector {
    /** Cassandra Cluster. */
    private Cluster cluster;

    /** Cassandra Session. */
    private Session session;

    /**
     * Connect to Cassandra Cluster specified by provided node IP
     * address and port number.
     *
     * @param node Cluster node IP address.
     * @param port Port of cluster host.
     */
    public void connect(final String node, final int port) {
        this.cluster = Cluster.builder().addContactPoint(node).withPort(port).build();
        final Metadata metadata = cluster.getMetadata();
        out.printf("Connected to cluster: %s\n", metadata.getClusterName());
        for (final Host host : metadata.getAllHosts()) {
            out.printf("Datacenter: %s; Host: %s; Rack: %s\n",
                host.getDatacenter(), host.getAddress(), host.getRack());
        }
        session = cluster.connect();
    }

    /**
     * Provide my Session.
     *
     * @return My session.
     */
    public Session getSession() {
        return this.session;
    }

    /** Close cluster. */
    public void close() {
        cluster.close();
    }

    /**
     * Main function for demonstrating connecting to Cassandra with host and port.
     *
     * @param args Command-line arguments; first argument, if provided, is the
     *             host and second argument, if provided, is the port.
     */
    public static void main(final String[] args) {
        final CassandraConnector client = new CassandraConnector();
        final String ipAddress = args.length > 0 ? args[0] : "localhost";
        // The DataStax driver speaks the native protocol (default port 9042);
        // 9160 is the legacy Thrift port and will not work here.
        final int port = args.length > 1 ? Integer.parseInt(args[1]) : 9042;
        out.println("Connecting to IP Address " + ipAddress + ":" + port + "...");
        client.connect(ipAddress, port);
        client.close();
    }
}


2 Client Class

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassendraClient {
    private Cluster cluster;
    private Session session;

    public void connect(String node) {
        // Connect to the given contact point and open the keyspace.
        cluster = Cluster.builder().addContactPoint(node).build();
        session = cluster.connect("appsonekeyspace");

        // Fetch and print every row of the users table.
        ResultSet results = session.execute("SELECT * FROM users");
        for (Row row : results) {
            System.out.format("%s %s\n", row.getString("user_name"), row.getString("password"));
        }
    }

    public void close() {
        cluster.close();
    }

    public static void main(String[] args) {
        CassendraClient client = new CassendraClient();
        // The host comes from the first argument; "localhost" is an assumed default.
        client.connect(args.length > 0 ? args[0] : "localhost");
        client.close();
    }
}


Simple Queries to Execute in Apache Cassandra

Create KeySpace 1:

CREATE KEYSPACE appsonekeyspace WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

Use KeySpace 2:

USE appsonekeyspace;

Create table in Keyspace 3:

CREATE TABLE users (
  user_id int PRIMARY KEY,
  user_name text,
  password text
);

Insert Some Values into Table 4:

INSERT INTO users (user_id, user_name, password) VALUES (1745, 'Ashoka', 'test');
INSERT INTO users (user_id, user_name, password) VALUES (1744, 'Murali', 'demo');
INSERT INTO users (user_id, user_name, password) VALUES (1746, 'Bhat', 'sample');

Query Simple SELECT 5:

SELECT * FROM users;


Apache Cassandra Installation and Configuration Guide

Installing Cassandra Locally

This document aims to provide a few easy-to-follow steps that take the first-time user from installation, to running a single-node Cassandra, to an overview of configuring a multinode cluster. Cassandra is meant to run on a cluster of nodes, but will run equally well on a single machine. This is a handy way of getting familiar with the software while avoiding the complexities of a larger system.

Step 0: Prerequisites and Connecting to the Community

Cassandra requires the most stable version of Java 7 or 8 you can deploy, preferably the Oracle/Sun JVM. Cassandra also runs on OpenJDK and the IBM JVM. (It will NOT run on JRockit, which is only compatible with Java 6.)

The best way to ensure you always have up to date information on the project, releases, stability, bugs, and features is to subscribe to the users mailing list (subscription required) and participate in the #cassandra channel on IRC.

Step 1: Download Cassandra

Download links for the latest stable release can always be found on the website.
Users of Debian or Debian-based derivatives can install the latest stable release in package form, see DebianPackaging for details.
Users of RPM-based distributions can get packages from Datastax.
If you are interested in building Cassandra from source, please refer to How to Build page.
For more details about misc builds, please refer to Cassandra versions and builds page.

Step 2: Basic Configuration

The Cassandra configuration files can be found in the conf directory of binary and source distributions. If you have installed Cassandra from a deb or rpm package, the configuration files will be located in /etc/cassandra.

Step 2.1: Directories Used by Cassandra

If you've installed Cassandra with a deb or rpm package, the directories that Cassandra will use should already be created and have the correct permissions. Otherwise, you will want to check the following config settings in conf/cassandra.yaml: data_file_directories (/var/lib/cassandra/data), commitlog_directory (/var/lib/cassandra/commitlog), and saved_caches_directory (/var/lib/cassandra/saved_caches). Make sure these directories exist and can be written to.

By default, Cassandra will write its logs in /var/log/cassandra/. Make sure this directory exists and is writeable, or change this line in conf/log4j-server.properties:

log4j.appender.R.File=/var/log/cassandra/system.log

Note that in Cassandra 2.1+, the logger in use is logback, so change the logging directory in your conf/logback.xml file instead, such as:

<file>/var/log/cassandra/system.log</file>

JVM-level settings such as heap size can be set in conf/cassandra-env.sh.

Step 3: Start Cassandra

And now for the moment of truth: start up Cassandra by invoking 'bin/cassandra -f' from the command line. The service should start in the foreground and log gratuitously to the console. Assuming you don't see messages with scary words like "error" or "fatal", or anything that looks like a Java stack trace, everything should be working.

Press “Control-C” to stop Cassandra.

If you start up Cassandra without the “-f” option, it will run in the background. You can stop the process by killing it, using ‘pkill -f CassandraDaemon’, for example.

Note: Users of recent Linux distributions and Mac OS X Snow Leopard should be able to start up Cassandra simply by untarring and invoking bin/cassandra -f. Since Cassandra 2.1, the tar.gz download has shipped with the log and data directories defaulting to the Cassandra directory. Earlier versions defaulted to /var/log/cassandra and /var/lib/cassandra/, so with those versions it is necessary either to start Cassandra with root privileges or to change conf/cassandra.yaml to use a directory owned by the current user. Snow Leopard ships with Java 1.6.0 and does not require changing the JAVA_HOME environment variable or adding any directory to your PATH. On Linux, just make sure you have a working Java JDK package installed, such as openjdk-6-jdk on Ubuntu Lucid Lynx.

Step 4: Using cqlsh

bin/cqlsh is an interactive command line interface for Cassandra. cqlsh allows you to execute CQL (Cassandra Query Language) statements against Cassandra. Using CQL, you can define a schema, insert data, execute queries. Run the following command to connect to your local Cassandra instance with cqlsh:

$ bin/cqlsh
You should see the following prompt, if successful:

Connected to Test Cluster at localhost:9160.
[cqlsh 2.3.0 | Cassandra 1.2.2 | CQL spec 3.0.0 | Thrift protocol 19.35.0]
Use HELP for help.
For clarity, we will omit the cqlsh prompt in the following examples.

You can access the online help with ‘help;’ command. Commands are terminated with a semicolon (‘;’) in cqlsh.

First, create a keyspace, which is a namespace of tables.

CREATE KEYSPACE mykeyspace
WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };

Second, switch to the new keyspace:

USE mykeyspace;

Third, create a users table:

CREATE TABLE users (
  user_id int PRIMARY KEY,
  fname text,
  lname text
);

Now you can store data into users:

INSERT INTO users (user_id, fname, lname)
VALUES (1745, 'john', 'smith');
INSERT INTO users (user_id, fname, lname)
VALUES (1744, 'john', 'doe');
INSERT INTO users (user_id, fname, lname)
VALUES (1746, 'john', 'smith');

Now let's fetch the data you inserted:

SELECT * FROM users;

You should see output reflecting your new rows:

user_id | fname | lname
1745 | john | smith
1744 | john | doe
1746 | john | smith
You can retrieve data about users whose last name is smith by creating an index, then querying the table as follows:

CREATE INDEX ON users (lname);

SELECT * FROM users WHERE lname = 'smith';

user_id | fname | lname
1745 | john | smith
1746 | john | smith
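
Conceptually, the index lets Cassandra go from a last name to the matching rows without scanning every row. The Java sketch below illustrates that idea with two in-memory maps. It is a simplified model, not how Cassandra stores secondary indexes; LastNameIndex and its methods are hypothetical names, and a JDK 9 or newer is assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class LastNameIndex {
    // Primary storage: user_id -> {fname, lname}, like the users table.
    private final Map<Integer, String[]> users = new TreeMap<>();
    // Index: lname -> user_ids, a rough stand-in for CREATE INDEX ON users (lname).
    private final Map<String, List<Integer>> byLastName = new HashMap<>();

    void insert(int userId, String fname, String lname) {
        String[] old = users.put(userId, new String[] {fname, lname});
        if (old != null) {
            // Re-inserting an existing user_id overwrites the row, so drop the stale index entry.
            byLastName.get(old[1]).remove(Integer.valueOf(userId));
        }
        byLastName.computeIfAbsent(lname, k -> new ArrayList<>()).add(userId);
    }

    // Look up matching user_ids through the index instead of scanning all rows.
    List<Integer> findByLastName(String lname) {
        return byLastName.getOrDefault(lname, List.of());
    }

    public static void main(String[] args) {
        LastNameIndex table = new LastNameIndex();
        table.insert(1745, "john", "smith");
        table.insert(1744, "john", "doe");
        table.insert(1746, "john", "smith");
        System.out.println(table.findByLastName("smith")); // prints [1745, 1746]
    }
}
```
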
Write your Application

To connect to Cassandra, you’ll need a database driver for your language of choice. DataStax sponsors development of CQL drivers at https://github.com/datastax. A full list of CQL drivers can be found on the ClientOptions page.

When deciding how to design your schema and layout your data, it will be helpful to review the resources on how to DataModel.

You may also want to read the full CQL documentation.

Configuring Multinode Clusters

You now have a single working Cassandra node. It is a Cassandra cluster with only one node. By adding more nodes, you can make it a multinode cluster.

Setting up a Cassandra cluster is almost as simple as repeating the above procedures for each node in your cluster. There are a few minor exceptions though.

Cassandra nodes exchange information about one another using a mechanism called Gossip, but to get the ball rolling a newly started node needs to know of at least one other node; this is called a Seed. It's customary to pick a small number of relatively stable nodes to serve as your seeds, but there is no hard-and-fast rule here. Do make sure that each seed also knows of at least one other. Remember, the goal is to avoid a chicken-and-egg scenario and provide an avenue for all nodes in the cluster to discover one another.
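
The role of the seed can be seen in a small simulation. The sketch below is a deliberately simplified, deterministic model and not Cassandra's Gossip protocol, which is randomized and periodic. Each node starts out knowing only itself and the seeds, and nodes repeatedly merge their membership views with every peer they know until nothing changes. SeedDiscovery and the node names are hypothetical, and a JDK 9 or newer is assumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SeedDiscovery {
    // Runs exchange rounds until no node learns anything new; returns each node's view.
    static Map<String, Set<String>> converge(List<String> nodes, List<String> seeds) {
        Map<String, Set<String>> known = new TreeMap<>();
        for (String node : nodes) {
            Set<String> view = new TreeSet<>(seeds); // a new node only knows the seeds...
            view.add(node);                          // ...and itself
            known.put(node, view);
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String node : nodes) {
                // Iterate over a snapshot because the view grows during the exchange.
                for (String peer : new ArrayList<>(known.get(node))) {
                    if (!known.containsKey(peer)) {
                        continue; // a seed that is not part of this simulated cluster
                    }
                    // After an exchange, both sides hold the union of their views.
                    Set<String> union = new TreeSet<>(known.get(node));
                    union.addAll(known.get(peer));
                    changed |= known.get(node).addAll(union);
                    changed |= known.get(peer).addAll(union);
                }
            }
        }
        return known;
    }

    public static void main(String[] args) {
        List<String> nodes = List.of("node1", "node2", "node3", "node4");
        System.out.println("with seed:    " + converge(nodes, List.of("node1")));
        System.out.println("without seed: " + converge(nodes, List.of()));
    }
}
```

With node1 as the seed, every node ends up knowing all four members. With no seed, each node knows only itself, which is the chicken-and-egg situation the text describes.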

In addition to seeds, you'll also need to configure the IP interfaces to listen on for Gossip and CQL (listen_address and rpc_address respectively). Use a listen_address that will be reachable from the listen_address used on all other nodes, and an rpc_address that will be accessible to clients.

Once everything is configured and the nodes are running, use the bin/nodetool status utility to verify a properly connected cluster. For example:

$ bin/nodetool -host -p 7199 status
Datacenter: datacenter1
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 30.99 KB 256 32.4% 92b20e08-9ddd-4f55-9173-8516e74d27f5 rack1
UN 31 KB 256 31.5% b9616658-c744-48fb-b64f-83f96b007d93 rack1
UN 30.96 KB 256 36.1% f7a08973-85bd-460f-8176-d6f9df8c23f4 rack1
Advanced cluster management is described in Operations.

If you don’t yet have access to hardware for a real Cassandra cluster, you can manage local clusters easily with ccm (Cassandra Cluster Manager).
