Telecom – 5G Revolution By Ericsson

5G is the fifth-generation mobile technology. It has many advanced features with the potential to solve many problems of our everyday life. It is beneficial for governments, as it can make governance easier; for students, as it can make advanced courses, classes, and materials available online; and for common people as well, as it can provide internet access everywhere. This tutorial is divided into various chapters and describes the 5G technology, its applications, challenges, etc., in detail.

Radio technologies have seen a rapid and multidirectional evolution since the launch of analogue cellular systems in the 1980s. Since then, digital wireless communication systems have consistently been on a mission to fulfil the growing needs of human beings (1G, …4G, and now 5G).

5G Technology

This article describes the 5G technology, emphasizing its salient features, technological design (architecture), advantages, shortcomings, challenges, and future scope.

Salient Features of 5G

5th Generation Mobile Network, or simply 5G, is the forthcoming revolution in mobile technology. Its features and usability go far beyond the expectations of a normal user. With its ultra-high speed, it has the potential to change the very meaning of cell-phone usability.


With a huge array of innovative features, your smartphone will be more comparable to a laptop. You can use a broadband internet connection; other significant features that fascinate people are more gaming options, wider multimedia options, connectivity everywhere, zero latency, faster response time, and high-quality sound and HD video that can be transferred to another cell phone without compromising the quality of the audio or video.

If we look back, we find that a new generation of mobile technology has arrived roughly every decade. Starting from the First Generation (1G) in the 1980s, through the Second Generation (2G) in the 1990s, the Third Generation (3G) in the 2000s, and the Fourth Generation (4G) in the 2010s, to the Fifth Generation (5G) now, we are advancing towards more and more sophisticated and smarter technology.


What is 5G Technology?

The 5G technology is expected to provide new (much wider than previous) frequency bands along with wider spectral bandwidth per frequency channel. The predecessor mobile technologies have already shown substantial increases in peak bitrate. So how is 5G different from the previous generation (especially 4G)? The answer is that it is not only the increase in bitrate that makes 5G distinct from 4G; 5G is also more advanced in terms of −

  • Highly increased peak bit rate
  • Larger data volume per unit area (i.e. higher system spectral efficiency)
  • Higher capacity, allowing more devices to connect concurrently and instantaneously
  • Lower battery consumption
  • Better connectivity irrespective of the geographic region you are in
  • Larger number of supported devices
  • Lower cost of infrastructure development
  • Higher reliability of communications

As researchers say, with wide-bandwidth radio channels able to support speeds of up to 10 Gbps, 5G technology will offer contiguous and consistent coverage − “wider area mobility in the true sense.”

The architecture of 5G is highly advanced; its network elements and the various terminals are characteristically upgraded to afford the new situation. Likewise, service providers can implement the advanced technology to adopt value-added services easily.

However, upgradability is based upon cognitive radio technology, which includes various significant features such as the ability of devices to identify their geographical location as well as weather, temperature, etc. Cognitive radio technology acts as a transceiver that can perceptively detect and respond to radio signals in its operating environment. Further, it promptly distinguishes changes in its environment and responds accordingly to provide uninterrupted, quality service.

Architecture of 5G

As shown in the following image, the system model of 5G is an entirely IP-based model designed for wireless and mobile networks.

5G Architecture

The system comprises a main user terminal and a number of independent and autonomous radio access technologies. Each of the radio technologies is considered an IP link to the outside internet world. The IP technology is designed to ensure sufficient control data for appropriate routing of IP packets related to certain application connections, i.e. sessions between client applications and servers somewhere on the Internet. Moreover, routing of packets should be fixed in accordance with the given policies of the user (as shown in the image given below).

5G Architecture

The Master Core Technology

As shown in the figure, the 5G MasterCore is the convergence point for the other technologies, which have their own impact on the existing wireless network. Interestingly, its design allows the MasterCore to operate in parallel multimode, including an all-IP network mode and a 5G network mode. In this mode (as shown in the image given below), it controls all network technologies of the RAN and Different Access Networks (DAT). Since the technology is compatible with and manages all the new (5G-based) deployments, it is more efficient, less complicated, and more powerful.

Master Core Technology

Surprisingly, any service mode can be opened under the 5G New Deployment Mode as World Combination Service Mode (WCSM). WCSM is a wonderful feature of this technology; for example, if a professor writes on a whiteboard in one country, it can be displayed on another whiteboard in any other part of the world, in addition to conversation and video. Further, new services can easily be added through parallel multimode service.

Normally, it is expected that the development and implementation of 5G technology will take about five more years from now (by 2020). But for it to become usable for common people in developing countries, it could take even longer.


Graph 1 − Timeline of all the previous generations of mobile technology.


Considering its multiple uses and various fashionable salient features, researchers anticipate that this technology will remain in use until the 2040s.

5G technology is adorned with many distinct features, which are useful for a wide range of people irrespective of their purposes (as shown in the image below).


Applications of 5G

Some of the significant applications are −

  • It will create a unified global standard for all.
  • Network availability will be everywhere and will enable people to use their computers and similar mobile devices anywhere, anytime.
  • Because of IPv6 technology, a visiting care-of mobile IP address will be assigned according to the connected network and geographical position.
  • It will make the whole world a real Wi-Fi zone.
  • Its cognitive radio technology will allow different versions of radio technologies to share the same spectrum efficiently.
  • It will enable people to avail radio signals at higher altitudes as well.

    The application of 5G is very much like the fulfilment of a dream; it is integrated with advanced features beyond the limits of previous technologies.


    Advanced Features

    In comparison to previous radio technologies, 5G offers the following advancements −

    • Super speed, i.e. 1 to 10 Gbps, is practically achievable.
    • Latency will be 1 millisecond (end-to-end round trip).
    • 1,000x bandwidth per unit area.
    • Feasibility to connect 10 to 100 times more devices.
    • Worldwide coverage.
    • About 90% reduction in network energy usage.
    • Battery life will be much longer.
    • The whole world will be a Wi-Fi zone.

    5G – Advantages & Disadvantages

    5th generation technology offers a wide range of features that are beneficial for all groups of people, including students, professionals (doctors, engineers, teachers, governing bodies, administrative bodies, etc.) and even the common man.

    Ericsson

    Important Advantages

    There are several advantages of 5G technology; some of them are shown in the Ericsson image above, and many others are described below −

    • High-resolution and bi-directional large-bandwidth shaping.
    • Technology to gather all networks on one platform.
    • More effective and efficient.
    • Technology to facilitate subscriber supervision tools for quick action.
    • Most likely will provide huge broadcasting capacity (in gigabits), supporting more than 60,000 connections.
    • Easily manageable with the previous generations.
    • Technologically sound to support heterogeneous services (including private networks).
    • Possible to provide uniform, uninterrupted, and consistent connectivity across the world.
    Some Other Advantages for the Common People

    • Parallel multiple services, such as knowing the weather and your location while talking to another person.
    • You can control your PC with your handset.
    • Education will become easier − a student sitting in any part of the world can attend a class.
    • Medical treatment will become easier and more frugal − a doctor can treat a patient located in a remote part of the world.
    • Monitoring will be easier − a governmental organization and investigating officers can monitor any part of the world, making it possible to reduce the crime rate.
    • Visualizing the universe, galaxies, and planets will be possible.
    • It will be possible to locate and search for missing persons.
    • Natural disasters, including tsunamis, earthquakes, etc., can be detected faster.

    Disadvantages of 5G Technology

    Though 5G technology is researched and conceptualized to solve all radio-signal problems and hardships of the mobile world, because of security concerns and the lack of technological advancement in many geographic regions, it has the following shortcomings −

    • The technology is still under development and research on its viability is ongoing.
    • The speed this technology claims seems difficult to achieve (it might be achieved in the future) because of incompetent technological support in most parts of the world.


    • Many of the old devices would not be compatible with 5G; hence, all of them would need to be replaced with new ones — an expensive deal.
    • Developing the infrastructure needs high cost.
    • Security and privacy issues are yet to be solved.

      Challenges are an inherent part of any new development; so, like all technologies, 5G also has big challenges to deal with. If we look at the past, i.e. the development of radio technology, we find very fast growth: from 1G to 5G, the journey is only about 40 years old (considering 1G in the 1980s and 5G in the 2020s). However, along this journey, the common challenges observed have been lack of infrastructure, research methodology, and cost.

      Still, there are dozens of countries using 2G and 3G technologies that do not even know about 4G. In such a situation, the most significant questions in everyone’s mind are −

      • How far will 5G be viable?
      • Will it be a technology only for some developed countries, or will developing countries also benefit from it?

      To understand these questions, the challenges of 5G are categorized into the following two headings −

      • Technological Challenges
      • Common Challenges

      Technological Challenges

      • Inter-cell Interference − This is one of the major technological issues that needs to be solved. There are variations in the size of traditional macro cells and the concurrent small cells, which will lead to interference.


      • Efficient Medium Access Control − In situations where dense deployment of access points and user terminals is required, user throughput will be low, latency will be high, and hotspots will not be competitive with cellular technology in providing high throughput. This needs to be researched properly to optimize the technology.
      • Traffic Management − In comparison to the traditional human-to-human traffic in cellular networks, a great number of Machine-to-Machine (M2M) devices in a cell may cause serious system challenges, i.e. radio access network (RAN) challenges, which will cause overload and congestion.

      Common Challenges

      • Multiple Services − Unlike other radio signal services, 5G would have a huge task to offer services to heterogeneous networks, technologies, and devices operating in different geographic regions. So, the challenge is of standardization to provide dynamic, universal, user-centric, and data-rich wireless services to fulfil the high expectation of people.


      • Infrastructure − Researchers are facing technological challenges of standardization and application of 5G services.
      • Communication, Navigation, & Sensing − These services largely depend upon the availability of radio spectrum, through which signals are transmitted. Though 5G technology has strong computational power to process the huge volume of data coming from different and distinct sources, it needs larger infrastructure support.
      • Security and Privacy − This is one of the most important challenges: 5G needs to ensure the protection of personal data. 5G will have to resolve the uncertainties related to security threats, including trust, privacy, and cybersecurity, which are growing across the globe.
      • Legislation of Cyberlaw − Cybercrime and other fraud may also increase with the high-speed and ubiquitous 5G technology. Therefore, legislation of cyberlaw is also an imperative issue, which is largely governmental and political (a national as well as international issue) in nature.

        Much research and discussion is going on across the world among technologists, researchers, academicians, vendors, operators, and governments about the innovations, implementation, viability, and security concerns of 5G.

        As proposed, loaded with multiple advanced features ranging from super-high-speed internet service to smooth ubiquitous service, 5G will unlock many of these problems. However, the question is: in a situation where the previous technologies (4G and 3G) are still being rolled out and in many parts yet to be started, what will be the future of 5G?

        Future Scope

        5th generation technology is designed to provide incredible and remarkable data capabilities, unhindered call volumes, and immense data broadcast within the latest mobile operating system. Hence, it is a more intelligent technology, which will interconnect the entire world without limits. Likewise, our world will have universal and uninterrupted access to information, communication, and entertainment that will open a new dimension to our lives and change our lifestyle meaningfully.

        Moreover, governments and regulators can use this technology as an opportunity for good governance and can create healthier environments, which will definitely encourage continued investment in 5G, the next-generation technology.


Telecom – LTE Technology [ Long Term Evolution ]


LTE stands for Long Term Evolution. It was started as a project in 2004 by the telecommunications standards body known as the Third Generation Partnership Project (3GPP). SAE (System Architecture Evolution) is the corresponding evolution of the GPRS/3G packet core network. The term LTE is typically used to represent both LTE and SAE.

LTE evolved from an earlier 3GPP system known as the Universal Mobile Telecommunication System (UMTS), which in turn evolved from the Global System for Mobile Communications (GSM). The related specifications are formally known as the evolved UMTS terrestrial radio access (E-UTRA) and the evolved UMTS terrestrial radio access network (E-UTRAN). The first version of LTE was documented in Release 8 of the 3GPP specifications.

A rapid increase in mobile data usage and the emergence of new applications such as MMOG (Massively Multiplayer Online Gaming), mobile TV, Web 2.0, and streaming content motivated the 3rd Generation Partnership Project (3GPP) to work on Long-Term Evolution (LTE) on the way towards fourth-generation mobile.

The main goal of LTE is to provide a high data rate, low latency and packet-optimized radio access technology supporting flexible bandwidth deployments. At the same time, its network architecture has been designed with the goal of supporting packet-switched traffic with seamless mobility and great quality of service.

LTE Evolution

Year Event
Mar 2000 Release 99 – UMTS/WCDMA
Mar 2002 Release 5 – HSDPA
Mar 2005 Release 6 – HSUPA
2007 Release 7 – DL MIMO, IMS (IP Multimedia Subsystem)
Nov 2004 Work started on LTE specification
Jan 2008 Specification finalized and approved with Release 8
2010 Targeted first deployment

Facts about LTE

  • LTE is the successor technology not only of UMTS but also of CDMA2000.
  • LTE is important because it will bring up to 50 times performance improvement and much better spectral efficiency to cellular networks.
  • LTE was introduced to get higher data rates: 300 Mbps peak downlink and 75 Mbps peak uplink. In a 20 MHz carrier, data rates beyond 300 Mbps can be achieved under very good signal conditions.
  • LTE is an ideal technology to support high data rates for services such as voice over IP (VoIP), streaming multimedia, videoconferencing or even a high-speed cellular modem.
  • LTE supports both Time Division Duplex (TDD) and Frequency Division Duplex (FDD) modes. In FDD, uplink and downlink transmissions use different frequencies, while in TDD both uplink and downlink use the same carrier and are separated in time.
  • LTE supports flexible carrier bandwidths, from 1.4 MHz up to 20 MHz, as well as both FDD and TDD. Which bandwidth is used depends on the frequency band and the amount of spectrum available to a network operator.
  • All LTE devices have to support Multiple Input Multiple Output (MIMO) transmissions, which allow the base station to transmit several data streams over the same carrier simultaneously.
  • All interfaces between network nodes in LTE are now IP based, including the backhaul connection to the radio base stations. This is a great simplification compared to earlier technologies that were initially based on E1/T1, ATM and frame relay links, most of them narrowband and expensive.
  • Quality of Service (QoS) mechanisms have been standardized on all interfaces to ensure that the requirement of voice calls for constant delay and bandwidth can still be met when capacity limits are reached.
  • LTE works with GSM/EDGE/UMTS systems utilizing existing 2G and 3G spectrum as well as new spectrum, and supports handover and roaming to existing mobile networks.

Advantages of LTE

  • High throughput: High data rates can be achieved in both downlink and uplink, resulting in high throughput.
  • Low latency: The time required to connect to the network is in the range of a few hundred milliseconds, and power-saving states can now be entered and exited very quickly.
  • FDD and TDD on the same platform: Both Frequency Division Duplex (FDD) and Time Division Duplex (TDD) schemes can be used on the same platform.
  • Superior end-user experience: Optimized signalling for connection establishment and other air-interface and mobility-management procedures has further improved the user experience, with latency reduced to 10 ms for better responsiveness.
  • Seamless connection: LTE also supports seamless connection to existing networks such as GSM, CDMA and WCDMA.
  • Plug and play: The user does not have to manually install drivers for the device. Instead, the system automatically recognizes the device, loads new drivers for the hardware if needed, and begins to work with the newly connected device.
  • Simple architecture: The simple architecture results in low operating expenditure (OPEX).

LTE – QoS

LTE architecture supports hard QoS, with end-to-end quality of service and guaranteed bit rate (GBR) for radio bearers. Just as Ethernet and the internet have different types of QoS, various levels of QoS can be applied to LTE traffic for different applications. Because the LTE MAC is fully scheduled, QoS is a natural fit.

Evolved Packet System (EPS) bearers have a one-to-one correspondence with RLC radio bearers and provide support for Traffic Flow Templates (TFTs). There are four types of EPS bearers:

  • GBR Bearer − resources permanently allocated by admission control
  • Non-GBR Bearer − no admission control
  • Dedicated Bearer − associated with a specific TFT (GBR or non-GBR)
  • Default Bearer − non-GBR, catch-all for unassigned traffic
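
As a rough illustrative model of this bearer taxonomy (not a 3GPP API; the class, its fields and the sample TFT filter string are invented for the example), the four categories can be sketched as follows:

```java
// Illustrative model of the EPS bearer categories described above (not a 3GPP API).
public class EpsBearerExample {
    enum BearerKind { DEFAULT, DEDICATED }

    static class EpsBearer {
        final BearerKind kind;
        final boolean guaranteedBitRate;  // GBR: resources allocated by admission control; non-GBR: none
        final String trafficFlowTemplate; // TFT; null for the default (catch-all) bearer

        EpsBearer(BearerKind kind, boolean gbr, String tft) {
            this.kind = kind;
            this.guaranteedBitRate = gbr;
            this.trafficFlowTemplate = tft;
        }
    }

    public static void main(String[] args) {
        // Default bearer: non-GBR, no TFT, catches all unassigned traffic.
        EpsBearer defaultBearer = new EpsBearer(BearerKind.DEFAULT, false, null);
        // Dedicated bearer: tied to a specific TFT, and may be GBR or non-GBR.
        EpsBearer voiceBearer = new EpsBearer(BearerKind.DEDICATED, true, "udp dst-port 5060"); // hypothetical filter
        System.out.println(defaultBearer.kind + " GBR=" + defaultBearer.guaranteedBitRate);
        System.out.println(voiceBearer.kind + " GBR=" + voiceBearer.guaranteedBitRate
                + " TFT=" + voiceBearer.trafficFlowTemplate);
    }
}
```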
    This section summarizes the basic parameters of LTE:

    Parameters Description
    Frequency range UMTS FDD bands and TDD bands defined in 36.101 (v860) Table 5.5.1, given below
    Duplexing FDD, TDD, half-duplex FDD
    Channel coding Turbo code
    Mobility 350 km/h
    Channel Bandwidth (MHz) 1.4, 3, 5, 10, 15, 20
    Transmission Bandwidth Configuration NRB (1 resource block = 180 kHz in a 1 ms TTI) 6, 15, 25, 50, 75, 100
    Modulation Schemes UL: QPSK, 16QAM, 64QAM (optional); DL: QPSK, 16QAM, 64QAM
    Multiple Access Schemes UL: SC-FDMA (Single Carrier Frequency Division Multiple Access), supports 50 Mbps+ (20 MHz spectrum); DL: OFDMA (Orthogonal Frequency Division Multiple Access), supports 100 Mbps+ (20 MHz spectrum)
    Multi-Antenna Technology UL: Multi-user collaborative MIMO; DL: TxAA, spatial multiplexing, CDD, max 4×4 array
    Peak data rate in LTE UL: 75 Mbps (20 MHz bandwidth); DL: 150 Mbps (UE Category 4, 2×2 MIMO, 20 MHz bandwidth); 300 Mbps (UE Category 5, 4×4 MIMO, 20 MHz bandwidth)
    MIMO (Multiple Input Multiple Output) UL: 1×2, 1×4; DL: 2×2, 4×2, 4×4
    Coverage 5–100 km, with slight degradation after 30 km
    QoS E2E QoS allowing prioritization of different classes of service
    Latency End-user latency < 10 ms

    E-UTRA Operating Bands

      Following is the table of E-UTRA operating bands, taken from LTE Specification 36.101 (v860) Table 5.5.1:

      E-UTRA Table 5.5.1

      The high-level network architecture of LTE comprises the following three main components:

    • The User Equipment (UE).
    • The Evolved UMTS Terrestrial Radio Access Network (E-UTRAN).
    • The Evolved Packet Core (EPC).

    The evolved packet core communicates with packet data networks in the outside world such as the internet, private corporate networks or the IP multimedia subsystem. The interfaces between the different parts of the system are denoted Uu, S1 and SGi as shown below:

    LTE Architecture

    The User Equipment (UE)

    The internal architecture of the user equipment for LTE is identical to the one used by UMTS and GSM, which is actually a Mobile Equipment (ME). The mobile equipment comprises the following important modules:

    • Mobile Termination (MT) : This handles all the communication functions.
    • Terminal Equipment (TE) : This terminates the data streams.
    • Universal Integrated Circuit Card (UICC) : This is also known as the SIM card for LTE equipment. It runs an application known as the Universal Subscriber Identity Module (USIM).

    A USIM stores user-specific data, very similar to a 3G SIM card. It keeps information such as the user’s phone number, home network identity, and security keys.

    The E-UTRAN (The access network)

    The architecture of evolved UMTS Terrestrial Radio Access Network (E-UTRAN) has been illustrated below.

    LTE E-UTRAN

    The E-UTRAN handles the radio communications between the mobile and the evolved packet core and has just one component, the evolved base station, called eNodeB or eNB. Each eNB is a base station that controls the mobiles in one or more cells. The base station that is communicating with a mobile is known as its serving eNB.

    An LTE mobile communicates with just one base station and one cell at a time. The eNB supports the following two main functions:

    • The eNB sends and receives radio transmissions to and from all its mobiles using the analogue and digital signal processing functions of the LTE air interface.
    • The eNB controls the low-level operation of all its mobiles, by sending them signalling messages such as handover commands.

    Each eNB connects with the EPC by means of the S1 interface, and it can also be connected to nearby base stations by the X2 interface, which is mainly used for signalling and packet forwarding during handover.

    A home eNB (HeNB) is a base station that has been purchased by a user to provide femtocell coverage within the home. A home eNB belongs to a closed subscriber group (CSG) and can only be accessed by mobiles with a USIM that also belongs to the closed subscriber group.

    The Evolved Packet Core (EPC) (The core network)

    The architecture of the Evolved Packet Core (EPC) is illustrated below. A few components have been omitted from the diagram to keep it simple, such as the Earthquake and Tsunami Warning System (ETWS), the Equipment Identity Register (EIR) and the Policy Control and Charging Rules Function (PCRF).

    LTE EPC

    Below is a brief description of each of the components shown in the above architecture:

    • The Home Subscriber Server (HSS) component has been carried forward from UMTS and GSM and is a central database that contains information about all the network operator’s subscribers.
    • The Packet Data Network (PDN) Gateway (P-GW) communicates with the outside world, i.e. packet data networks (PDNs), using the SGi interface. Each packet data network is identified by an access point name (APN). The PDN gateway plays the same role as the GPRS support nodes (GGSN and SGSN) do in UMTS and GSM.
    • The serving gateway (S-GW) acts as a router, and forwards data between the base station and the PDN gateway.
    • The mobility management entity (MME) controls the high-level operation of the mobile by means of signalling messages and Home Subscriber Server (HSS).
    • The Policy Control and Charging Rules Function (PCRF) is a component which is not shown in the above diagram but it is responsible for policy control decision-making, as well as for controlling the flow-based charging functionalities in the Policy Control Enforcement Function (PCEF), which resides in the P-GW.

    The interface between the serving and PDN gateways is known as S5/S8. This has two slightly different implementations, namely S5 if the two devices are in the same network, and S8 if they are in different networks.

    Functional split between the E-UTRAN and the EPC

    Following diagram shows the functional split between the E-UTRAN and the EPC for an LTE network:

    LTE E-UTRAN and EPC

    2G/3G Versus LTE

    The following table compares various important network elements and signalling protocols used in 2G/3G and LTE.

    2G/3G LTE
    GERAN and UTRAN E-UTRAN
    SGSN/PDSN-FA S-GW
    GGSN/PDSN-HA PDN-GW
    HLR/AAA HSS
    VLR MME
    SS7-MAP/ANSI-41/RADIUS Diameter
    GTPc-v0 and v1 GTPc-v2
    MIP PMIP

    A network run by one operator in one country is known as a Public Land Mobile Network (PLMN). When a subscribed user uses his own operator’s PLMN, it is called the Home PLMN. Roaming allows users to move outside their home network and use the resources of another operator’s network; this other network is called the Visited PLMN.

    A roaming user is connected to the E-UTRAN, MME and S-GW of the visited LTE network. However, LTE/SAE allows the P-GW of either the visited or the home network to be used, as shown in below:

    LTE Roaming Architecture

    The home network’s P-GW allows the user to access the home operator’s services even while in a visited network. A P-GW in the visited network allows a “local breakout” to the Internet in the visited network.

    As noted earlier, the interface between the serving and PDN gateways is known as S5/S8: S5 if the two devices are in the same network, and S8 if they are in different networks. For mobiles that are not roaming, the serving and PDN gateways can be integrated into a single device, so that the S5/S8 interface vanishes altogether.

    LTE Roaming Charging

    The charging mechanisms required to support 4G roaming are considerably more complex than in a 3G environment. A few words about both prepaid and postpaid charging for LTE roaming are given below:

    • Prepaid Charging – The CAMEL standard, which enables prepaid services in 3G, is not supported in LTE; therefore, prepaid customer information must be routed back to the home network as opposed to being handled by the local visited network. As a result, operators must rely on new accounting flows to access prepaid customer data, such as through their P-Gateways in both IMS and non-IMS environments or via their CSCF in an IMS environment.
    • Postpaid Charging – Postpaid data-usage charging works the same in LTE as it does in 3G, using versions TAP 3.11 or 3.12. With local breakout of IMS services, TAP 3.12 is required.

    In local breakout scenarios, operators do not have the same visibility into subscriber activities as they do in home-routing scenarios, because subscriber data sessions are kept within the visited network. Therefore, for the home operator to capture real-time information on both pre- and postpaid customers, it must establish a Diameter interface between its charging systems and the visited network’s P-Gateway.

    In the case of local breakout of IMS services, the visited network creates call detail records (CDRs) from the S-Gateway(s); however, these CDRs do not contain all of the information required to create a TAP 3.12 mobile session or messaging event record for the service usage. As a result, operators must correlate the core data network CDRs with the IMS CDRs to create TAP records.

    An LTE network area is divided into three different types of geographical areas explained below:

    S.N. Area and Description
    1 The MME pool areas

    This is an area through which the mobile can move without a change of serving MME. Every MME pool area is controlled by one or more MMEs on the network.

    2 The S-GW service areas

    This is an area served by one or more serving gateways S-GW, through which the mobile can move without a change of serving gateway.

    3 The Tracking areas

    The MME pool areas and the S-GW service areas are both made from smaller, non-overlapping units known as tracking areas (TAs). They are similar to the location and routing areas of UMTS and GSM and are used to track the locations of mobiles that are in standby mode.

    Thus an LTE network will comprise many MME pool areas, many S-GW service areas and lots of tracking areas.

    The Network IDs

    The network itself will be identified using Public Land Mobile Network Identity (PLMN-ID) which will have a three digit mobile country code (MCC) and a two or three digit mobile network code (MNC). For example, the Mobile Country Code for the UK is 234, while Vodafone’s UK network uses a Mobile Network Code of 15.

    LTE Network ID

    The MME IDs

    Each MME has three main identities. An MME code (MMEC) uniquely identifies the MME within all the pool areas. A group of MMEs is assigned an MME Group Identity (MMEGI), which combines with the MMEC to make the MME identifier (MMEI). An MMEI uniquely identifies the MME within a particular network.

    LTE MMEI

    If we combine the PLMN-ID with the MMEI, we arrive at a Globally Unique MME Identifier (GUMMEI), which identifies an MME anywhere in the world:

    LTE GUMMEI

    The Tracking Area IDs

    Each tracking area has two main identities. The tracking area code (TAC) identifies a tracking area within a particular network, and combining this with the PLMN-ID gives the Globally Unique Tracking Area Identity (TAI).

    LTE TAI

    The Cell IDs

    Each cell in the network has three types of identity. The E-UTRAN cell identity (ECI) identifies a cell within a particular network, while the E-UTRAN cell global identifier (ECGI) identifies a cell anywhere in the world.

    The physical cell identity, a number from 0 to 503, distinguishes a cell from its immediate neighbours.

    The Mobile Equipment ID

    The international mobile equipment identity (IMEI) is a unique identity for the mobile equipment and the International Mobile Subscriber Identity (IMSI) is a unique identity for the UICC and the USIM.

    The M-temporary mobile subscriber identity (M-TMSI) identifies a mobile to its serving MME. Adding the MME code to the M-TMSI gives the S-temporary mobile subscriber identity (S-TMSI), which identifies the mobile within an MME pool area.

    LTE S-TMSI

    Finally, adding the MME group identity and the PLMN identity to the S-TMSI results in the Globally Unique Temporary Identity (GUTI).

    LTE GUTI
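
To make the composition of these identifiers concrete, here is a minimal sketch (not a 3GPP API; the MMEGI, MMEC and M-TMSI values are invented for illustration, while MCC 234 / MNC 15 are the UK/Vodafone examples quoted above):

```java
// Minimal sketch of how the LTE identities described above nest inside one another.
// All numeric values are hypothetical examples.
public class LteIdentities {
    public static void main(String[] args) {
        String mcc = "234";          // mobile country code (UK, as in the example above)
        String mnc = "15";           // mobile network code (Vodafone UK, as in the example above)
        String plmnId = mcc + mnc;   // PLMN-ID = MCC + MNC

        int mmegi = 0x8001;          // MME Group Identity (hypothetical)
        int mmec  = 0x2A;            // MME Code (hypothetical)
        String mmei   = String.format("%04X%02X", mmegi, mmec); // MMEI = MMEGI + MMEC
        String gummei = plmnId + mmei;                          // GUMMEI = PLMN-ID + MMEI

        long mTmsi = 0xC0FFEE01L;    // M-TMSI assigned by the serving MME (hypothetical)
        String sTmsi = String.format("%02X%08X", mmec, mTmsi);  // S-TMSI = MMEC + M-TMSI
        String guti  = plmnId + String.format("%04X", mmegi) + sTmsi; // GUTI = PLMN-ID + MMEGI + S-TMSI

        System.out.println("GUMMEI = " + gummei);
        System.out.println("S-TMSI = " + sTmsi);
        System.out.println("GUTI   = " + guti);
    }
}
```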

    The radio protocol architecture for LTE can be separated into the control plane architecture and the user plane architecture, as shown below:

    LTE Radio Protocol Architecture

    On the user plane side, the application creates data packets that are processed by protocols such as TCP, UDP and IP, while on the control plane, the radio resource control (RRC) protocol writes the signalling messages that are exchanged between the base station and the mobile. In both cases, the information is processed by the packet data convergence protocol (PDCP), the radio link control (RLC) protocol and the medium access control (MAC) protocol, before being passed to the physical layer for transmission.

    User Plane

    The user plane protocol stack between the e-Node B and UE consists of the following sub-layers:

    • PDCP (Packet Data Convergence Protocol)
    • RLC (radio Link Control)
    • Medium Access Control (MAC)

    On the user plane, packets in the core network (EPC) are encapsulated in a specific EPC protocol and tunneled between the P-GW and the eNodeB. Different tunneling protocols are used depending on the interface. GPRS Tunneling Protocol (GTP) is used on the S1 interface between the eNodeB and S-GW and on the S5/S8 interface between the S-GW and P-GW.

    LTE User Plane

    Packets received by a layer are called Service Data Units (SDUs), while the packet output of a layer is referred to as a Protocol Data Unit (PDU). IP packets on the user plane flow from the top layer to the bottom layer.

    Control Plane

    The control plane includes additionally the Radio Resource Control layer (RRC) which is responsible for configuring the lower layers.

    The Control Plane handles radio-specific functionality which depends on the state of the user equipment which includes two states: idle or connected.

    Mode Description
    Idle The user equipment camps on a cell after a cell selection or reselection process where factors like radio link quality, cell status and radio access technology are considered. The UE also monitors a paging channel to detect incoming calls and acquire system information. In this mode, control plane protocols include cell selection and reselection procedures.
    Connected The UE supplies the E-UTRAN with downlink channel quality and neighbour cell information to enable the E-UTRAN to select the most suitable cell for the UE. In this case, the control plane protocols include the Radio Resource Control (RRC) protocol.

    The protocol stack for the control plane between the UE and MME is shown below. The grey region of the stack indicates the access stratum (AS) protocols. The lower layers perform the same functions as for the user plane with the exception that there is no header compression function for the control plane.

    Let’s have a closer look at all the layers available in the E-UTRAN protocol stack, which we saw in the previous chapter. Below is a more elaborate diagram of the E-UTRAN protocol stack:

    LTE Protocol Layers

    Physical Layer (Layer 1)

    The physical layer carries all information from the MAC transport channels over the air interface. It takes care of link adaptation (AMC), power control, cell search (for initial synchronization and handover purposes) and other measurements (inside the LTE system and between systems) for the RRC layer.

    Medium Access Layer (MAC)

    The MAC layer is responsible for:

    • Mapping between logical channels and transport channels
    • Multiplexing of MAC SDUs from one or different logical channels onto transport blocks (TBs) to be delivered to the physical layer on transport channels
    • De-multiplexing of MAC SDUs from one or different logical channels from transport blocks (TBs) delivered from the physical layer on transport channels
    • Scheduling information reporting
    • Error correction through HARQ
    • Priority handling between UEs by means of dynamic scheduling
    • Priority handling between logical channels of one UE
    • Logical channel prioritization

    Radio Link Control (RLC)

    RLC operates in 3 modes of operation: Transparent Mode (TM), Unacknowledged Mode (UM), and Acknowledged Mode (AM).

    RLC Layer is responsible for transfer of upper layer PDUs, error correction through ARQ (Only for AM data transfer), Concatenation, segmentation and reassembly of RLC SDUs (Only for UM and AM data transfer).

    RLC is also responsible for re-segmentation of RLC data PDUs (Only for AM data transfer), reordering of RLC data PDUs (Only for UM and AM data transfer), duplicate detection (Only for UM and AM data transfer), RLC SDU discard (Only for UM and AM data transfer), RLC re-establishment, and protocol error detection (Only for AM data transfer).

    Radio Resource Control (RRC)

    The main services and functions of the RRC sublayer include broadcast of System Information related to the non-access stratum (NAS), broadcast of System Information related to the access stratum (AS), Paging, establishment, maintenance and release of an RRC connection between the UE and E-UTRAN, Security functions including key management, establishment, configuration, maintenance and release of point to point Radio Bearers.

    Packet Data Convergence Control (PDCP)

    The PDCP layer is responsible for header compression and decompression of IP data, transfer of data (user plane or control plane), maintenance of PDCP Sequence Numbers (SNs), in-sequence delivery of upper layer PDUs at re-establishment of lower layers, duplicate elimination of lower layer SDUs at re-establishment of lower layers for radio bearers mapped on RLC AM, ciphering and deciphering of user plane and control plane data, integrity protection and integrity verification of control plane data, timer-based discard, and duplicate discarding. PDCP is used for SRBs and DRBs mapped on DCCH and DTCH types of logical channels.

    Non Access Stratum (NAS) Protocols

    The non-access stratum (NAS) protocols form the highest stratum of the control plane between the user equipment (UE) and MME.

    NAS protocols support the mobility of the UE and the session management procedures to establish and maintain IP connectivity between the UE and a PDN GW.

    Below is a logical diagram of the E-UTRAN protocol layers with a depiction of data flow through the various layers:

    LTE Layers Data Flow

    Packets received by a layer are called Service Data Units (SDUs), while the packet output of a layer is referred to as a Protocol Data Unit (PDU). Let’s see the flow of data from top to bottom:

    • The IP layer submits PDCP SDUs (IP packets) to the PDCP layer. The PDCP layer performs header compression, adds the PDCP header to these PDCP SDUs, and submits the PDCP PDUs (RLC SDUs) to the RLC layer. PDCP header compression: PDCP removes the IP header (a minimum of 20 bytes) from the PDU and adds a token of 1–4 bytes, which provides tremendous savings in the amount of header that would otherwise have to go over the air.

      LTE PDCP SDU

    • The RLC layer segments these SDUs to make RLC PDUs. RLC adds a header based on the RLC mode of operation and submits these RLC PDUs (MAC SDUs) to the MAC layer. RLC segmentation: if an RLC SDU is large, or the available radio data rate is low (resulting in small transport blocks), the RLC SDU may be split among several RLC PDUs. If the RLC SDU is small, or the available radio data rate is high, several RLC SDUs may be packed into a single PDU.
    • The MAC layer adds a header and does padding to fit the MAC SDU into the TTI. The MAC layer then submits the MAC PDU to the physical layer for transmission onto the physical channels.
    • The physical channel transmits this data in the slots of a subframe.

    The information flows between the different protocols are known as channels and signals. LTE uses several different types of logical, transport and physical channels, which are distinguished by the kind of information they carry and by the way in which the information is processed.

      • Logical Channels : Define what type of information is transmitted over the air, e.g. traffic channels, control channels, system broadcast, etc. Data and signalling messages are carried on logical channels between the RLC and MAC protocols.
      • Transport Channels : Define how something is transmitted over the air, e.g. what encoding and interleaving options are used to transmit the data. Data and signalling messages are carried on transport channels between the MAC and the physical layer.
      • Physical Channels : Define where something is transmitted over the air, e.g. in the first N symbols of the DL frame. Data and signalling messages are carried on physical channels between the different levels of the physical layer.

      Logical Channels

      Logical channels define what type of data is transferred. These channels define the data-transfer services offered by the MAC layer. Data and signalling messages are carried on logical channels between the RLC and MAC protocols.

      Logical channels can be divided into control channels and traffic channels. A control channel can be either a common channel or a dedicated channel. A common channel is common to all users in a cell (point to multipoint), while dedicated channels can be used by only one user (point to point).

      Logical channels are distinguished by the information they carry and can be classified in two ways. Firstly, logical traffic channels carry data in the user plane, while logical control channels carry signalling messages in the control plane. Following table lists the logical channels that are used by LTE:

      Channel Name Acronym Type
      Broadcast Control Channel BCCH Control
      Paging Control Channel PCCH Control
      Common Control Channel CCCH Control
      Dedicated Control Channel DCCH Control
      Multicast Control Channel MCCH Control
      Dedicated Traffic Channel DTCH Traffic
      Multicast Traffic Channel MTCH Traffic

      Transport Channels

      Transport channels define how and with what type of characteristics the data is transferred by the physical layer. Data and signalling messages are carried on transport channels between the MAC and the physical layer.

      Transport Channels are distinguished by the ways in which the transport channel processor manipulates them. Following table lists the transport channels that are used by LTE:

      Channel Name Acronym Direction
      Broadcast Channel BCH Downlink
      Downlink Shared Channel DL-SCH Downlink
      Paging Channel PCH Downlink
      Multicast Channel MCH Downlink
      Uplink Shared Channel UL-SCH Uplink
      Random Access Channel RACH Uplink

      Physical Channels

      Data and signalling messages are carried on physical channels between the different levels of the physical layer and accordingly they are divided into two parts:

      • Physical Data Channels
      • Physical Control Channels

      Physical data channels

      Physical data channels are distinguished by the ways in which the physical channel processor manipulates them, and by the ways in which they are mapped onto the symbols and sub-carriers used by orthogonal frequency-division multiple access (OFDMA). The following table lists the physical data channels that are used by LTE:

      Channel Name Acronym Direction
      Physical downlink shared channel PDSCH Downlink
      Physical broadcast channel PBCH Downlink
      Physical multicast channel PMCH Downlink
      Physical uplink shared channel PUSCH Uplink
      Physical random access channel PRACH Uplink

      The transport channel processor composes several types of control information, to support the low-level operation of the physical layer. These are listed in the below table:

      Field Name Acronym Direction
      Downlink control information DCI Downlink
      Control format indicator CFI Downlink
      Hybrid ARQ indicator HI Downlink
      Uplink control information UCI Uplink

      Physical Control Channels

      The transport channel processor also creates control information that supports the low-level operation of the physical layer and sends this information to the physical channel processor in the form of physical control channels.

      The information travels as far as the transport channel processor in the receiver, but is completely invisible to higher layers. Similarly, the physical channel processor creates physical signals, which support the lowest-level aspects of the system.

      Physical Control Channels are listed in the below table:

      Channel Name Acronym Direction
      Physical control format indicator channel PCFICH Downlink
      Physical hybrid ARQ indicator channel PHICH Downlink
      Physical downlink control channel PDCCH Downlink
      Relay physical downlink control channel R-PDCCH Downlink
      Physical uplink control channel PUCCH Uplink

      The base station also transmits two other physical signals, which help the mobile acquire the base station after it first switches on. These are known as the primary synchronization signal (PSS) and the secondary synchronization signal (SSS).

      To overcome the multipath fading problem present in UMTS, LTE uses Orthogonal Frequency Division Multiplexing (OFDM) for the downlink – that is, from the base station to the terminal – to transmit the data over many narrowband carriers of 180 kHz each instead of spreading one signal over the complete 5 MHz carrier bandwidth. In other words, OFDM uses a large number of narrow sub-carriers for multi-carrier transmission of data.

      Orthogonal frequency-division multiplexing (OFDM) is a frequency-division multiplexing (FDM) scheme used as a digital multi-carrier modulation method.

      OFDM meets the LTE requirement for spectrum flexibility and enables cost-efficient solutions for very wide carriers with high peak rates. The basic LTE downlink physical resource can be seen as a time-frequency grid, as illustrated in Figure below:

      The OFDM symbols are grouped into resource blocks. The resource blocks have a total size of 180kHz in the frequency domain and 0.5ms in the time domain. Each 1ms Transmission Time Interval (TTI) consists of two slots (Tslot).

      LTE OFDM

      Each user is allocated a number of so-called resource blocks in the time–frequency grid. The more resource blocks a user gets, and the higher the modulation used in the resource elements, the higher the bit rate. Which resource blocks, and how many, the user gets at a given point in time depends on advanced scheduling mechanisms in the frequency and time dimensions.

      The scheduling mechanisms in LTE are similar to those used in HSPA, and enable optimal performance for different services in different radio environments.
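
As a rough back-of-the-envelope illustration of these figures, the sketch below multiplies out the downlink resource grid for a 20 MHz carrier. The 12 subcarriers per resource block and 7 symbols per slot (normal cyclic prefix) are standard LTE numbers assumed here for illustration, and the result is a raw rate before control, reference-signal and coding overhead:

```java
// Back-of-the-envelope calculation of the LTE downlink resource grid quoted above.
public class LteResourceGrid {
    public static void main(String[] args) {
        final int subcarriersPerRb = 12;   // 12 x 15 kHz = 180 kHz per resource block
        final int symbolsPerSlot   = 7;    // normal cyclic prefix
        final int slotsPerTti      = 2;    // 1 ms TTI = 2 x 0.5 ms slots

        int resourceBlocks = 100;          // 20 MHz channel -> 100 resource blocks (see table above)
        int resourceElementsPerTti =
                resourceBlocks * subcarriersPerRb * symbolsPerSlot * slotsPerTti;

        // Rough, pre-overhead peak rate: 64QAM (6 bits/symbol) with 2x2 MIMO (2 layers).
        int bitsPerSymbol = 6;
        int mimoLayers = 2;
        double rawBitsPerMs = (double) resourceElementsPerTti * bitsPerSymbol * mimoLayers;
        double rawMbps = rawBitsPerMs * 1000 / 1e6;   // bits per ms -> bits per second -> Mbps

        System.out.println("Resource elements per 1 ms TTI: " + resourceElementsPerTti);
        System.out.printf("Raw peak rate before overhead: %.1f Mbps%n", rawMbps);
        // Control channels, reference signals and channel coding bring this down to roughly
        // the 150 Mbps quoted for a Category 4 UE with 2x2 MIMO in 20 MHz.
    }
}
```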

      Advantages of OFDM

      • The primary advantage of OFDM over single-carrier schemes is its ability to cope with severe channel conditions (for example, attenuation of high frequencies in a long copper wire, narrowband interference and frequency-selective fading due to multipath) without complex equalization filters.
      • Channel equalization is simplified because OFDM may be viewed as using many slowly-modulated narrowband signals rather than one rapidly-modulated wideband signal.
      • The low symbol rate makes the use of a guard interval between symbols affordable, making it possible to eliminate inter symbol interference (ISI).
      • This mechanism also facilitates the design of single frequency networks (SFNs), where several adjacent transmitters send the same signal simultaneously at the same frequency, as the signals from multiple distant transmitters may be combined constructively, rather than interfering as would typically occur in a traditional single-carrier system.

      Drawbacks of OFDM

      • High peak-to-average ratio
      • Sensitive to frequency offset, hence to Doppler-shift as well

      SC-FDMA Technology

      LTE uses a pre-coded version of OFDM called Single Carrier Frequency Division Multiple Access (SC-FDMA) in the uplink. This is to compensate for a drawback with normal OFDM, which has a very high Peak to Average Power Ratio (PAPR).

      High PAPR requires expensive and inefficient power amplifiers with high requirements on linearity, which increases the cost of the terminal and drains the battery faster.

      SC-FDMA solves this problem by grouping together the resource blocks in such a way that reduces the need for linearity, and so power consumption, in the power amplifier. A low PAPR also improves coverage and the cell-edge performance.

      Term Description
      3GPP 3rd Generation Partnership Project
      3GPP2 3rd Generation Partnership Project 2
      ARIB Association of Radio Industries and Businesses
      ATIS Alliance for Telecommunication Industry Solutions
      AWS Advanced Wireless Services
      CAPEX Capital Expenditure
      CCSA China Communications Standards Association
      CDMA Code Division Multiple Access
      CDMA2000 Code Division Multiple Access 2000
      DAB Digital Audio Broadcast
      DSL Digital Subscriber Line
      DVB Digital Video Broadcast
      eHSPA evolved High Speed Packet Access
      ETSI European Telecommunications Standards Institute
      FDD Frequency Division Duplex
      FWT Fixed Wireless Terminal
      GSM Global System for Mobile communication
      HSPA High Speed Packet Access
      HSS Home Subscriber Server
      IEEE Institute of Electrical and Electronics Engineers
      IPTV Internet Protocol Television
      LTE Long Term Evolution
      MBMS Multimedia Broadcast Multicast Service
      MIMO Multiple Input Multiple Output
      MME Mobility Management Entity
      NGMN Next Generation Mobile Networks
      OFDM Orthogonal Frequency Division Multiplexing
      OPEX Operational Expenditure
      PAPR Peak to Average Power Ratio
      PCI Peripheral Component Interconnect
      PCRF Policy Control and Charging Rules Function
      PDSN Packet Data Serving Node
      PS Packet Switched
      QoS Quality of Service
      RAN Radio Access Network
      SAE System Architecture Evolution
      SC-FDMA Single Carrier Frequency Division Multiple Access
      SGSN Serving GPRS Support Node
      TDD Time Division Duplex
      TTA Telecommunications Technology Association
      TTC Telecommunication Technology Committee
      TTI Transmission Time Interval
      UTRA Universal Terrestrial Radio Access
      UTRAN Universal Terrestrial Radio Access Network
      WCDMA Wideband Code Division Multiple Access
      WLAN Wireless Local Area Network

Apache Avro – Generic Data Serialization System

Data serialization is a mechanism for translating data in a computer environment (such as memory buffers, data structures, or object state) into a binary or textual form that can be transported over a network or stored in persistent storage media.

Java and Hadoop provide serialization APIs, which are Java-based; Avro, by contrast, is not only language independent but also schema-based. We shall explore the differences among them in the coming chapters.

What is Avro ?

Apache Avro is a language-neutral data serialization system. It was developed by Doug Cutting, the father of Hadoop. Since Hadoop writable classes lack language portability, Avro becomes quite helpful, as it deals with data formats that can be processed by multiple languages. Avro is a preferred tool to serialize data in Hadoop.

Avro has a schema-based system. A language-independent schema is associated with its read and write operations. Avro serializes the data, which carries a built-in schema, into a compact binary format that can be deserialized by any application.

Avro uses JSON format to declare the data structures. Presently, it supports languages such as Java, C, C++, C#, Python, and Ruby.

Avro Schemas

Avro depends heavily on its schema. Because the schema is stored along with the Avro data in a file, the data can later be processed without prior knowledge of the schema. Avro serializes quickly, and the resulting serialized data is smaller in size.

In RPC, the client and the server exchange schemas during the connection. This exchange helps in resolving same-named fields, missing fields, extra fields, etc.

Avro schemas are defined in JSON, which simplifies implementation in languages that already have JSON libraries.
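
As a minimal illustration (the Employee record and its fields are invented for the example), a JSON schema can be parsed with Avro’s Java Schema.Parser:

```java
import org.apache.avro.Schema;

// Minimal sketch: defining an Avro schema in JSON and parsing it with the Java API.
public class SchemaExample {
    public static void main(String[] args) {
        String json =
            "{ \"type\": \"record\", \"name\": \"Employee\", " +
            "  \"fields\": [" +
            "    {\"name\": \"name\", \"type\": \"string\"}," +
            "    {\"name\": \"age\",  \"type\": \"int\"}" +
            "  ] }";

        Schema schema = new Schema.Parser().parse(json);  // throws SchemaParseException on bad JSON
        System.out.println(schema.getName());    // Employee
        System.out.println(schema.getFields());  // [name, age]
    }
}
```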

Like Avro, there are other serialization mechanisms in Hadoop such as Sequence Files, Protocol Buffers, and Thrift.

Thrift & Protocol Buffers Vs. Avro

Thrift and Protocol Buffers are the most capable libraries comparable with Avro. Avro differs from these frameworks in the following ways −

  • Avro supports both dynamic and static types as per the requirement. Protocol Buffers and Thrift use Interface Definition Languages (IDLs) to specify schemas and their types. These IDLs are used to generate code for serialization and deserialization.
  • Avro is built into the Hadoop ecosystem, while Thrift and Protocol Buffers are not.

Unlike Thrift and Protocol Buffers, Avro’s schema definition is in JSON and not in any proprietary IDL.

Property Avro Thrift & Protocol Buffer
Dynamic schema Yes No
Built into Hadoop Yes No
Schema in JSON Yes No
No need to compile Yes No
No need to declare IDs Yes No
Bleeding edge Yes No

Features of Avro

Listed below are some of the prominent features of Avro −

  • Avro is a language-neutral data serialization system.
  • It can be processed by many languages (currently C, C++, C#, Java, Python, and Ruby).
  • Avro creates a binary, structured format that is both compressible and splittable. Hence it can be efficiently used as the input to Hadoop MapReduce jobs.
  • Avro provides rich data structures. For example, you can create a record that contains an array, an enumerated type, and a sub-record. These datatypes can be created in any language, can be processed in Hadoop, and the results can be fed to a third language.
  • Avro schemas, defined in JSON, facilitate implementation in languages that already have JSON libraries.
  • Avro creates a self-describing file called an Avro Data File, in which it stores data along with its schema in the metadata section.
  • Avro is also used in Remote Procedure Calls (RPCs). During RPC, client and server exchange schemas in the connection handshake.

How to use Avro?

To use Avro, you need to follow the given workflow −

  • Step 1 − Create schemas. Here you need to design Avro schema according to your data.
  • Step 2 − Read the schemas into your program. It is done in two ways −
    • By Generating a Class Corresponding to the Schema − Compile the schema using Avro. This generates a class file corresponding to the schema.
    • By Using the Parsers Library − You can directly read the schema using the parsers library.
  • Step 3 − Serialize the data using the serialization API provided for Avro, which is found in the package org.apache.avro.specific.
  • Step 4 − Deserialize the data using deserialization API provided for Avro, which is found in the package org.apache.avro.specific.
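
Putting these steps together, here is a minimal sketch that uses the parsers library and the generic API (org.apache.avro.generic) rather than generated classes; the schema, field values and file name are illustrative:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

// Sketch of the Avro workflow: parse a schema, serialize a record, then deserialize it.
public class AvroWorkflow {
    public static void main(String[] args) throws Exception {
        // Steps 1 & 2: create the schema and read it into the program via the parser.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Employee\",\"fields\":["
          + "{\"name\":\"name\",\"type\":\"string\"},{\"name\":\"age\",\"type\":\"int\"}]}");

        // Step 3: serialize a record into an Avro data file (the schema travels with the data).
        GenericRecord emp = new GenericData.Record(schema);
        emp.put("name", "Alice");
        emp.put("age", 30);
        File file = new File("employees.avro");
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, file);
            writer.append(emp);
        }

        // Step 4: deserialize the records back from the file.
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(file, new GenericDatumReader<GenericRecord>(schema))) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("name") + " is " + record.get("age"));
            }
        }
    }
}
```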

What is Serialization?

Serialization is the process of translating data structures or object state into a binary or textual form in order to transport the data over a network or to store it in persistent storage. Once the data has been transported over the network or retrieved from persistent storage, it needs to be deserialized again. Serialization is also termed marshalling, and deserialization is termed unmarshalling.

Serialization in Java

Java provides a mechanism called object serialization, where an object can be represented as a sequence of bytes that includes the object’s data as well as information about the object’s type and the types of the data stored in the object.

After a serialized object is written into a file, it can be read from the file and deserialized. That is, the type information and bytes that represent the object and its data can be used to recreate the object in memory.

The ObjectOutputStream and ObjectInputStream classes are used to serialize and deserialize an object, respectively, in Java.
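For illustration, the following minimal sketch serializes an object of a hypothetical Person class (not part of any library) to a file and reads it back using these two classes −

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

//Hypothetical example class; a class must implement Serializable to be serialized
class Person implements Serializable {
   String name;
   int age;
   Person(String name, int age) { this.name = name; this.age = age; }
}

public class JavaSerializationExample {
   public static void main(String args[]) throws IOException, ClassNotFoundException {
      //Serializing the object to a file
      ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream("person.ser"));
      out.writeObject(new Person("omar", 21));
      out.close();

      //Deserializing the object back into memory
      ObjectInputStream in = new ObjectInputStream(new FileInputStream("person.ser"));
      Person p = (Person) in.readObject();
      in.close();

      System.out.println(p.name + " : " + p.age);
   }
}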

Serialization in Hadoop

Generally in distributed systems like Hadoop, the concept of serialization is used for Interprocess Communication and Persistent Storage.

Interprocess Communication

  • To establish interprocess communication between the nodes connected in a network, the RPC technique is used.
  • RPC uses internal serialization to convert the message into a binary format before sending it to the remote node via the network. At the other end, the remote system deserializes the binary stream into the original message.
  • The RPC serialization format is required to be as follows −
    • Compact − To make the best use of network bandwidth, which is the most scarce resource in a data center.
    • Fast − Since the communication between the nodes is crucial in distributed systems, the serialization and deserialization process should be quick, producing less overhead.
    • Extensible − Protocols change over time to meet new requirements, so it should be straightforward to evolve the protocol in a controlled manner for clients and servers.
    • Interoperable − The message format should support the nodes that are written in different languages.

Persistent Storage

Persistent Storage is a digital storage facility that does not lose its data with the loss of power supply. For example – Magnetic disks and Hard Disk Drives.

Writable Interface

This is the interface in Hadoop which provides methods for serialization and deserialization. The following table describes the methods −

S.No. Methods and Description
1 void readFields(DataInput in)

This method is used to deserialize the fields of the given object.

2 void write(DataOutput out)

This method is used to serialize the fields of the given object.

WritableComparable Interface

It is the combination of Writable and Comparable interfaces. This interface inherits Writable interface of Hadoop as well as Comparable interface of Java. Therefore it provides methods for data serialization, deserialization, and comparison.

S.No. Methods and Description
1 int compareTo(T o)

This method compares the current object with the given object o.
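For illustration, given below is a minimal sketch of a user-defined key type (a hypothetical EmployeeId class, not part of Hadoop) that implements the WritableComparable interface by providing write(), readFields(), and compareTo() −

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

//Hypothetical custom key wrapping a single int field
public class EmployeeId implements WritableComparable<EmployeeId> {
   private int id;

   //A no-argument constructor is required so Hadoop can instantiate the class
   public EmployeeId() { }
   public EmployeeId(int id) { this.id = id; }

   //Serializing the field
   public void write(DataOutput out) throws IOException {
      out.writeInt(id);
   }

   //Deserializing the field
   public void readFields(DataInput in) throws IOException {
      id = in.readInt();
   }

   //Comparison used while sorting keys
   public int compareTo(EmployeeId other) {
      return Integer.compare(id, other.id);
   }
}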

In addition to these classes, Hadoop supports a number of wrapper classes that implement WritableComparable interface. Each class wraps a Java primitive type. The class hierarchy of Hadoop serialization is given below −

Hadoop Serialization Hierarchy

These classes are useful to serialize various types of data in Hadoop. For instance, let us consider the IntWritable class. Let us see how this class is used to serialize and deserialize the data in Hadoop.

IntWritable Class

This class implements Writable, Comparable, and WritableComparable interfaces. It wraps an integer data type in it. This class provides methods used to serialize and deserialize integer type of data.

Constructors

S.No. Summary
1 IntWritable()
2 IntWritable( int value)

Methods

S.No. Summary
1 int get()

Using this method you can get the integer value present in the current object.

2 void readFields(DataInput in)

This method is used to deserialize the data in the given DataInput object.

3 void set(int value)

This method is used to set the value of the current IntWritable object.

4 void write(DataOutput out)

This method is used to serialize the data in the current object to the given DataOutput object.

Serializing the Data in Hadoop

The procedure to serialize the integer type of data is discussed below.

  • Instantiate IntWritable class by wrapping an integer value in it.
  • Instantiate ByteArrayOutputStream class.
  • Instantiate DataOutputStream class and pass the object of ByteArrayOutputStream class to it.
  • Serialize the integer value in IntWritable object using write() method. This method needs an object of DataOutputStream class.
  • The serialized data will be stored in the byte array object which is passed as parameter to the DataOutputStream class at the time of instantiation. Convert the data in the object to byte array.

Example

The following example shows how to serialize data of integer type in Hadoop −

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class Serialization {
   public byte[] serialize() throws IOException{
		
      //Instantiating the IntWritable object
      IntWritable intwritable = new IntWritable(12);
   
      //Instantiating ByteArrayOutputStream object
      ByteArrayOutputStream byteoutputStream = new ByteArrayOutputStream();
   
      //Instantiating DataOutputStream object
      DataOutputStream dataOutputStream = new
      DataOutputStream(byteoutputStream);
   
      //Serializing the data
      intwritable.write(dataOutputStream);
   
      //storing the serialized object in bytearray
      byte[] byteArray = byteoutputStream.toByteArray();
   
      //Closing the OutputStream
      dataOutputStream.close();
      return(byteArray);
   }
	
   public static void main(String args[]) throws IOException{
      Serialization serialization = new Serialization();
      byte[] byteArray = serialization.serialize();

      //printing the serialized bytes
      System.out.println(java.util.Arrays.toString(byteArray));
   }
}

Deserializing the Data in Hadoop

The procedure to deserialize the integer type of data is discussed below −

  • Instantiate the IntWritable class (the object will be filled during deserialization).
  • Instantiate the ByteArrayInputStream class by passing the serialized byte array to it.
  • Instantiate the DataInputStream class and pass the object of the ByteArrayInputStream class to it.
  • Deserialize the data in the object of DataInputStream using the readFields() method of the IntWritable class.
  • The deserialized data will be stored in the object of the IntWritable class. You can retrieve this data using the get() method of this class.

Example

The following example shows how to deserialize the data of integer type in Hadoop −

import java.io.ByteArrayInputStream;
import java.io.DataInputStream;

import org.apache.hadoop.io.IntWritable;

public class Deserialization {

   public void deserialize(byte[]byteArray) throws Exception{
   
      //Instantiating the IntWritable class
      IntWritable intwritable =new IntWritable();
      
      //Instantiating ByteArrayInputStream object
      ByteArrayInputStream InputStream = new ByteArrayInputStream(byteArray);
      
      //Instantiating DataInputStream object
      DataInputStream datainputstream=new DataInputStream(InputStream);
      
      //deserializing the data in DataInputStream
      intwritable.readFields(datainputstream);
      
      //printing the deserialized data
      System.out.println((intwritable).get());
   }
   
   public static void main(String args[]) throws Exception {
      Deserialization dese = new Deserialization();
      dese.deserialize(new Serialization().serialize());
   }
}

Advantage of Hadoop over Java Serialization

Hadoop’s Writable-based serialization is capable of reducing the object-creation overhead by reusing the Writable objects, which is not possible with the Java’s native serialization framework.
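The following minimal sketch illustrates this reuse: a single IntWritable object is refilled for every record instead of creating a new object per value (the byte stream is assumed to have been produced with IntWritable.write()) −

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;

public class WritableReuse {
   public static void main(String args[]) throws IOException {
      //Writing three integers with one reused IntWritable object
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      IntWritable writable = new IntWritable();
      for (int value : new int[]{10, 20, 30}) {
         writable.set(value);   //reusing the same object
         writable.write(out);
      }
      out.close();

      //Reading them back, again reusing a single IntWritable object
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
      IntWritable reused = new IntWritable();
      for (int i = 0; i < 3; i++) {
         reused.readFields(in);   //the same object is refilled each time
         System.out.println(reused.get());
      }
   }
}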

Disadvantages of Hadoop Serialization

To serialize Hadoop data, there are two ways −

  • You can use the Writable classes, provided by Hadoop’s native library.
  • You can also use Sequence Files which store the data in binary format.

The main drawback of these two mechanisms is that Writables and SequenceFiles have only a Java API and they cannot be written or read in any other language.

Therefore, files created in Hadoop with the above two mechanisms cannot be read by any third language, which makes Hadoop a limited box. To address this drawback, Doug Cutting created Avro, which is a language-independent data serialization system.

The Apache Software Foundation provides Avro in various releases. You can download the required release from the Apache mirrors. Let us see how to set up the environment to work with Avro −

Downloading Avro

To download Apache Avro, proceed with the following −

  • Open the Apache Avro web page (avro.apache.org). You will see the homepage of Apache Avro as shown below −

Avro Homepage

  • Click on project → releases. You will get a list of releases.
  • Select the latest release which leads you to a download link.
  • mirror.nexcess is one of the mirror links, where you can find the libraries for all the languages that Avro supports, as shown below −

Avro Languages Supports

You can select and download the library for any of the languages provided. In this tutorial, we use Java. Hence download the jar files avro-1.7.7.jar and avro-tools-1.7.7.jar.

Avro with Eclipse

To use Avro in Eclipse environment, you need to follow the steps given below −

  • Step 1. Open eclipse.
  • Step 2. Create a project.
  • Step 3. Right-click on the project name. You will get a shortcut menu.
  • Step 4. Click on Build Path. It leads you to another shortcut menu.
  • Step 5. Click on Configure Build Path… You can see Properties window of your project as shown below −

Properties of Avro

  • Step 6. Under the Libraries tab, click on the Add External JARs… button.
  • Step 7. Select the jar file avro-1.7.7.jar you have downloaded.
  • Step 8. Click on OK.

Avro with Maven

You can also get the Avro library into your project using Maven. Given below is the pom.xml file for Avro.

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="   http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">

   <modelVersion>4.0.0</modelVersion>
   <groupId>Test</groupId>
   <artifactId>Test</artifactId>
   <version>0.0.1-SNAPSHOT</version>

   <build>
      <sourceDirectory>src</sourceDirectory>
      <plugins>
         <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.1</version>
		
            <configuration>
               <source>1.7</source>
               <target>1.7</target>
            </configuration>
		
         </plugin>
      </plugins>
   </build>

   <dependencies>
      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro</artifactId>
         <version>1.7.7</version>
      </dependency>
	
      <dependency>
         <groupId>org.apache.avro</groupId>
         <artifactId>avro-tools</artifactId>
         <version>1.7.7</version>
      </dependency>
	
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-api</artifactId>
         <version>2.0-beta9</version>
      </dependency>
	
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-core</artifactId>
         <version>2.0-beta9</version>
      </dependency>
	
   </dependencies>

</project>

Setting Classpath

To work with Avro in Linux environment, download the following jar files −

  • avro-1.7.7.jar
  • avro-tools-1.7.7.jar
  • log4j-api-2.0-beta9.jar
  • log4j-core-2.0-beta9.jar

Copy these files into a folder and set the classpath to the folder, in the ~/.bashrc file as shown below.

#class path for Avro
export CLASSPATH=$CLASSPATH:/home/Hadoop/Avro_Work/jars/*

Setting CLASSPATH

Avro, being a schema-based serialization utility, accepts schemas as input. In spite of various schemas being available, Avro follows its own standards of defining schemas. These schemas describe the following details −

  • type of file (record by default)
  • location of the record (namespace)
  • name of the record
  • fields in the record with their corresponding data types

Using these schemas, you can store serialized values in binary format using less space. These values are stored without any metadata.

Creating Avro Schemas

The Avro schema is created in JavaScript Object Notation (JSON) document format, which is a lightweight text-based data interchange format. It is created in one of the following ways −

  • A JSON string
  • A JSON object
  • A JSON array

Example − The following schema defines a record-type document within the “sample” namespace. The name of the document is “Employee”, and it contains two fields → Name and Age.

{
   "type" : "record",
   "namespace" : "sample",
   "name" : "Employee",
   "fields" : [
      { "name" : "Name" , "type" : "string" },
      { "name" : "Age" , "type" : "int" }
   ]
}

Observe that the schema contains four attributes; they are briefly described below −

  • type − Describes document type, in this case a “record”.
  • namespace − Describes the name of the namespace in which the object resides.
  • name − Describes the schema name.
  • fields − This is an attribute array which contains the following −
    • name − Describes the name of field
    • type − Describes data type of field

Primitive Data Types of Avro

An Avro schema has primitive data types as well as complex data types. The following table describes the primitive data types of Avro −

Data type Description
null Null is a type having no value.
int 32-bit signed integer.
long 64-bit signed integer.
float single precision (32-bit) IEEE 754 floating-point number.
double double precision (64-bit) IEEE 754 floating-point number.
bytes sequence of 8-bit unsigned bytes.
string Unicode character sequence.
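As a small aside, assuming the Avro Java library set up earlier is on the classpath, the same primitive types can also be created programmatically; printing such a schema shows its JSON form −

import org.apache.avro.Schema;

public class PrimitiveSchemas {
   public static void main(String args[]) {
      //Each primitive schema prints as its JSON name, for example "int" or "string"
      System.out.println(Schema.create(Schema.Type.NULL));
      System.out.println(Schema.create(Schema.Type.INT));
      System.out.println(Schema.create(Schema.Type.LONG));
      System.out.println(Schema.create(Schema.Type.FLOAT));
      System.out.println(Schema.create(Schema.Type.DOUBLE));
      System.out.println(Schema.create(Schema.Type.BYTES));
      System.out.println(Schema.create(Schema.Type.STRING));
   }
}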

Complex Data Types of Avro

Along with primitive data types, Avro provides six complex data types namely Records, Enums, Arrays, Maps, Unions, and Fixed.

Record

As we have seen by now, a record data type in Avro is a collection of multiple attributes. It supports the following attributes −

  • name
  • namespace
  • type
  • fields

Enum

An enumeration is a list of items in a collection. An Avro enumeration supports the following attributes −

  • name − The value of this field holds the name of the enumeration.
  • namespace − The value of this field contains the string that qualifies the name of the Enumeration.
  • symbols − The value of this field holds the enum’s symbols as an array of names.

Example

Given below is the example of an enumeration.

{
   "type" : "enum",
   "name" : "Numbers", "namespace": "data", "symbols" : [ "ONE", "TWO", "THREE", "FOUR" ]
}

Arrays

This data type defines an array field having a single attribute items. This items attribute specifies the type of items in the array.

Example

{ " type " : " array ", " items " : " int " }

Maps

The map data type is a collection of key-value pairs. The values attribute holds the data type of the map’s values; the keys of an Avro map are always assumed to be strings. The example below shows a map from string to int.

Example

{"type" : "map", "values" : "int"}

Unions

A union datatype is used whenever a field may hold values of more than one type. Unions are represented as JSON arrays. For example, if a field could be either an int or null, then the union is represented as [“int”, “null”].

Example

Given below is an example document using unions −

{ 
   "type" : "record", 
   "namespace" : "sample.com", 
   "name" : "empdetails ", 
   "fields" : 
   [ 
      { "name" : "experience", "type": ["int", "null"] }, { "name" : "age", "type": "int" } 
   ] 
}

Fixed

This data type is used to declare a fixed-size field that can be used for storing binary data. It has name and size as attributes: name holds the name of the field, and size holds the number of bytes in the field.

Example

{ "type" : "fixed" , "name" : "bdata", "size" : 1048576}

In the previous chapter, we described the input type of Avro, i.e., Avro schemas. In this chapter, we will explain the classes and methods used in the serialization and deserialization of Avro schemas.

SpecificDatumWriter Class

This class belongs to the package org.apache.avro.specific. It implements the DatumWriter interface which converts Java objects into an in-memory serialized format.

Constructor

S.No. Description
1 SpecificDatumWriter(Schema schema)

Method

S.No. Description
1 SpecificData getSpecificData()

Returns the SpecificData implementation used by this writer.

SpecificDatumReader Class

This class belongs to the package org.apache.avro.specific. It implements the DatumReader interface which reads the data of a schema and determines in-memory data representation. SpecificDatumReader is the class which supports generated java classes.

Constructor

S.No. Description
1 SpecificDatumReader(Schema schema)

Construct where the writer’s and reader’s schemas are the same.

Methods

S.No. Description
1 SpecificData getSpecificData()

Returns the contained SpecificData.

2 void setSchema(Schema actual)

This method is used to set the writer’s schema.

DataFileWriter

This class writes a sequence of serialized records of data conforming to a schema, along with the schema itself, in a file.

Constructor

S.No. Description
1 DataFileWriter(DatumWriter<D> dout)

Methods

S.No Description
1 void append(D datum)

Appends a datum to a file.

2 DataFileWriter<D> appendTo(File file)

This method is used to open a writer appending to an existing file.

DataFileReader

This class provides random access to files written with DataFileWriter. It inherits the class DataFileStream.

Constructor

S.No. Description
1 DataFileReader(File file, DatumReader<D> reader)

Methods

S.No. Description
1 next()

Reads the next datum in the file.

2 boolean hasNext()

Returns true if more entries remain in this file.

Class Schema.Parser

This class is a parser for JSON-format schemas. It contains methods to parse the schema. It belongs to the org.apache.avro package.

Constructor

S.No. Description
1 Schema.Parser()

Methods

S.No. Description
1 parse (File file)

Parses the schema provided in the given file.

2 parse (InputStream in)

Parses the schema provided in the given InputStream.

3 parse (String s)

Parses the schema provided in the given String.

Interface GenericRecord

This interface provides methods to access the fields by name as well as index.

Methods

S.No. Description
1 Object get(String key)

Returns the value of a field given its name.

2 void put(String key, Object v)

Sets the value of a field given its name.

Class GenericData.Record

Constructor

S.No. Description
1 GenericData.Record(Schema schema)

Methods

S.No. Description
1 Object get(String key)

Returns the value of a field of the given name.

2 Schema getSchema()

Returns the schema of this instance.

3 void put(int i, Object v)

Sets the value of a field given its position in the schema.

4 void put(String key, Object value)

Sets the value of a field given its name.

One can read an Avro schema into the program either by generating a class corresponding to a schema or by using the parsers library. This chapter describes how to read the schema by generating a class and serialize the data using Avro.

The following is a depiction of serializing the data with Avro by generating a class. Here, emp.avsc is the schema file which we pass as input to Avro utility.

Avro WithCode Serializing

The output of this compilation is a Java file.

Serialization by Generating a Class

To serialize the data using Avro, follow the steps as given below −

  • Define an Avro schema.
  • Compile the schema using Avro utility. You get the Java code corresponding to that schema.
  • Populate the schema with the data.
  • Serialize it using Avro library.

Defining a Schema

Suppose you want a schema with the following details −

Field   name     id    age   salary   address
Type    string   int   int   int      string

Create an Avro schema as shown below and save it as emp.avsc.

{
   "namespace": "tutorialspoint.com",
   "type": "record",
   "name": "emp",
   "fields": [
      {"name": "name", "type": "string"},
      {"name": "id", "type": "int"},
      {"name": "salary", "type": "int"},
      {"name": "age", "type": "int"},
      {"name": "address", "type": "string"}
   ]
}

Compiling the Schema

After creating the Avro schema, we need to compile it using Avro tools, which are available in the avro-tools-1.7.7.jar file. We need to provide the path of the avro-tools-1.7.7.jar file at compilation.

Syntax to Compile an Avro Schema

java -jar <path/to/avro-tools-1.7.7.jar> compile schema <path/to/schema-file> <destination-folder>

Open the terminal in the home folder. Create a new directory to work with Avro as shown below −

$ mkdir Avro_Work

In the newly created directory, create three sub-directories −

  • First named schema, to place the schema.
  • Second named with_code_gen, to place the generated code.
  • Third named jars, to place the jar files.
$ mkdir schema
$ mkdir with_code_gen
$ mkdir jars

The following screenshot shows how your Avro_work folder should look after creating all the directories.

Avro Work
  • Now /home/Hadoop/Avro_work/jars/ is the directory where you have placed the downloaded avro-tools-1.7.7.jar file.
  • /home/Hadoop/Avro_work/schema/ is the path for the directory where your schema file emp.avsc is stored.
  • /home/Hadoop/Avro_work/with_code_gen is the directory where you want the generated class files to be stored.

Compile the schema as shown below −

$ java -jar /home/Hadoop/Avro_work/jars/avro-tools-1.7.7.jar compile schema /home/Hadoop/Avro_work/schema/emp.avsc /home/Hadoop/Avro_work/with_code_gen

After this compilation, a package is created in the destination directory with the name mentioned as namespace in the schema file. Within this package, the Java source file with schema name is generated. The generated file contains java code corresponding to the schema. This java file can be directly accessed by an application.

In our example, a package/folder, named tutorialspoint is created which contains another folder named com (since the name space is tutorialspoint.com) and within it, resides the generated file emp.java. The following snapshot shows emp.java

Snapshot of Sample Program

This java file is useful to create data according to schema.

The generated class contains −

  • A default constructor, and a parameterized constructor which accepts all the variables of the schema.
  • The setter and getter methods for all variables in the schema.
  • A getSchema() method which returns the schema.
  • Builder methods (a short usage sketch follows this list).
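As an aside, here is a minimal sketch of how such a record could be built with the builder methods, assuming the emp class generated from emp.avsc follows the usual Avro code-generation pattern (newBuilder(), setter-style builder methods, and build()) −

//Sketch only: assumes the generated emp class is on the classpath
emp e1 = emp.newBuilder()
   .setName("omar")
   .setId(1)
   .setSalary(30000)
   .setAge(21)
   .setAddress("Hyderabad")
   .build();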

Creating and Serializing the Data

First of all, copy the generated java file used in this project into the current directory or import it from where it is located.

Now we can write a new Java file and instantiate the class in the generated file (emp) to add employee data to the schema.

Let us see the procedure to create data according to the schema using apache Avro.

Step 1

Instantiate the generated emp class.

emp e1 = new emp();

Step 2

Using the setter methods, insert the data of the first employee. For example, here we create the details of the employee named Omar.

e1.setName("omar");
e1.setAge(21);
e1.setSalary(30000);
e1.setAddress("Hyderabad");
e1.setId(001);

Similarly, fill in all employee details using setter methods.

Step 3

Create an object of DatumWriter interface using the SpecificDatumWriter class. This converts Java objects into in-memory serialized format. The following example instantiates SpecificDatumWriter class object for emp class.

DatumWriter<emp> empDatumWriter = new SpecificDatumWriter<emp>(emp.class);

Step 4

Instantiate DataFileWriter for the emp class. This class writes a sequence of serialized records of data conforming to a schema, along with the schema itself, in a file. This class requires the DatumWriter object as a parameter to the constructor.

DataFileWriter<emp> empFileWriter = new DataFileWriter<emp>(empDatumWriter);

Step 5

Open a new file to store the data matching the given schema using the create() method. This method requires the schema, and the path of the file where the data is to be stored, as parameters.

In the following example, schema is passed using getSchema() method, and the data file is stored in the path − /home/Hadoop/Avro/serialized_file/emp.avro.

empFileWriter.create(e1.getSchema(),new File("/home/Hadoop/Avro/serialized_file/emp.avro"));

Step 6

Add all the created records to the file using append() method as shown below −

empFileWriter.append(e1);
empFileWriter.append(e2);
empFileWriter.append(e3);

Example – Serialization by Generating a Class

The following complete program shows how to serialize data into a file using Apache Avro −

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;

public class Serialize {
   public static void main(String args[]) throws IOException{
	
      //Instantiating generated emp class
      emp e1=new emp();
	
      //Creating values according the schema
      e1.setName("omar");
      e1.setAge(21);
      e1.setSalary(30000);
      e1.setAddress("Hyderabad");
      e1.setId(001);
	
      emp e2=new emp();
	
      e2.setName("ram");
      e2.setAge(30);
      e2.setSalary(40000);
      e2.setAddress("Hyderabad");
      e2.setId(002);
	
      emp e3=new emp();
	
      e3.setName("robbin");
      e3.setAge(25);
      e3.setSalary(35000);
      e3.setAddress("Hyderabad");
      e3.setId(003);
	
      //Instantiate DatumWriter class
      DatumWriter<emp> empDatumWriter = new SpecificDatumWriter<emp>(emp.class);
      DataFileWriter<emp> empFileWriter = new DataFileWriter<emp>(empDatumWriter);
	
      empFileWriter.create(e1.getSchema(), new File("/home/Hadoop/Avro_Work/with_code_gen/emp.avro"));
	
      empFileWriter.append(e1);
      empFileWriter.append(e2);
      empFileWriter.append(e3);
	
      empFileWriter.close();
	
      System.out.println("data successfully serialized");
   }
}

Browse through the directory where the generated code is placed. In this case, it is at /home/Hadoop/Avro_work/with_code_gen.

In Terminal −

$ cd /home/Hadoop/Avro_work/with_code_gen/

In GUI −

Generated Code

Now copy and save the above program in the file named Serialize.java and compile and execute it as shown below −

$ javac Serialize.java
$ java Serialize

Output

data successfully serialized

If you verify the path given in the program, you can find the generated serialized file as shown below.

Generated Serialized File

As described earlier, one can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. This chapter describes how to read the schema by generating a class and Deserialize the data using Avro.

Deserialization by Generating a Class

In our last example, the serialized data was stored in the file emp.avro. We shall now see how to deserialize it and read it using Avro. The procedure is as follows −

Step 1

Create an object of DatumReader interface using SpecificDatumReader class.

DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);

Step 2

Instantiate the DataFileReader class. This class reads serialized data from a file. It requires the DatumReader object and the path of the file (emp.avro) where the serialized data exists as parameters to the constructor.

DataFileReader<emp> dataFileReader = new DataFileReader<emp>(new File("/path/to/emp.avro"), empDatumReader);

Step 3

Print the deserialized data, using the methods of DataFileReader.

  • The hasNext() method returns true if there are more entries in the Reader.
  • The next() method of DataFileReader returns the next datum in the Reader.
while(dataFileReader.hasNext()){

   em=dataFileReader.next(em);
   System.out.println(em);
}

Example – Deserialization by Generating a Class

The following complete program shows how to deserialize the data in a file using Avro.

import java.io.File;
import java.io.IOException;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.io.DatumReader;
import org.apache.avro.specific.SpecificDatumReader;

public class Deserialize {
   public static void main(String args[]) throws IOException{
	
      //DeSerializing the objects
      DatumReader<emp> empDatumReader = new SpecificDatumReader<emp>(emp.class);
		
      //Instantiating DataFileReader
      DataFileReader<emp> dataFileReader = new DataFileReader<emp>(new
         File("/home/Hadoop/Avro_Work/with_code_genfile/emp.avro"), empDatumReader);
      emp em=null;
		
      while(dataFileReader.hasNext()){
      
         em=dataFileReader.next(em);
         System.out.println(em);
      }
   }
}

Browse into the directory where the generated code is placed. In this case, it is at /home/Hadoop/Avro_work/with_code_gen.

$ cd /home/Hadoop/Avro_work/with_code_gen/

Now, copy and save the above program in a file named Deserialize.java. Compile and execute it as shown below −

$ javac Deserialize.java
$ java Deserialize

Output

{"name": "omar", "id": 1, "salary": 30000, "age": 21, "address": "Hyderabad"}
{"name": "ram", "id": 2, "salary": 40000, "age": 30, "address": "Hyderabad"}
{"name": "robbin", "id": 3, "salary": 35000, "age": 25, "address": "Hyderabad"}

One can read an Avro schema into a program either by generating a class corresponding to a schema or by using the parsers library. In Avro, data is always stored with its corresponding schema. Therefore, we can always read a schema without code generation.

This chapter describes how to read the schema by using parsers library and to serialize the data using Avro.

The following is a depiction of serializing the data with Avro using parser libraries. Here, emp.avsc is the schema file which we pass as input to Avro utility.

Avro Without Code Serialize

Serialization Using Parsers Library

To serialize the data, we need to read the schema, create data according to the schema, and serialize the data using the Avro API. The following procedure serializes the data without generating any code −

Step 1

First of all, read the schema from the file. To do so, use Schema.Parser class. This class provides methods to parse the schema in different formats.

Instantiate the Schema.Parser class by passing the file path where the schema is stored.

Schema schema = new Schema.Parser().parse(new File("/path/to/emp.avsc"));

Step 2

Create the object of GenericRecord interface, by instantiating GenericData.Record class. This constructor accepts a parameter of type Schema. Pass the schema object created in step 1 to its constructor as shown below −

GenericRecord e1 = new GenericData.Record(schema);

Step 3

Insert the values into the record using the put() method of the GenericData.Record class.

e1.put("name", "ramu");
e1.put("id", 001);
e1.put("salary",30000);
e1.put("age", 25);
e1.put("address", "chennai");

Step 4

Create an object of the DatumWriter interface using the GenericDatumWriter class. It converts Java objects into an in-memory serialized format. The following example instantiates a GenericDatumWriter object for the schema −

DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);

Step 5

Instantiate DataFileWriter for GenericRecord. This class writes serialized records of data conforming to a schema, along with the schema itself, in a file. This class requires the DatumWriter object as a parameter to the constructor.

DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);

Step 6

Open a new file to store the data matching to the given schema using create() method. This method requires two parameters −

  • the schema,
  • the path of the file where the data is to be stored.

In the example given below, the schema object obtained in Step 1 is passed, and the serialized data is stored in the file mydata.txt (the path used in the complete program below).

dataFileWriter.create(schema, new File("/path/to/mydata.txt"));

Step 7

Add all the created records to the file using the append() method as shown below.

dataFileWriter.append(e1);
dataFileWriter.append(e2);

Example – Serialization Using Parsers

The following complete program shows how to serialize the data using parsers −

import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;

import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import org.apache.avro.io.DatumWriter;

public class Serialize {
   public static void main(String args[]) throws IOException{
	
      //Instantiating the Schema.Parser class.
      Schema schema = new Schema.Parser().parse(new File("/home/Hadoop/Avro/schema/emp.avsc"));
		
      //Instantiating the GenericRecord class.
      GenericRecord e1 = new GenericData.Record(schema);
		
      //Insert data according to schema
      e1.put("name", "ramu");
      e1.put("id", 001);
      e1.put("salary",30000);
      e1.put("age", 25);
      e1.put("address", "chenni");
		
      GenericRecord e2 = new GenericData.Record(schema);
		
      e2.put("name", "rahman");
      e2.put("id", 002);
      e2.put("salary", 35000);
      e2.put("age", 30);
      e2.put("address", "Delhi");
		
      DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
		
      DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
      dataFileWriter.create(schema, new File("/home/Hadoop/Avro_work/without_code_gen/mydata.txt"));
		
      dataFileWriter.append(e1);
      dataFileWriter.append(e2);
      dataFileWriter.close();
		
      System.out.println("data successfully serialized");
   }
}

Browse into the directory where the serialized data is placed. In this case, it is at /home/Hadoop/Avro_work/without_code_gen.

$ cd /home/Hadoop/Avro_work/without_code_gen/
Without Code Gen

Now copy and save the above program in the file named Serialize.java. Compile and execute it as shown below −

$ javac Serialize.java
$ java Serialize

Output

data successfully serialized

If you verify the path given in the program, you can find the generated serialized file as shown below.

Without Code Gen1

As described earlier, one can read an Avro schema into a program either by generating a class corresponding to the schema or by using the parsers library. This chapter describes how to read the schema using the parsers library and deserialize the data using Avro.

Deserialization Using Parsers Library

In our last example, the serialized data was stored in the file mydata.txt. We shall now see how to deserialize it and read it using Avro. The procedure is as follows −

Step 1

First of all, read the schema from the file. To do so, use Schema.Parser class. This class provides methods to parse the schema in different formats.

Instantiate the Schema.Parser class by passing the file path where the schema is stored.

Schema schema = new Schema.Parser().parse(new File("/path/to/emp.avsc"));

Step 2

Create an object of the DatumReader interface using the GenericDatumReader class.

DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);

Step 3

Instantiate the DataFileReader class. This class reads serialized data from a file. It requires the DatumReader object and the path of the file where the serialized data exists as parameters to the constructor.

DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(new File("/path/to/mydata.txt"), datumReader);

Step 4

Print the deserialized data, using the methods of DataFileReader.

  • The hasNext() method returns true if there are more entries in the Reader.
  • The next() method of DataFileReader returns the next datum in the Reader.
while(dataFileReader.hasNext()){

   em=dataFileReader.next(em);
   System.out.println(em);
}

Example – Deserialization Using Parsers Library

The following complete program shows how to deserialize the serialized data using Parsers library −

import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;

public class Deserialize {
   public static void main(String args[]) throws Exception{
	
      //Instantiating the Schema.Parser class.
      Schema schema = new Schema.Parser().parse(new File("/home/Hadoop/Avro/schema/emp.avsc"));
      DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
      DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(new File("/home/Hadoop/Avro_Work/without_code_gen/mydata.txt"), datumReader);
      GenericRecord emp = null;
		
      while (dataFileReader.hasNext()) {
         emp = dataFileReader.next(emp);
         System.out.println(emp);
      }
      System.out.println("hello");
   }
}

Browse into the directory where the serialized data is placed. In this case, it is at /home/Hadoop/Avro_work/without_code_gen.

$ cd /home/Hadoop/Avro_work/without_code_gen/

Now copy and save the above program in a file named Deserialize.java. Compile and execute it as shown below −

$ javac Deserialize.java
$ java Deserialize

Output

{"name": "ramu", "id": 1, "salary": 30000, "age": 25, "address": "chennai"}
{"name": "rahman", "id": 2, "salary": 35000, "age": 30, "address": "Delhi"}

Apache Mahout – BigData Analysis

A mahout is one who drives an elephant as its master. The name comes from its close association with Apache Hadoop which uses an elephant as its logo.

Hadoop is an open-source framework from Apache that allows to store and process big data in a distributed environment across clusters of computers using simple programming models.

Apache Mahout is an open source project that is primarily used for creating scalable machine learning algorithms. It implements popular machine learning techniques such as:

  • Recommendation
  • Classification
  • Clustering

Apache Mahout started as a sub-project of Apache’s Lucene in 2008. In 2010, Mahout became a top level project of Apache.

Features of Mahout

The primitive features of Apache Mahout are listed below.

The algorithms of Mahout are written on top of Hadoop, so it works well in a distributed environment. Mahout uses the Apache Hadoop library to scale effectively in the cloud.

Mahout offers the coder a ready-to-use framework for doing data mining tasks on large volumes of data.

Mahout lets applications analyze large sets of data effectively and quickly.

Includes several MapReduce enabled clustering implementations such as k-means, fuzzy k-means, Canopy, Dirichlet, and Mean-Shift.

Supports Distributed Naive Bayes and Complementary Naive Bayes classification implementations.

Comes with distributed fitness function capabilities for evolutionary programming.

Includes matrix and vector libraries.

Applications of Mahout

Companies such as Adobe, Facebook, LinkedIn, Foursquare, Twitter, and Yahoo use Mahout internally.

Foursquare helps you in finding out places, food, and entertainment available in a particular area. It uses the recommender engine of Mahout.

Twitter uses Mahout for user interest modelling.

Yahoo! uses Mahout for pattern mining.

What is Machine Learning?

Machine learning is a branch of science that deals with programming the systems in such a way that they automatically learn and improve with experience. Here, learning means recognizing and understanding the input data and making wise decisions based on the supplied data.

It is very difficult to cater to all the decisions based on all possible inputs. To tackle this problem, algorithms are developed. These algorithms build knowledge from specific data and past experience with the principles of statistics, probability theory, logic, combinatorial optimization, search, reinforcement learning, and control theory.

The developed algorithms form the basis of various applications such as:

  • Vision processing
  • Language processing
  • Forecasting (e.g., stock market trends)
  • Pattern recognition
  • Games
  • Data mining
  • Expert systems
  • Robotics

Machine learning is a vast area, and it is quite beyond the scope of this tutorial to cover all its features. There are several ways to implement machine learning techniques; however, the most commonly used ones are supervised and unsupervised learning.

Supervised Learning

Supervised learning deals with learning a function from available training data. A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. Common examples of supervised learning include:

  • classifying e-mails as spam,
  • labeling webpages based on their content, and
  • voice recognition.

There are many supervised learning algorithms such as neural networks, Support Vector Machines (SVMs), and Naive Bayes classifiers. Mahout implements Naive Bayes classifier.

Unsupervised Learning

Unsupervised learning makes sense of unlabeled data without having any predefined dataset for its training. Unsupervised learning is an extremely powerful tool for analyzing available data and looking for patterns and trends. It is most commonly used for clustering similar input into logical groups. Common approaches to unsupervised learning include:

  • k-means
  • self-organizing maps, and
  • hierarchical clustering

Recommendation

Recommendation is a popular technique that provides close recommendations based on user information such as previous purchases, clicks, and ratings.

  • Amazon uses this technique to display a list of recommended items that you might be interested in, drawing information from your past actions. There are recommender engines that work behind Amazon to capture user behavior and recommend selected items based on your earlier actions.
  • Facebook uses the recommender technique to identify and recommend the “people you may know list”.

Classification

Classification, also known as categorization, is a machine learning technique that uses known data to determine how the new data should be classified into a set of existing categories. Classification is a form of supervised learning.

  • Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as spam. The categorization algorithm trains itself by analyzing user habits of marking certain mails as spam. Based on that, the classifier decides whether a future mail should be deposited in your inbox or in the spam folder.
  • iTunes application uses classification to prepare playlists.

Classification

Clustering

Clustering is used to form groups or clusters of similar data based on common characteristics. Clustering is a form of unsupervised learning.

  • Search engines such as Google and Yahoo! use clustering techniques to group data with similar characteristics.
  • Newsgroups use clustering techniques to group various articles based on related topics.

The clustering engine goes through the input data completely and, based on the characteristics of the data, decides under which cluster it should be grouped. Take a look at the following example.

 

Our library of tutorials contains topics on various subjects. When we receive a new tutorial at TutorialsPoint, it gets processed by a clustering engine that decides, based on its content, where it should be grouped.

 

This chapter teaches you how to set up Mahout. Java and Hadoop are the prerequisites of Mahout. Given below are the steps to download and install Java, Hadoop, and Mahout.

Pre-Installation Setup

Before installing Hadoop into Linux environment, we need to set up Linux using ssh (Secure Shell). Follow the steps mentioned below for setting up the Linux environment.

Creating a User

It is recommended to create a separate user for Hadoop to isolate the Hadoop file system from the Unix file system. Follow the steps given below to create a user:

  • Open root using the command “su”.
  • Create a user from the root account using the command “useradd username”.
  • Now you can open an existing user account using the command “su username”.
  • Open the Linux terminal and type the following commands to create a user.
$ su
password:
# useradd hadoop
# passwd hadoop
New passwd:
Retype new passwd

SSH Setup and Key Generation

SSH setup is required to perform different operations on a cluster such as starting, stopping, and distributed daemon shell operations. To authenticate different users of Hadoop, it is required to provide public/private key pair for a Hadoop user and share it with different users.

The following commands are used to generate a key pair using SSH, copy the public key from id_rsa.pub to authorized_keys, and provide owner read and write permissions to the authorized_keys file, respectively.

$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 0600 ~/.ssh/authorized_keys

Verifying ssh

ssh localhost

Installing Java

Java is the main prerequisite for Hadoop and Mahout. First of all, you should verify the existence of Java in your system using “java -version”. The syntax of the Java version command is given below.

$ java -version

It should produce the following output.

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)

If you don’t have Java installed in your system, then follow the steps given below for installing Java.

Step 1

Download java (JDK <latest version> – X64.tar.gz) by visiting the following link: Oracle

Then jdk-7u71-linux-x64.tar.gz is downloaded onto your system.

Step 2

Generally, you find the downloaded Java file in the Downloads folder. Verify it and extract the jdk-7u71-linux-x64.gz file using the following commands.

$ cd Downloads/
$ ls
jdk-7u71-linux-x64.gz
$ tar zxf jdk-7u71-linux-x64.gz
$ ls
jdk1.7.0_71 jdk-7u71-linux-x64.gz

Step 3

To make Java available to all the users, you need to move it to the location “/usr/local/”. Open root, and type the following commands.

$ su
password:
# mv jdk1.7.0_71 /usr/local/
# exit

Step 4

For setting up PATH and JAVA_HOME variables, add the following commands to ~/.bashrc file.

export JAVA_HOME=/usr/local/jdk1.7.0_71
export PATH= $PATH:$JAVA_HOME/bin

Now, verify the java -version command from terminal as explained above.

Downloading Hadoop

After installing Java, you need to install Hadoop. Verify the existence of Hadoop using the “hadoop version” command as shown below.

hadoop version

It should produce the following output:

Hadoop 2.6.0
Compiled by jenkins on 2014-11-13T21:10Z
Compiled with protoc 2.5.0
From source with checksum 18e43357c8f927c0695f1e9522859d6a
This command was run using /home/hadoop/hadoop/share/hadoop/common/hadoop-common-2.6.0.jar

If your system is unable to locate Hadoop, then download Hadoop and have it installed on your system. Follow the commands given below to do so.

Download and extract hadoop-2.6.0 from apache software foundation using the following commands.

$ su
password:
# cd /usr/local
# wget http://mirrors.advancedhosters.com/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz
# tar xzf hadoop-2.6.0-src.tar.gz
# mv hadoop-2.6.0/* hadoop/
# exit

Installing Hadoop

Install Hadoop in any of the required modes. Here, we are demonstrating Mahout functionalities in pseudo-distributed mode, therefore install Hadoop in pseudo-distributed mode.

Follow the steps given below to install Hadoop 2.4.1 on your system.

Step 1: Setting up Hadoop

You can set Hadoop environment variables by appending the following commands to ~/.bashrc file.

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME

export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native

export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_INSTALL=$HADOOP_HOME

Now, apply all changes into the currently running system.

$ source ~/.bashrc

Step 2: Hadoop Configuration

You can find all the Hadoop configuration files at the location “$HADOOP_HOME/etc/hadoop”. It is required to make changes in those configuration files according to your Hadoop infrastructure.

$ cd $HADOOP_HOME/etc/hadoop

In order to develop Hadoop programs in Java, you need to reset the Java environment variables in hadoop-env.sh file by replacing JAVA_HOME value with the location of Java in your system.

export JAVA_HOME=/usr/local/jdk1.7.0_71

Given below is the list of files that you have to edit to configure Hadoop.

core-site.xml

The core-site.xml file contains information such as the port number used for Hadoop instance, memory allocated for file system, memory limit for storing data, and the size of Read/Write buffers.

Open core-site.xml and add the following property in between the <configuration>, </configuration> tags:

<configuration>
   <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
   </property>
</configuration>

hdfs-site.xml

The hdfs-site.xml file contains information such as the value of replication data, namenode path, and datanode paths of your local file systems. It means the place where you want to store the Hadoop infrastructure.

Let us assume the following data:

dfs.replication (data replication value) = 1

(In the below given path /hadoop/ is the user name.
hadoopinfra/hdfs/namenode is the directory created by hdfs file system.)
namenode path = //home/hadoop/hadoopinfra/hdfs/namenode

(hadoopinfra/hdfs/datanode is the directory created by hdfs file system.)
datanode path = //home/hadoop/hadoopinfra/hdfs/datanode

Open this file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>dfs.replication</name>
      <value>1</value>
   </property>
	
   <property>
      <name>dfs.name.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/namenode</value>
   </property>
	
   <property>
      <name>dfs.data.dir</name>
      <value>file:///home/hadoop/hadoopinfra/hdfs/datanode</value>
   </property>
</configuration>

Note: In the above file, all the property values are user defined. You can make changes according to your Hadoop infrastructure.

yarn-site.xml

This file is used to configure YARN into Hadoop. Open the yarn-site.xml file and add the following property in between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
</configuration>

mapred-site.xml

This file is used to specify which MapReduce framework we are using. By default, Hadoop contains a template of mapred-site.xml. First of all, it is required to copy the file from mapred-site.xml.template to mapred-site.xml file using the following command.

$ cp mapred-site.xml.template mapred-site.xml

Open mapred-site.xml file and add the following properties in between the <configuration>, </configuration> tags in this file.

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
</configuration>

Verifying Hadoop Installation

The following steps are used to verify the Hadoop installation.

Step 1: Name Node Setup

Set up the namenode using the command “hdfs namenode -format” as follows:

$ cd ~
$ hdfs namenode -format

The expected result is as follows:

10/24/14 21:30:55 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost/192.168.1.11
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.4.1
...
...
10/24/14 21:30:56 INFO common.Storage: Storage directory
/home/hadoop/hadoopinfra/hdfs/namenode has been successfully formatted.
10/24/14 21:30:56 INFO namenode.NNStorageRetentionManager: Going to retain
1 images with txid >= 0
10/24/14 21:30:56 INFO util.ExitUtil: Exiting with status 0
10/24/14 21:30:56 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost/192.168.1.11
************************************************************/

Step 2: Verifying Hadoop dfs

The following command is used to start dfs. This command starts your Hadoop file system.

$ start-dfs.sh

The expected output is as follows:

10/24/14 21:37:56
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-namenode-localhost.out
localhost: starting datanode, logging to /home/hadoop/hadoop-
2.4.1/logs/hadoop-hadoop-datanode-localhost.out
Starting secondary namenodes [0.0.0.0]

Step 3: Verifying Yarn Script

The following command is used to start the yarn script. Executing this command will start your yarn daemons.

$ start-yarn.sh

The expected output is as follows:

starting yarn daemons
starting resource manager, logging to /home/hadoop/hadoop-2.4.1/logs/yarn-
hadoop-resourcemanager-localhost.out
localhost: starting node manager, logging to /home/hadoop/hadoop-
2.4.1/logs/yarn-hadoop-nodemanager-localhost.out

Step 4: Accessing Hadoop on Browser

The default port number to access hadoop is 50070. Use the following URL to get Hadoop services on your browser.

http://localhost:50070/

Accessing Hadoop

Step 5: Verify All Applications for Cluster

The default port number to access all applications of the cluster is 8088. Use the following URL to visit this service.

http://localhost:8088/

Applications for Cluster

Downloading Mahout

Mahout is available on the Apache Mahout website. Download Mahout from the link provided on the website. Here is the screenshot of the website.

Downloading Mahout

Step 1

Download Apache mahout from the link http://mirror.nexcess.net/apache/mahout/ using the following command.

[Hadoop@localhost ~]$ wget
http://mirror.nexcess.net/apache/mahout/0.9/mahout-distribution-0.9.tar.gz

Then mahout-distribution-0.9.tar.gz will be downloaded in your system.

Step 2

Browse through the folder where mahout-distribution-0.9.tar.gz is stored and extract the downloaded archive as shown below.

[Hadoop@localhost ~]$ tar zxvf mahout-distribution-0.9.tar.gz

Maven Repository

Given below are the pom.xml dependencies to build Apache Mahout using Eclipse.

<dependency>
   <groupId>org.apache.mahout</groupId>
   <artifactId>mahout-core</artifactId>
   <version>0.9</version>
</dependency>

<dependency>
   <groupId>org.apache.mahout</groupId>
   <artifactId>mahout-math</artifactId>
   <version>${mahout.version}</version>
</dependency>

<dependency>
   <groupId>org.apache.mahout</groupId>
   <artifactId>mahout-integration</artifactId>
   <version>${mahout.version}</version>
</dependency>

This chapter covers the popular machine learning technique called recommendation, its mechanisms, and how to write an application implementing Mahout recommendation.

Recommendation

Ever wondered how Amazon comes up with a list of recommended items to draw your attention to a particular product that you might be interested in?

Suppose you want to purchase the book “Mahout in Action” from Amazon:

Mahout in Action

Along with the selected product, Amazon also displays a list of related recommended items, as shown below.

Items

Such recommendation lists are produced with the help of recommender engines. Mahout provides recommender engines of several types such as:

  • user-based recommenders,
  • item-based recommenders, and
  • several other algorithms.

Mahout Recommender Engine

Mahout has a non-distributed, non-Hadoop-based recommender engine. You should pass a text document containing user preferences for items, and the output of this engine would be the estimated preferences of a particular user for other items.

Example

Consider a website that sells consumer goods such as mobiles, gadgets, and their accessories. If we want to implement the features of Mahout in such a site, then we can build a recommender engine. This engine analyzes past purchase data of the users and recommends new products based on that.

The components provided by Mahout to build a recommender engine are as follows:

  • DataModel
  • UserSimilarity
  • ItemSimilarity
  • UserNeighborhood
  • Recommender

From the data store, the data model is prepared and is passed as an input to the recommender engine. The Recommender engine generates the recommendations for a particular user. Given below is the architecture of recommender engine.

Architecture of Recommender Engine

Recommender Engine

Building a Recommender using Mahout

Here are the steps to develop a simple recommender:

Step 1: Create DataModel Object

The constructor of PearsonCorrelationSimilarity class requires a data model object, which holds a file that contains the Users, Items, and Preferences details of a product. Here is the sample data model file:

1,00,1.0
1,01,2.0
1,02,5.0
1,03,5.0
1,04,5.0

2,00,1.0
2,01,2.0
2,05,5.0
2,06,4.5
2,02,5.0

3,01,2.5
3,02,5.0
3,03,4.0
3,04,3.0

4,00,5.0
4,01,5.0
4,02,5.0
4,03,0.0

The DataModel object requires a File object that points to the input file. Create the DataModel object as shown below.

DataModel datamodel = new FileDataModel(new File("input file"));

Step 2: Create UserSimilarity Object

Create UserSimilarity object using PearsonCorrelationSimilarity class as shown below:

UserSimilarity similarity = new PearsonCorrelationSimilarity(datamodel);

Step 3: Create UserNeighborhood Object

This object computes a “neighborhood” of users like a given user. There are two types of neighborhoods:

  • NearestNUserNeighborhood – This class computes a neighborhood consisting of the nearest n users to a given user. “Nearest” is defined by the given UserSimilarity.
  • ThresholdUserNeighborhood – This class computes a neighborhood consisting of all the users whose similarity to the given user meets or exceeds a certain threshold. Similarity is defined by the given UserSimilarity.

Here we are using ThresholdUserNeighborhood and set the threshold to 3.0.

UserNeighborhood neighborhood = new ThresholdUserNeighborhood(3.0, similarity, datamodel);
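
If a fixed-size neighborhood is preferred instead, the NearestNUserNeighborhood class (from the same org.apache.mahout.cf.taste.impl.neighborhood package) can be used. The sketch below assumes a neighborhood size of 3 and is purely illustrative.

UserNeighborhood neighborhood = new NearestNUserNeighborhood(3, similarity, datamodel);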

Step 4: Create Recommender Object

Create a UserBasedRecommender object. Pass all the objects created above to its constructor as shown below.

UserBasedRecommender recommender = new GenericUserBasedRecommender(datamodel, neighborhood, similarity);

Step 5: Recommend Items to a User

Recommend products to a user using the recommend() method of the Recommender interface. This method requires two parameters: the first is the user id of the user to whom we need to send the recommendations, and the second is the number of recommendations to be sent. Here is the usage of the recommend() method:

List<RecommendedItem> recommendations = recommender.recommend(2, 3);

for (RecommendedItem recommendation : recommendations) {
   System.out.println(recommendation);
 }
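
If you only need the predicted rating of a single item rather than a ranked list, the Recommender interface also exposes an estimatePreference() method. A minimal sketch is given below; the user id 2 and item id 4 are taken from the sample data above.

float estimate = recommender.estimatePreference(2, 4);
System.out.println("Estimated preference: " + estimate);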

Example Program

Given below is an example program to generate recommendations. It prepares the recommendations for the user with user id 2.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;

import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;

import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.UserBasedRecommender;

import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class Recommender {
   public static void main(String args[]){
      try{
         //Creating data model
         DataModel datamodel = new FileDataModel(new File("data")); //path to the preferences file
      
         //Creating UserSimilarity object.
         UserSimilarity usersimilarity = new PearsonCorrelationSimilarity(datamodel);
      
         //Creating UserNeighborhood object.
         UserNeighborhood userneighborhood = new ThresholdUserNeighborhood(3.0, usersimilarity, datamodel);
      
         //Creating UserBasedRecommender object.
         UserBasedRecommender recommender = new GenericUserBasedRecommender(datamodel, userneighborhood, usersimilarity);
        
         List<RecommendedItem> recommendations = recommender.recommend(2, 3);
			
         for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
         }
      
      }catch(Exception e){
         e.printStackTrace();
      }
      
   }
}

Compile and run the program using the following commands (the Mahout and Hadoop jars must be on the classpath):

javac Recommender.java
java Recommender

It should produce the following output:

RecommendedItem [item:3, value:4.5]
RecommendedItem [item:4, value:4.0]
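
As listed earlier, Mahout also provides item-based recommenders. Given below is a minimal, illustrative sketch of how the same data model could drive a GenericItemBasedRecommender; it reuses the datamodel variable from the example above and relies on the fact that PearsonCorrelationSimilarity also implements ItemSimilarity. The required imports are org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender, org.apache.mahout.cf.taste.recommender.ItemBasedRecommender, and org.apache.mahout.cf.taste.similarity.ItemSimilarity.

//Item-based variant: recommendations are computed from item-to-item similarity,
//so no UserNeighborhood object is needed.
ItemSimilarity itemsimilarity = new PearsonCorrelationSimilarity(datamodel);
ItemBasedRecommender itemrecommender = new GenericItemBasedRecommender(datamodel, itemsimilarity);

//Top 3 recommendations for the user with user id 2
for (RecommendedItem recommendation : itemrecommender.recommend(2, 3)) {
   System.out.println(recommendation);
}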

Applications of Clustering

  • Clustering is broadly used in many applications such as market research, pattern recognition, data analysis, and image processing.
  • Clustering can help marketers discover distinct groups in their customer base and characterize those groups based on purchasing patterns.
  • In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality and gain insight into structures inherent in populations.
  • Clustering helps in identification of areas of similar land use in an earth observation database.
  • Clustering also helps in classifying documents on the web for information discovery.
  • Clustering is used in outlier detection applications such as detection of credit card fraud.
  • As a data mining function, Cluster Analysis serves as a tool to gain insight into the distribution of data to observe characteristics of each cluster.

Using Mahout, we can cluster a given set of data. The steps required are as follows:

  • Algorithm − You need to select a suitable clustering algorithm to group the elements of a cluster.
  • Similarity and Dissimilarity − You need to have a rule in place to verify the similarity between the newly encountered elements and the elements in the groups.
  • Stopping Condition − A stopping condition is required to define the point where no clustering is required.

Procedure of Clustering

To cluster the given data you need to –

  • Start the Hadoop server. Create required directories for storing files in Hadoop File System. (Create directories for input file, sequence file, and clustered output in case of canopy).
  • Copy the input file to the Hadoop File system from Unix file system.
  • Prepare the sequence file from the input data.
  • Run any of the available clustering algorithms.
  • Get the clustered data.

Starting Hadoop

Mahout works with Hadoop, hence make sure that the Hadoop server is up and running.

$ cd $HADOOP_HOME/bin
$ ./start-all.sh

Preparing Input File Directories

Create directories in the Hadoop file system to store the input file, sequence files, and clustered data using the following command:

$ hadoop fs -mkdir -p /mahout_data
$ hadoop fs -mkdir -p /clustered_data
$ hadoop fs -mkdir -p /mahout_seq

You can verify whether the directories are created using the Hadoop web interface at the following URL: http://localhost:50070/

It gives you the output as shown below:

Input Files Directories

Copying Input File to HDFS

Now, copy the input data file from the Linux file system to mahout_data directory in the Hadoop File System as shown below. Assume your input file is mydata.txt and it is in the /home/Hadoop/data/ directory.

$ hadoop fs -put /home/Hadoop/data/mydata.txt /mahout_data/

Preparing the Sequence File

Mahout provides a utility to convert the given input file into a sequence file format. This utility requires two parameters.

  • The input file directory where the original data resides.
  • The output directory where the sequence files are to be stored.

Given below is the help prompt of mahout seqdirectory utility.

Step 1: Browse to the Mahout home directory. You can get help for the utility as shown below:

[Hadoop@localhost bin]$ ./mahout seqdirectory --help
Job-Specific Options:
--input (-i) input Path to job input directory.
--output (-o) output The directory pathname for output.
--overwrite (-ow) If present, overwrite the output directory

Generate the sequence file with the utility using the following syntax:

mahout seqdirectory -i <input file path> -o <output directory>

Example

mahout seqdirectory
-i hdfs://localhost:9000/mahout_data/
-o hdfs://localhost:9000/mahout_seq/

Clustering Algorithms

Mahout supports two main algorithms for clustering namely:

  • Canopy clustering
  • K-means clustering

Canopy Clustering

Canopy clustering is a simple and fast technique used by Mahout for clustering purposes. The objects are treated as points in a plane. This technique is often used as an initial step in other clustering techniques such as k-means clustering. You can run a Canopy job using the following syntax:

mahout canopy -i <input vectors directory>
-o <output directory>
-t1 <threshold value 1>
-t2 <threshold value 2>

Canopy job requires an input file directory with the sequence file and an output directory where the clustered data is to be stored.

Example

mahout canopy -i hdfs://localhost:9000/mahout_seq/mydata.seq
-o hdfs://localhost:9000/clustered_data
-t1 20
-t2 30 

You will get the clustered data generated in the given output directory.

K-means Clustering

K-means clustering is an important clustering algorithm. The k in the k-means clustering algorithm represents the number of clusters the data is to be divided into. For example, if the k value specified to this algorithm is 3, the algorithm divides the data into 3 clusters.

Each object is represented as a vector in space. Initially, k points are chosen by the algorithm at random and treated as centers; every object is assigned to its closest center. There are several algorithms for the distance measure, and the user should choose the required one. The core idea is sketched below.
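
To make the idea concrete, here is a small, self-contained Java sketch of the basic in-memory k-means loop. It is an illustration of the algorithm only, not Mahout's distributed implementation; the sample points, the value of k, and the iteration count are arbitrary.

import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {
   public static void main(String[] args) {
      //Two-dimensional sample points (arbitrary values)
      double[][] points = {{1, 1}, {1.5, 2}, {3, 4}, {5, 7}, {3.5, 5}, {4.5, 5}, {3.5, 4.5}};
      int k = 2;
      int iterations = 10;
      Random random = new Random(42);

      //Pick k points at random as the initial centers
      double[][] centers = new double[k][];
      for (int i = 0; i < k; i++) {
         centers[i] = points[random.nextInt(points.length)].clone();
      }

      int[] assignment = new int[points.length];
      for (int iter = 0; iter < iterations; iter++) {
         //Assignment step: attach every point to its closest center
         for (int p = 0; p < points.length; p++) {
            int best = 0;
            for (int c = 1; c < k; c++) {
               if (distance(points[p], centers[c]) < distance(points[p], centers[best])) {
                  best = c;
               }
            }
            assignment[p] = best;
         }
         //Update step: move every center to the mean of the points assigned to it
         for (int c = 0; c < k; c++) {
            double sumX = 0, sumY = 0;
            int count = 0;
            for (int p = 0; p < points.length; p++) {
               if (assignment[p] == c) {
                  sumX += points[p][0];
                  sumY += points[p][1];
                  count++;
               }
            }
            if (count > 0) {
               centers[c][0] = sumX / count;
               centers[c][1] = sumY / count;
            }
         }
      }
      System.out.println("Centers: " + Arrays.deepToString(centers));
      System.out.println("Assignments: " + Arrays.toString(assignment));
   }

   static double distance(double[] a, double[] b) {
      double dx = a[0] - b[0];
      double dy = a[1] - b[1];
      return Math.sqrt(dx * dx + dy * dy);
   }
}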

Creating Vector Files

  • Unlike the Canopy algorithm, the k-means algorithm requires vector files as input; therefore, you have to create vector files.
  • To generate vector files from the sequence file format, Mahout provides the seq2sparse utility.

Given below are some of the options of the seq2sparse utility. Create vector files using these options.

$MAHOUT_HOME/bin/mahout seq2sparse
--analyzerName (-a) analyzerName  The class name of the analyzer
--chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes.
--output (-o) output              The directory pathname for o/p
--input (-i) input                Path to job input directory.

After creating vectors, proceed with k-means algorithm. The syntax to run k-means job is as follows:

mahout kmeans -i <input vectors directory>
-c  <input clusters directory>
-o  <output working directory>
-dm <Distance Measure technique>
-x  <maximum number of iterations>
-k  <number of initial clusters>

K-means clustering job requires input vector directory, output clusters directory, distance measure, maximum number of iterations to be carried out, and an integer value representing the number of clusters the input data is to be divided into.

What is Classification?

Classification is a machine learning technique that uses known data to determine how the new data should be classified into a set of existing categories. For example,

  • iTunes application uses classification to prepare playlists.
  • Mail service providers such as Yahoo! and Gmail use this technique to decide whether a new mail should be classified as spam. The categorization algorithm trains itself by analyzing user habits of marking certain mails as spam. Based on that, the classifier decides whether a future mail should be deposited in your inbox or in the spam folder.

How Classification Works

While classifying a given set of data, the classifier system performs the following actions:

  • Initially a new data model is prepared using any of the learning algorithms.
  • Then the prepared data model is tested.
  • Thereafter, this data model is used to evaluate the new data and to determine its class.

Classification Works

Applications of Classification

  • Credit card fraud detection – The Classification mechanism is used to predict credit card frauds. Using historical information of previous frauds, the classifier can predict which future transactions may turn into frauds.
  • Spam e-mails – Depending on the characteristics of previous spam mails, the classifier determines whether a newly encountered e-mail should be sent to the spam folder.

Naive Bayes Classifier

Mahout uses the Naive Bayes classifier algorithm. It uses two implementations:

  • Distributed Naive Bayes classification
  • Complementary Naive Bayes classification

Naive Bayes is a simple technique for constructing classifiers. It is not a single algorithm for training such classifiers, but a family of algorithms. A Bayes classifier constructs models to classify problem instances. These classifications are made using the available data.

An advantage of naive Bayes is that it only requires a small amount of training data to estimate the parameters necessary for classification.

For some types of probability models, naive Bayes classifiers can be trained very efficiently in a supervised learning setting.

Despite its oversimplified assumptions, naive Bayes classifiers have worked quite well in many complex real-world situations.
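
In essence, a naive Bayes classifier picks the class c that maximizes P(c) multiplied by the product of P(word | c) over the words of the document, usually computed in log space. The short Java sketch below illustrates only this decision rule; the class priors, word likelihoods, and the sample document are made up, and this is not the Mahout API.

import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class NaiveBayesSketch {
   public static void main(String[] args) {
      //Made-up class priors and per-class word likelihoods (requires Java 9+ for Map.of)
      Map<String, Double> priors = Map.of("spam", 0.4, "ham", 0.6);
      Map<String, Map<String, Double>> likelihoods = Map.of(
         "spam", Map.of("offer", 0.20, "meeting", 0.02, "free", 0.25),
         "ham", Map.of("offer", 0.03, "meeting", 0.15, "free", 0.05));

      //A tiny "document" to classify
      List<String> words = Arrays.asList("free", "offer");

      String bestClass = null;
      double bestScore = Double.NEGATIVE_INFINITY;
      for (String c : priors.keySet()) {
         //log P(c) + sum of log P(word | c); unseen words get a small smoothing value
         double score = Math.log(priors.get(c));
         for (String w : words) {
            score += Math.log(likelihoods.get(c).getOrDefault(w, 1e-6));
         }
         if (score > bestScore) {
            bestScore = score;
            bestClass = c;
         }
      }
      System.out.println("Predicted class: " + bestClass);
   }
}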

Procedure of Classification

The following steps are to be followed to implement Classification:

  • Generate example data
  • Create sequence files from data
  • Convert sequence files to vectors
  • Train the vectors
  • Test the vectors

Step1: Generate Example Data

Generate or download the data to be classified. For example, you can get the 20 newsgroups example data from the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Create a directory for storing the input data, then download and extract the example data as shown below.

$ mkdir classification_example
$ cd classification_example
$ wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
$ tar xzvf 20news-bydate.tar.gz

Step 2: Create Sequence Files

Create a sequence file from the example data using the seqdirectory utility. The syntax to generate the sequence file is given below:

mahout seqdirectory -i <input file path> -o <output directory>

Step 3: Convert Sequence Files to Vectors

Create vector files from the sequence files using the seq2sparse utility. The options of the seq2sparse utility are given below:

$MAHOUT_HOME/bin/mahout seq2sparse
--analyzerName (-a) analyzerName  The class name of the analyzer
--chunkSize (-chunk) chunkSize    The chunkSize in MegaBytes.
--output (-o) output              The directory pathname for o/p
--input (-i) input                Path to job input directory. 

Step 4: Train the Vectors

Train the generated vectors using the trainnb utility. The options to use trainnb utility are given below:

mahout trainnb
 -i ${PATH_TO_TFIDF_VECTORS}
 -el
 -o ${PATH_TO_MODEL}/model
 -li ${PATH_TO_MODEL}/labelindex
 -ow
 -c

Step 5: Test the Vectors

Test the vectors using testnb utility. The options to use testnb utility are given below:

mahout testnb
 -i ${PATH_TO_TFIDF_TEST_VECTORS}
 -m ${PATH_TO_MODEL}/model
 -l ${PATH_TO_MODEL}/labelindex
 -ow
 -o ${PATH_TO_OUTPUT}
 -c
 -seq

Teradata – New Technology RDBMS

Teradata is a popular Relational Database Management System (RDBMS) suitable for large data warehousing applications. It is capable of handling large volumes of data and is highly scalable.

It is mainly suitable for building large-scale data warehousing applications, which it achieves through the concept of parallelism. It is developed by the company Teradata.

Teradata Features

  • Unlimited Parallelism − Teradata database system is based on Massively Parallel Processing (MPP) Architecture. MPP architecture divides the workload evenly across the entire system. Teradata system splits the task among its processes and runs them in parallel to ensure that the task is completed quickly.
  • Shared Nothing Architecture − Teradata’s architecture is called Shared Nothing Architecture. Teradata Nodes, its Access Module Processors (AMPs), and the disks associated with AMPs work independently. They are not shared with others.
  • Linear Scalability − Teradata systems are highly scalable. They can scale up to 2048 Nodes. For example, you can double the capacity of the system by doubling the number of AMPs.
  • Connectivity − Teradata can connect to Channel-attached systems such as Mainframe or Network-attached systems.
  • Mature Optimizer − The Teradata optimizer is one of the most mature optimizers in the market. It has been designed to be parallel since its beginning and has been refined with each release.
  • SQL − Teradata supports industry standard SQL to interact with the data stored in tables. In addition to this, it provides its own extensions.
  • Robust Utilities − Teradata provides robust utilities to import/export data from/to the Teradata system, such as FastLoad, MultiLoad, FastExport, and TPT.
  • Automatic Distribution − Teradata automatically distributes the data evenly to the disks without any manual intervention.

Components of Teradata

The key components of Teradata are as follows −

  • Node − It is the basic unit of a Teradata system. Each individual server in a Teradata system is referred to as a Node. A node consists of its own operating system, CPU, memory, its own copy of the Teradata RDBMS software, and disk space. A cabinet consists of one or more Nodes.
  • Parsing Engine − The Parsing Engine is responsible for receiving queries from the client and preparing an efficient execution plan. The responsibilities of the Parsing Engine are −
    • Receive the SQL query from the client
    • Parse the SQL query and check for syntax errors
    • Check if the user has the required privileges on the objects used in the SQL query
    • Check if the objects used in the SQL query actually exist
    • Prepare the execution plan to execute the SQL query and pass it to BYNET
    • Receive the results from the AMPs and send them to the client
  • Message Passing Layer − The Message Passing Layer, called BYNET, is the networking layer in a Teradata system. It allows communication between the PE and the AMPs, and also between the nodes. It receives the execution plan from the Parsing Engine and sends it to the AMPs. Similarly, it receives the results from the AMPs and sends them to the Parsing Engine.
  • Access Module Processor (AMP) − AMPs, called Virtual Processors (vprocs), are the components that actually store and retrieve the data. AMPs receive the data and execution plan from the Parsing Engine, perform any data type conversion, aggregation, filtering, and sorting, and store the data on the disks associated with them. Records from the tables are evenly distributed among the AMPs in the system. Each AMP is associated with a set of disks on which data is stored, and only that AMP can read/write data from those disks.

Storage Architecture

When the client runs queries to insert records, the Parsing Engine sends the records to BYNET. BYNET retrieves the records and sends each row to the target AMP. The AMP stores these records on its disks. The following diagram shows the storage architecture of Teradata.

Storage Architecture

Retrieval Architecture

When the client runs queries to retrieve records, the Parsing Engine sends a request to BYNET. BYNET sends the retrieval request to the appropriate AMPs. The AMPs then search their disks in parallel, identify the required records, and send them to BYNET. BYNET then sends the records to the Parsing Engine, which in turn sends them to the client. The following is the retrieval architecture of Teradata.

Retrieval Architecture

Primary Key

A primary key is used to uniquely identify a row in a table. No duplicate values are allowed in a primary key column, and it cannot accept NULL values. It is a mandatory field in a table.

Foreign Key

Foreign keys are used to build a relationship between tables. A foreign key in a child table refers to the primary key in the parent table. A table can have more than one foreign key. A foreign key can accept duplicate values and also null values. Foreign keys are optional in a table.

Table Types

Teradata supports different types of tables.

  • Permanent Table − This is the default table and it contains data inserted by the user and stores the data permanently.
  • Volatile Table − The data inserted into a volatile table is retained only during the user session. The table and data is dropped at the end of the session. These tables are mainly used to hold the intermediate data during data transformation.
  • Global Temporary Table − The definition of a Global Temporary table is persistent, but the data in the table is deleted at the end of the user session.
  • Derived Table − A derived table holds intermediate results in a query. Its lifetime is within the query in which it is created, used, and dropped.

Set Versus Multiset

Teradata classifies tables as SET or MULTISET tables based on how duplicate records are handled. A table defined as a SET table does not store duplicate records, whereas a MULTISET table can store duplicate records.
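
To illustrate the difference, here is a minimal, hypothetical JDBC sketch in Java that creates one SET table and one MULTISET table. The host name, database, credentials, and table definitions are made up, and the connection URL format assumes the standard Teradata JDBC driver (terajdbc4) is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class TeradataTableTypes {
   public static void main(String[] args) throws Exception {
      //Hypothetical host, database, and credentials
      String url = "jdbc:teradata://tdhost/DATABASE=mydb";
      try (Connection con = DriverManager.getConnection(url, "dbuser", "dbpassword");
           Statement stmt = con.createStatement()) {

         //SET table: duplicate rows are rejected
         stmt.executeUpdate("CREATE SET TABLE mydb.employee ("
            + "emp_no INTEGER, emp_name VARCHAR(50)) "
            + "UNIQUE PRIMARY INDEX (emp_no)");

         //MULTISET table: duplicate rows are allowed
         stmt.executeUpdate("CREATE MULTISET TABLE mydb.employee_log ("
            + "emp_no INTEGER, emp_action VARCHAR(20)) "
            + "PRIMARY INDEX (emp_no)");
      }
   }
}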

 


Why and How Google MapReduce Technology Works?

Big Data is a collection of large datasets that cannot be processed using traditional computing techniques. For example, the volume of data that Facebook or YouTube needs to collect and manage on a daily basis can fall under the category of Big Data. However, Big Data is not only about scale and volume; it also involves one or more of the following aspects − Velocity, Variety, Volume, and Complexity.

Traditional Enterprise Systems normally have a centralized server to store and process data. The following illustration depicts a schematic view of a traditional enterprise system. The traditional model is certainly not suitable for processing huge volumes of scalable data, and such data cannot be accommodated by standard database servers. Moreover, the centralized system creates too much of a bottleneck while processing multiple files simultaneously.
Traditional Enterprise System View

Google solved this bottleneck issue using an algorithm called MapReduce. MapReduce divides a task into small parts and assigns them to many computers. Later, the results are collected at one place and integrated to form the result dataset.

How MapReduce Works?

The MapReduce algorithm contains two important tasks, namely Map and Reduce.

The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).

The Reduce task takes the output from the Map as an input and combines those data tuples (key-value pairs) into a smaller set of tuples.

The reduce task is always performed after the map job.

Let us now take a close look at each of the phases and try to understand their significance.
Phases

Input Phase − Here we have a Record Reader that translates each record in an input file and sends the parsed data to the mapper in the form of key-value pairs.

Map − Map is a user-defined function, which takes a series of key-value pairs and processes each one of them to generate zero or more key-value pairs.

Intermediate Keys − The key-value pairs generated by the mapper are known as intermediate keys.

Combiner − A combiner is a type of local Reducer that groups similar data from the map phase into identifiable sets. It takes the intermediate keys from the mapper as input and applies a user-defined code to aggregate the values in a small scope of one mapper. It is not a part of the main MapReduce algorithm; it is optional.

Shuffle and Sort − The Reducer task starts with the Shuffle and Sort step. It downloads the grouped key-value pairs onto the local machine, where the Reducer is running. The individual key-value pairs are sorted by key into a larger data list. The data list groups the equivalent keys together so that their values can be iterated easily in the Reducer task.

Reducer − The Reducer takes the grouped key-value paired data as input and runs a Reducer function on each one of them. Here, the data can be aggregated, filtered, and combined in a number of ways, and it requires a wide range of processing. Once the execution is over, it gives zero or more key-value pairs to the final step.

Output Phase − In the output phase, we have an output formatter that translates the final key-value pairs from the Reducer function and writes them onto a file using a record writer.

Let us take a real-world example to comprehend the power of MapReduce. Twitter receives around 500 million tweets per day, which is nearly 3000 tweets per second. The following illustration shows how Twitter manages its tweets with the help of MapReduce.

Tokenize − Tokenizes the tweets into maps of tokens and writes them as key-value pairs.

Filter − Filters unwanted words from the maps of tokens and writes the filtered maps as key-value pairs.

Count − Generates a token counter per word.

Aggregate Counters − Prepares an aggregate of similar counter values into small manageable units.
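
To make the Map and Reduce tasks concrete, given below is a minimal word-count style job written against the Hadoop Java MapReduce API; it corresponds roughly to the Tokenize and Count steps above. The input and output paths are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TokenCount {
   //Map: tokenize every line and emit (word, 1) pairs
   public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
      private final static IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
         StringTokenizer itr = new StringTokenizer(value.toString());
         while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
         }
      }
   }

   //Reduce: sum the counts of every word
   public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
         int sum = 0;
         for (IntWritable val : values) {
            sum += val.get();
         }
         context.write(key, new IntWritable(sum));
      }
   }

   public static void main(String[] args) throws Exception {
      Job job = Job.getInstance(new Configuration(), "token count");
      job.setJarByClass(TokenCount.class);
      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);   //optional local reducer
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path("/input"));      //placeholder input path
      FileOutputFormat.setOutputPath(job, new Path("/output"));   //placeholder output path
      System.exit(job.waitForCompletion(true) ? 0 : 1);
   }
}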


Apache Maven: Fast Automated Build and Deployment in Development

Apache Maven: Automated Build and Deployment

Maven is a project management and comprehension tool. Maven provides developers with a complete build lifecycle framework. A development team can automate the project’s build infrastructure in almost no time, as Maven uses a standard directory layout and a default build lifecycle.

In an environment with multiple development teams, Maven can set up a standard way of working in a very short time. As most project setups are simple and reusable, Maven makes the life of a developer easy when creating reports, checks, builds, and test automation setups.

Maven provides developers ways to manage the following:

  • Builds
  • Documentation
  • Reporting
  • Dependencies
  • SCMs
  • Releases
  • Distribution
  • Mailing lists

To summarize, Maven simplifies and standardizes the project build process. It handles compilation, distribution, documentation, team collaboration, and other tasks seamlessly. Maven increases reusability and takes care of most of the build-related tasks.

Maven History

Maven was originally designed to simplify the build process of the Jakarta Turbine project. There were several projects, and each project contained slightly different Ant build files. JARs were checked into CVS.

The Apache group then developed Maven, which can build multiple projects together, publish project information, deploy projects, share JARs across several projects, and help teams collaborate.
Maven Objective

Maven's primary goal is to provide developers with −

  • A comprehensive model for projects, which is reusable, maintainable, and easier to comprehend.
  • Plugins or tools that interact with this declarative model.

A Maven project’s structure and contents are declared in an XML file, pom.xml, referred to as the Project Object Model (POM), which is the fundamental unit of the entire Maven system. Refer to the Maven POM section for more detail.
Convention over Configuration

Maven uses Convention over Configuration, which means developers are not required to create the build process themselves.

Developers do not have to mention each and every configuration detail. Maven provides sensible default behavior for projects. When a Maven project is created, Maven creates a default project structure. The developer is only required to place files accordingly and need not define any configuration in pom.xml.

As an example, the following are the default locations for project source code files, resource files, and other configurations, assuming ${basedir} denotes the project location:

  • Source code − ${basedir}/src/main/java
  • Resources − ${basedir}/src/main/resources
  • Tests − ${basedir}/src/test
  • Distributable JAR − ${basedir}/target
  • Compiled byte code − ${basedir}/target/classes

In order to build the project, Maven provides developers options to specify life-cycle goals and project dependencies (which rely on Maven plugin capabilities and on its default conventions). Much of the project management and build-related tasks are handled by Maven plugins.
