Episode 76: R Programming Questions and Answers – Version 11

16. What is clustering? What is the difference between k-means clustering and hierarchical clustering?

A cluster is a group of objects that belong to the same class. Clustering is the process of partitioning a set of abstract objects into classes of similar objects.

Let us see why clustering is required in data analysis:

  • Scalability − We need highly scalable clustering algorithms to deal with large databases.
  • Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data such as interval-based (numerical) data, categorical, and binary data.
  • Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bounded to distance measures that tend to find only spherical clusters of small size.
  • High dimensionality − The clustering algorithm should not only be able to handle low-dimensional data but also the high dimensional space.
  • Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
  • Interpretability − The clustering results should be interpretable, comprehensible, and usable.

K-means clustering:

K-means clustering is a well-known partitioning method. In this method, objects are classified as belonging to one of K groups. The result of the partitioning is a set of K clusters, with each object of the data set belonging to exactly one cluster. Each cluster may have a centroid or a cluster representative. In the case where we consider real-valued data, the arithmetic mean of the attribute vectors for all objects within a cluster provides an appropriate representative; alternative types of centroid may be required in other cases.

Example: A cluster of documents can be represented by a list of those keywords that occur in some minimum number of documents within the cluster. If the number of clusters is large, the centroids can be further clustered to produce a hierarchy within the data set. K-means is a data mining algorithm that clusters the data samples using an iterative approach.

R code

# Determine number of clusters

# 'mydata' is assumed to be a numeric data frame or matrix
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

# K-Means Cluster Analysis

fit <- kmeans(mydata, 5) # 5 cluster solution

# get cluster means

aggregate(mydata,by=list(fit$cluster),FUN=mean)

# append cluster assignment

mydata <- data.frame(mydata, fit$cluster)

A robust version of K-means based on medoids can be invoked by using pam() instead of kmeans(). The function pamk() in the fpc package is a wrapper for pam() that also prints the suggested number of clusters based on optimum average silhouette width.
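A minimal sketch of these medoid-based calls, assuming the same mydata object as in the kmeans() example above:

library(cluster)
fit.pam <- pam(mydata, k = 5)   # partitioning around medoids, 5 clusters
fit.pam$medoids                 # the representative objects (medoids)

library(fpc)
fit.pamk <- pamk(mydata)        # estimates k via average silhouette width
fit.pamk$nc                     # suggested number of clusters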

Hierarchical Clustering:

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches here:

  1. Agglomerative Approach
  2. Divisive Approach

Agglomerative Approach:

This approach is also known as the bottom-up approach. In this, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another. It keeps on doing so until all of the groups are merged into one or until the termination condition holds.

Divisive Approach:

This approach is also known as the top-down approach. In this, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This continues until each object is in its own cluster or the termination condition holds. Hierarchical methods are rigid, i.e., once a merging or splitting is done, it can never be undone.
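Base R's hclust() (used in the example below) implements the agglomerative approach. For the divisive approach, here is a minimal sketch using diana() from the cluster package, reusing the dist.cars distance matrix constructed in the code below:

library(cluster)
cars.diana <- diana(dist.cars)     # divisive hierarchical clustering
plot(cars.diana, which.plots = 2)  # plot only the dendrogram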

R code

Cars example

# The mtcars data set is built into R:

help(mtcars)

# We will focus on the variables that are continuous in nature rather than discrete:

cars.data <- mtcars[, c(1,3,4,5,6,7)]  # mpg, disp, hp, drat, wt, qsec

# Standardizing by dividing through by the sample range of each variable

samp.range <- function(x){
  myrange <- diff(range(x))
  return(myrange)
}
my.ranges <- apply(cars.data, 2, samp.range)
cars.std <- sweep(cars.data, 2, my.ranges, FUN = "/")

# Getting distance matrix:

dist.cars <- dist(cars.std)

# Single linkage:

cars.single.link <- hclust(dist.cars, method = "single")

# Plotting the single linkage dendrogram:

plot(cars.single.link, labels = row.names(cars.data), ylab = "Distance")  # plot() replaces the defunct plclust()

# Opening new window while keeping previous one open

dev.new()  # cross-platform; windows() works only on Windows

# Complete linkage:

cars.complete.link <- hclust(dist.cars, method = "complete")

# Plotting the complete linkage dendrogram:

plot(cars.complete.link, labels = row.names(cars.data), ylab = "Distance")

# Average linkage:

cars.avg.link <- hclust(dist.cars, method = "average")

# Plotting the average linkage dendrogram:

plot(cars.avg.link, labels = row.names(cars.data), ylab = "Distance")

# Average Linkage dendrogram seems to indicate two major clusters,

# Single Linkage dendrogram may indicate three.

# Single Linkage Solution:

cut.3 <- cutree(cars.single.link, k=3)

# printing the "clustering vector"

cut.3

cars.3.clust <- lapply(1:3, function(nc) row.names(cars.data)[cut.3==nc])

# printing the clusters in terms of the car names

cars.3.clust

# Cluster 1 seems to be mostly compact cars, Cluster 2 sports cars, and Cluster 3 large luxury sedans

17. Give examples of the "rbind()" and "cbind()" functions in R

cbind(): As the name suggests, it is used to bind two sets of columns together. One fact to keep in mind while binding columns is that the number of rows in both objects must be the same.

Let’s understand this with an example:

This is the "Marks" data set, which comprises marks in three subjects (shown as a screenshot in the original post).

We'll bind this with a new data set, "Percentage", which consists of two columns: "Total" and "Percentage" (also shown as a screenshot in the original post).

Let's combine the columns from these two data sets using the cbind() function:

cbind(Marks, Percentage)

Since the number of rows in both data sets is the same, we can combine their columns with the cbind() function.
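The screenshots are not reproduced here, so below is a self-contained sketch with hypothetical column names and values; rbind(), shown last, stacks rows instead and requires matching column names:

# Hypothetical stand-ins for the data sets in the screenshots
Marks <- data.frame(maths = c(68, 75, 90),
                    physics = c(72, 81, 85),
                    chemistry = c(70, 78, 88))
Percentage <- data.frame(Total = c(210, 234, 263),
                         Percentage = c(70.0, 78.0, 87.7))

cbind(Marks, Percentage)  # binds columns; row counts must match

# rbind() stacks rows; column names must match
rbind(Marks, data.frame(maths = 60, physics = 65, chemistry = 70))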

18. Give examples of while and for loops in R.

While loop:

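The original screenshot is missing; here is a minimal while loop sketch (the variable and bound are illustrative):

# Print the numbers 1 through 5 with a while loop
i <- 1
while (i <= 5) {
  print(i)
  i <- i + 1
}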

For loop:
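And a minimal for loop sketch producing the same output:

# Print the numbers 1 through 5 with a for loop
for (i in 1:5) {
  print(i)
}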

19. Give examples of the "select" and "filter" functions from the "dplyr" package.

(The original post shows a screenshot of the "Birth_weight" data set here.)

select(): This function from the "dplyr" package is used to select specific columns from a data set:

Birth_weight %>% select(1,2,3) -> birth   # keep the first three columns
Birth_weight %>% select(-5) -> birth      # drop the fifth column

filter(): This function from the "dplyr" package is used to filter rows on the basis of a condition:

Birth_weight %>% filter(mother_age > 35) -> birth
Birth_weight %>% filter(baby_wt > 125 & smoke == "smoker") -> birth
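Since the Birth_weight data set itself is not reproduced here, the sketch below builds a hypothetical stand-in with the columns used above so the calls can be run end to end:

library(dplyr)

# Hypothetical stand-in for the Birth_weight data set
Birth_weight <- data.frame(
  mother_age = c(28, 41, 36, 23),
  baby_wt    = c(120, 132, 110, 128),
  smoke      = c("non-smoker", "smoker", "smoker", "non-smoker")
)

Birth_weight %>% select(1, 2) -> birth             # first two columns
Birth_weight %>% filter(mother_age > 35) -> birth  # rows with mother_age > 35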

20. What is the use of the stringr package? Give some examples of the functions in stringr.

Some functions in stringr:

The examples below use the fruit character vector that ships with the stringr package.
  • Converting the strings to upper case:
str_to_upper(fruit)

  • Finding the number of letters in each string:
str_length(fruit)
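A minimal self-contained sketch tying these together; fruit is a character vector bundled with stringr, and str_count() (given a pattern) counts pattern matches rather than letters:

library(stringr)

fruit[1:3]                  # e.g. "apple" "apricot" "avocado"
str_to_upper(fruit[1:3])    # "APPLE" "APRICOT" "AVOCADO"
str_length(fruit[1:3])      # number of characters in each string
str_count(fruit[1:3], "a")  # occurrences of "a" in each string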
