countByValue() And countByKey()



WordCount Example using countByValue()


countByValue():

countByValue() is an action: it returns the count of each distinct value to the driver as a map (a Python dictionary in PySpark), not as an RDD.

data_split.countByValue()

defaultdict(<type 'int'>, {u'good': 2, u'hello': 2, u'morning': 2})
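The idea behind countByValue() can be sketched locally without Spark: count the elements inside each partition, then merge the partial counts into one dictionary. This is only a rough illustration, not Spark's actual code. The helper name count_by_value and the sample partitions are hypothetical, chosen so that the counts match the output shown above.

```python
from collections import defaultdict

def count_by_value(partitions):
    """Local stand-in for RDD.countByValue(): count each distinct element."""
    merged = defaultdict(int)
    for partition in partitions:
        # Count within one partition (in Spark this would run on an executor).
        local = defaultdict(int)
        for item in partition:
            local[item] += 1
        # Merge the partial counts (in Spark the merged map ends up on the driver).
        for value, n in local.items():
            merged[value] += n
    return merged

# Hypothetical input: two partitions of already-split words.
partitions = [["hello", "good", "morning"], ["hello", "good", "morning"]]
counts = count_by_value(partitions)
print(dict(counts))
```

Because the whole result is a single dictionary held in the driver's memory, this kind of action is best suited to data with a modest number of distinct values.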

countByKey():

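Counts the number of elements for each key.
countByKey() works on an RDD whose elements are two-component (key, value) tuples. For each distinct key it counts how many elements carry that key; the values themselves are ignored. The result is returned to the driver (master) as a dictionary that maps each key to its count, which the examples here display as (key, count) pairs by calling .items().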

Example-1:

rdd1 = sc.parallelize((("HR",5),("RD",4),("ADMIN",5),("SALES",4),("SER",6),("MAN",8)))

rdd1.collect()

[('HR', 5), ('RD', 4), ('ADMIN', 5), ('SALES', 4), ('SER', 6), ('MAN', 8)]

rdd1.countByKey().items()

[('SER', 1), ('HR', 1), ('SALES', 1), ('RD', 1), ('ADMIN', 1), ('MAN', 1)]
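Every count in Example-1 is 1 because each key occurs exactly once; the numbers 5, 4, 6 and 8 are values and play no part in the count. A minimal local sketch in plain Python makes this visible. The helper count_by_key and the extra ("HR", 7) pair are hypothetical, added only for illustration.

```python
from collections import Counter

def count_by_key(pairs):
    """Local stand-in for RDD.countByKey(): count pairs per key, ignoring values."""
    return Counter(key for key, _value in pairs)

# "HR" appears twice with different values; only the key occurrences are counted.
pairs = [("HR", 5), ("RD", 4), ("HR", 7)]
key_counts = count_by_key(pairs)
print(sorted(key_counts.items()))  # [('HR', 2), ('RD', 1)]
```

To add up the values per key instead of counting occurrences, a different operation such as reduceByKey() would be the usual choice.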

Example-2:

data = sc.parallelize(("hello", "hello", "how", "how", "are", "are", "you", "hi"))

data.map(lambda x:(x,1)).countByKey().items()

[('how', 2), ('you', 1), ('hi', 1), ('hello', 2), ('are', 2)]
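Example-2 pairs every word with 1 and then counts per key, which gives the same counts as counting the words directly (what countByValue() does). A small local sketch in plain Python, using collections.Counter as a stand-in for both Spark actions, shows the equivalence:

```python
from collections import Counter

words = ["hello", "hello", "how", "how", "are", "are", "you", "hi"]

# Route 1: pair each word with 1, then count pairs per key (as in Example-2).
pairs = [(w, 1) for w in words]
by_key = Counter(key for key, _ in pairs)

# Route 2: count the words directly (the countByValue() approach).
by_value = Counter(words)

print(by_key == by_value)      # True
print(sorted(by_key.items()))  # [('are', 2), ('hello', 2), ('hi', 1), ('how', 2), ('you', 1)]
```

The order of the pairs printed in Example-2 differs from the sorted order here because a dictionary's item order is not sorted by key.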


