What is Apache Pig?
Apache Pig is a platform for processing large datasets. Its scripts are written in Pig Latin, a high-level data-flow language that supports a wide range of data manipulation tasks.
Pig is generally used with the Hadoop framework. We write Pig scripts in Pig Latin to describe the analysis and processing our requirements call for.
Need for Pig
Hadoop already provides data processing through MapReduce programming, so why do we need Pig?
We need Pig because:
MapReduce programs are typically written in Java, and not everyone can write Java programs.
MapReduce programs can be complex, whereas Pig scripts are simple.
A simple MapReduce program takes around 50 lines of code, while the same job can often be written in a few lines of Pig.
Pig has powerful built-in operators and functions.
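To make the size comparison concrete, here is a minimal word-count sketch in Pig Latin, the job that is usually written as a first MapReduce program. The input file name 'input.txt' and all alias names are assumptions chosen for illustration.

```pig
-- Hypothetical input: a plain text file, one line of text per record.
lines   = LOAD 'input.txt' AS (line:chararray);
-- TOKENIZE splits each line into a bag of words; FLATTEN emits one row per word.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Collect identical words together.
grouped = GROUP words BY word;
-- COUNT returns the number of rows in each group's bag.
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS occurrences;
DUMP counts;
```

The whole job is four statements plus a DUMP, and Pig translates it into the underlying Hadoop jobs for you.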
Pig Eval functions:
Pig provides various built-in eval functions for analyzing and processing data.
The eval functions include:
- COUNT()
- AVG()
- SUM()
- CONCAT()
- DIFF()
- SUBTRACT()
- MAX()
- MIN()
- SIZE()
- TOKENIZE()
Examples:
Consider Student.txt
1,Alex,21,1234567890,Hyd,89
2,Sachin,22,9876543210,Kol,78
3,Raju,22,9898767654,Del,90
4,Christa,21,7689987654,Pun,93
5,Pam,23,2345678990,Bhu,75
6,Angela,23,1212121212,Che,87
7,Kamla,24,9434345256,Tri,83
8,Kumal,24,1223344556,Che,72
AVG():
Student_data = LOAD 'Student.txt' USING PigStorage(',') AS (Stud_id:int, Stud_name:chararray, Age:int, Mobile:long, City:chararray, Percentage:int);
student_group = GROUP Student_data BY Stud_id;
student_Percentage_Avg = FOREACH student_group GENERATE group, AVG(Student_data.Percentage);
DUMP student_Percentage_Avg;
COUNT():
student_count = FOREACH student_group GENERATE group, COUNT(Student_data.Stud_id);
DUMP student_count;
CONCAT():
student_concat = FOREACH Student_data GENERATE CONCAT(Stud_name, City);
DUMP student_concat;
Similarly, you can use the other eval functions.
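As a rough sketch of a few of the remaining functions on the same Student.txt, the script below uses MAX(), MIN() and SUM() over the whole file, COUNT() and AVG() per city, and SIZE() on a single field. The alias names (all_students, stats, by_city, city_stats, name_len) are assumptions introduced for this example.

```pig
Student_data = LOAD 'Student.txt' USING PigStorage(',')
    AS (Stud_id:int, Stud_name:chararray, Age:int, Mobile:long, City:chararray, Percentage:int);

-- GROUP ... ALL puts every row into one bag so the aggregates cover the whole file.
all_students = GROUP Student_data ALL;
stats = FOREACH all_students GENERATE
    MAX(Student_data.Percentage) AS max_pct,
    MIN(Student_data.Percentage) AS min_pct,
    SUM(Student_data.Percentage) AS total_pct;
DUMP stats;
-- for the eight rows shown above: (93,72,667)

-- Grouping by City gives bags that can hold more than one student.
by_city = GROUP Student_data BY City;
city_stats = FOREACH by_city GENERATE
    group AS City,
    COUNT(Student_data) AS students,
    AVG(Student_data.Percentage) AS avg_pct;
DUMP city_stats;

-- SIZE works row by row; on a chararray it returns the number of characters.
name_len = FOREACH Student_data GENERATE Stud_name, SIZE(Stud_name);
DUMP name_len;
```

Grouping by Stud_id, as in the AVG() and COUNT() examples, puts each student in a group of one, so grouping by a repeated field such as City or Age shows the aggregates more clearly.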
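How to perform eval functions on a tuple?
You cannot apply aggregate eval functions such as SUM() or AVG() directly to a tuple field, because these functions expect a bag as input.
A common approach is to un-nest the tuple with the FLATTEN operator, which turns the fields of the tuple into ordinary top-level fields.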
Example:
Consider Student.txt, where the top-level fields are separated by '|' so that the commas inside the marks tuple are not read as field delimiters:
1|Alex|21|1234567890|Hyd|(89,90,91)
2|Sachin|22|9876543210|Kol|(78,77,76)
3|Raju|22|9898767654|Del|(90,89,91)
4|Christa|21|7689987654|Pun|(93,83,73)
5|Pam|23|2345678990|Bhu|(75,77,79)
6|Angela|23|1212121212|Che|(87,93,90)
7|Kamla|24|9434345256|Tri|(83,81,85)
8|Kumal|24|1223344556|Che|(72,73,74)
Student_data = LOAD 'Student.txt' USING PigStorage('|') AS (Stud_id:int, Stud_name:chararray, Age:int, Mobile:long, City:chararray, Marks:tuple(m1:int, m2:int, m3:int));
Suppose the problem is to find the total marks of each student.
Student_flatten = FOREACH Student_data GENERATE Stud_id, Stud_name, Age, Mobile, City, FLATTEN(Marks);
Student_total_marks = FOREACH Student_flatten GENERATE Stud_id, Stud_name, Age, Mobile, City, SUM(TOBAG($5, $6, $7)) AS Total_Marks;
DUMP Student_total_marks;
Here $5, $6 and $7 refer to the sixth, seventh and eighth fields respectively; field positions start at $0. SUM() and AVG() each take a single bag, so TOBAG() wraps the three marks in a bag first. No GROUP is needed, because all the marks of one student sit in the same row.
To find the average marks obtained by each student:
Student_avg_marks = FOREACH Student_flatten GENERATE Stud_id, Stud_name, Age, Mobile, City, AVG(TOBAG($5, $6, $7)) AS Avg_Marks;
OR
Student_avg_marks = FOREACH Student_flatten GENERATE Stud_id, Stud_name, Age, Mobile, City, ($5 + $6 + $7) / 3.0 AS Avg_Marks;
DUMP Student_avg_marks;
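As an alternative sketch, the fields of a tuple can also be read with the dot operator, which avoids positional references and the separate FLATTEN step. This assumes Student.txt uses '|' between its top-level fields, so that the commas inside the marks tuple are not split by PigStorage; the alias Student_marks is hypothetical.

```pig
-- Assumption: top-level fields are '|'-separated, e.g. 1|Alex|21|1234567890|Hyd|(89,90,91)
Student_data = LOAD 'Student.txt' USING PigStorage('|')
    AS (Stud_id:int, Stud_name:chararray, Age:int, Mobile:long, City:chararray,
        Marks:tuple(m1:int, m2:int, m3:int));

-- Marks.m1 reads field m1 of the Marks tuple in the current row.
Student_marks = FOREACH Student_data GENERATE
    Stud_id,
    Stud_name,
    (Marks.m1 + Marks.m2 + Marks.m3) AS Total_Marks,
    -- dividing by 3.0 rather than 3 avoids integer division
    (Marks.m1 + Marks.m2 + Marks.m3) / 3.0 AS Avg_Marks;
DUMP Student_marks;
```

Referring to m1, m2 and m3 by name keeps the script readable if more fields are later added in front of the marks.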