Import a table into the Hadoop ecosystem using Pig and Hive. Explain the purpose of Pig and Hive, the process you followed, and how each is used in a Big Data environment. Be sure to include all supporting screenshots as necessary. 

Step 1 – Reading

  1. Read Chapter 5, "MapReduce Details for Multimachine Clusters," in Pro Hadoop (available via Books24x7).

 

Step 2 – Task 3 – Processing data using Pig and Hive

  1. Hive

**Please ensure that all required services are running properly in Cloudera Manager before starting this task.**

  1. Open a terminal window and type the following:

$ nano employees                         (This opens the nano text editor on a new file named employees)

 

Enter the following comma-separated records:

Mary,7038121129,VA,42000
Tim,3145558989,TX,86000
Bob,3429091122,MN,75500
Manisha,7096664242,WV,94200
Aditya,2021119765,CA,39000
Xinwuei,4129098787,OH,57600

(Do not put a space after each comma; the Hive table below uses ',' alone as the field delimiter, and a leading space would corrupt the numeric columns.)

 

Press CTRL+X and follow the prompts to save changes. Then enter the following commands:

$ hadoop fs -put employees

$ hive

 

  2. Please enter the following commands in Hive:

hive> CREATE DATABASE databasename;                  (where databasename can be any name)

hive> SHOW DATABASES;

hive> USE databasename;                              (switches to the database you just created so your table is stored there)

hive> CREATE TABLE tablename (colname1 datatype, colname2 datatype, colname3 datatype, . . .) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';        (where tablename can be any name, each datatype must be appropriate to the column data, and each column name must reflect the column contents)

hive> DESCRIBE tablename;                            (where tablename is the name of the table you have just created)

hive> LOAD DATA INPATH 'employees' INTO TABLE tablename;       (this loads the data file into the table)

hive> SELECT * FROM tablename;                 (this shows the contents of the table)

hive> SELECT count(*) FROM tablename;          (this counts the number of rows in the table)
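Because the Hive statements above are templates, it can help to sanity-check the expected query results locally before taking screenshots. The following plain-Python sketch (not Hive; the column names name, phone, state, and salary are our assumption for the employees data) mimics what SELECT * and SELECT count(*) should return:

```python
# Local sanity check (plain Python, not Hive) for the employees queries above.
# Assumed schema: name STRING, phone STRING, state STRING, salary INT.
csv_text = """Mary,7038121129,VA,42000
Tim,3145558989,TX,86000
Bob,3429091122,MN,75500
Manisha,7096664242,WV,94200
Aditya,2021119765,CA,39000
Xinwuei,4129098787,OH,57600"""

# ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' -> split each line on commas
rows = [line.split(",") for line in csv_text.splitlines()]

# SELECT * FROM tablename;  -> every row in the table
for name, phone, state, salary in rows:
    print(name, phone, state, salary)

# SELECT count(*) FROM tablename;  -> number of rows
print(len(rows))  # 6
```

If your SELECT count(*) in Hive does not return 6, recheck the delimiter and the contents of the employees file.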

 

  3. Please discuss the process you have just completed and upload your results with related screenshots as needed.
  2. Pig
  1. In a terminal, use nano to create two text files named "file1" and "file2" with the following values, respectively:

 

file1:
10,30,0
0,5,10
10,20,10

file2:
5,1,10
10,5,0
20,20,20

 

Press CTRL+X and follow the prompts to save each file. Then type the following:

$ hadoop fs -put file1

$ hadoop fs -put file2

$ pig

  2. In Pig, enter the following commands:

grunt> A = LOAD '/user/cloudera/file1' USING PigStorage(',') AS (a1:int, a2:int, a3:int);

grunt> DUMP A;                       (Pig relation names are case sensitive, so refer to A, not a)

grunt> B = LOAD '/user/cloudera/file2' USING PigStorage(',') AS (b1:int, b2:int, b3:int);

grunt> DUMP B;

grunt> DESCRIBE A;

grunt> DESCRIBE B;

grunt> C = UNION A, B;

grunt> DUMP C;

grunt> SPLIT C INTO D IF $0 == 0, E IF $0 == 10;

grunt> DUMP D;

grunt> DUMP E;

grunt> F = FILTER C BY $1 > 5;

grunt> DUMP F;

grunt> ILLUSTRATE F;

grunt> G = GROUP C BY $2;

grunt> DUMP G;

  3. Please discuss the operations you performed in Pig and the logical results each one produced.
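To help with that discussion, the relational steps above can be traced locally. This plain-Python sketch (not Pig) uses the file1/file2 values shown earlier and assumes UNION simply concatenates the two bags; note that real Pig does not guarantee row order, so your DUMP output may list the same tuples in a different sequence:

```python
# Local trace (plain Python, not Pig) of the relational operations above.
A = [(10, 30, 0), (0, 5, 10), (10, 20, 10)]   # file1
B = [(5, 1, 10), (10, 5, 0), (20, 20, 20)]    # file2

# C = UNION A, B;  -- Pig's UNION keeps duplicates (bag union)
C = A + B

# SPLIT C INTO D IF $0 == 0, E IF $0 == 10;
D = [t for t in C if t[0] == 0]    # tuples whose first field is 0
E = [t for t in C if t[0] == 10]   # tuples whose first field is 10

# F = FILTER C BY $1 > 5;
F = [t for t in C if t[1] > 5]

# G = GROUP C BY $2;  -- one (key, bag) pair per distinct third field
G = {}
for t in C:
    G.setdefault(t[2], []).append(t)

print(D)          # [(0, 5, 10)]
print(F)          # the three tuples whose second field exceeds 5
print(sorted(G))  # group keys: [0, 10, 20]
```

Note that a tuple such as (10, 30, 0) lands in both E (first field is 10) and, via its second field, in F; SPLIT and FILTER each scan C independently rather than consuming rows.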

 

Step 3 – Task 3 – Report

 

Write a report (4-6 pages) that includes:

  • A cover page and table of contents following APA standards.
  • A short research report on other components of the Hadoop platform: Hive and Pig.
  • A description of creating a file and loading its data, including your understanding of the process and its purpose, with supporting screenshots.
  • The results generated using Hive and Pig, with supporting screenshots.