Import a table into the Hadoop ecosystem using Pig and Hive. Explain the purpose of Pig and Hive, the process you followed, and how each is used in a Big Data environment. Be sure to include all supporting screenshots as necessary. 

Step 1 – Reading

  1. Read Chapter 5, "MapReduce Details for Multimachine Clusters," in Pro Hadoop (available via Books24x7).

 

Step 2 – Task 3 – Processing data using Pig and Hive

  1. Hive

**Please ensure that all required services are running properly in Cloudera Manager before starting this task.**

  1. Open a terminal window and type the following:

$ nano employees                         (This opens the nano text editor on a new file named employees)

 

Enter the following comma-separated records:

Mary,7038121129,VA,42000
Tim,3145558989,TX,86000
Bob,3429091122,MN,75500
Manisha,7096664242,WV,94200
Aditya,2021119765,CA,39000
Xinwuei,4129098787,OH,57600

(Do not put a space after each comma; the Hive table below uses ',' alone as the field delimiter, and a leading space would corrupt the numeric columns.)

 

Press CTRL+X and follow the prompts to save changes. Then enter the following commands:

$ hadoop fs -put employees

$ hive

 

  2. Please enter the following commands in Hive:

hive> CREATE DATABASE databasename;                  (where databasename can be any name)

hive> SHOW DATABASES;

hive> USE databasename;                              (switches to the database you just created so your table is stored there)

hive> CREATE TABLE tablename (colname1 datatype, colname2 datatype, colname3 datatype, . . .) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';        (where tablename can be any name, each datatype must be appropriate to the column data, and each column name must reflect the column contents)

hive> DESCRIBE tablename;                            (where tablename is the name of the table you have just created)

hive> LOAD DATA INPATH 'employees' INTO TABLE tablename;       (this loads the data file into the table)

hive> SELECT * FROM tablename;                 (this shows the contents of the table)

hive> SELECT count(*) FROM tablename;          (this counts the number of rows in the table)
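Because the Hive statements above are templates, it can help to sanity-check the expected query results locally before taking screenshots. The following plain-Python sketch (not Hive; the column names name, phone, state, and salary are our assumption for the employees data) mimics what SELECT * and SELECT count(*) should return:

```python
# Local sanity check (plain Python, not Hive) for the employees queries above.
# Assumed schema: name STRING, phone STRING, state STRING, salary INT.
csv_text = """Mary,7038121129,VA,42000
Tim,3145558989,TX,86000
Bob,3429091122,MN,75500
Manisha,7096664242,WV,94200
Aditya,2021119765,CA,39000
Xinwuei,4129098787,OH,57600"""

# ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' -> split each line on commas
rows = [line.split(",") for line in csv_text.splitlines()]

# SELECT * FROM tablename;  -> every row in the table
for name, phone, state, salary in rows:
    print(name, phone, state, salary)

# SELECT count(*) FROM tablename;  -> number of rows
print(len(rows))  # 6
```

If your SELECT count(*) in Hive does not return 6, recheck the delimiter and the contents of the employees file.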

 

  3. Please discuss the process you have just completed and upload your results with related screenshots as needed.
  2. Pig
  1. In a terminal, use nano to create two text files named "file1" and "file2" with the following values, respectively:

 

file1:
10,30,0
0,5,10
10,20,10

file2:
5,1,10
10,5,0
20,20,20

 

Press CTRL+X and follow the prompts to save each file. Then type the following:

$ hadoop fs -put file1

$ hadoop fs -put file2

$ pig

  2. In Pig, enter the following commands:

grunt> A = LOAD '/user/cloudera/file1' USING PigStorage(',') AS (a1:int, a2:int, a3:int);

grunt> DUMP A;                       (Pig relation names are case sensitive, so refer to A, not a)

grunt> B = LOAD '/user/cloudera/file2' USING PigStorage(',') AS (b1:int, b2:int, b3:int);

grunt> DUMP B;

grunt> DESCRIBE A;

grunt> DESCRIBE B;

grunt> C = UNION A, B;

grunt> DUMP C;

grunt> SPLIT C INTO D IF $0 == 0, E IF $0 == 10;

grunt> DUMP D;

grunt> DUMP E;

grunt> F = FILTER C BY $1 > 5;

grunt> DUMP F;

grunt> ILLUSTRATE F;

grunt> G = GROUP C BY $2;

grunt> DUMP G;

  3. Please discuss the operations you performed in Pig and the logical results each one produced.
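To help with that discussion, the relational steps above can be traced locally. This plain-Python sketch (not Pig) uses the file1/file2 values shown earlier and assumes UNION simply concatenates the two bags; note that real Pig does not guarantee row order, so your DUMP output may list the same tuples in a different sequence:

```python
# Local trace (plain Python, not Pig) of the relational operations above.
A = [(10, 30, 0), (0, 5, 10), (10, 20, 10)]   # file1
B = [(5, 1, 10), (10, 5, 0), (20, 20, 20)]    # file2

# C = UNION A, B;  -- Pig's UNION keeps duplicates (bag union)
C = A + B

# SPLIT C INTO D IF $0 == 0, E IF $0 == 10;
D = [t for t in C if t[0] == 0]    # tuples whose first field is 0
E = [t for t in C if t[0] == 10]   # tuples whose first field is 10

# F = FILTER C BY $1 > 5;
F = [t for t in C if t[1] > 5]

# G = GROUP C BY $2;  -- one (key, bag) pair per distinct third field
G = {}
for t in C:
    G.setdefault(t[2], []).append(t)

print(D)          # [(0, 5, 10)]
print(F)          # the three tuples whose second field exceeds 5
print(sorted(G))  # group keys: [0, 10, 20]
```

Note that a tuple such as (10, 30, 0) lands in both E (first field is 10) and, via its second field, in F; SPLIT and FILTER each scan C independently rather than consuming rows.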

 

Step 3 – Task 3 – Report

 

Write a report (4-6 pages) that includes:

  • A cover page and table of contents following APA standards.
  • A short research report on other components of the Hadoop platform: Hive and Pig.
  • A description of creating a file and loading its data, including your understanding of the process and its purpose, with supporting screenshots.
  • The results generated using Hive and Pig, with supporting screenshots.