ISIT312 Big Data Management

ISIT312 Big Data Management
SIM S2 2020
Assignment 3
All files left on Moodle in a state “Draft(not submitted)” will not be evaluated. Please refer to
the submission dropbox on Moodle for the submission due date and time.
This assignment contributes to 20% of the total evaluation in the subject. This assignment
consists of 4 tasks. Specification of each task starts from a new page.
Task 1. Apache Pig (8 marks)
Task 2. Spark and HBase (4 marks)
Task 3. Spark and Hive (5 marks)
Task 4. Spark Streaming (3 marks)
All the tasks must be completed in the “BigDataVM” virtual machine used in ISIT312.
It is a requirement that all Laboratory and Assignment tasks in this subject must be solved
individually without any cooperation with the other students. If you have any doubts, questions,
etc. please consult your lecturer or tutor during lab classes or office hours. Plagiarism will result
in a FAIL grade being recorded for that assessment task.
Task 1. Data Queries in Pig Latin (8 marks)
Consider the above conceptual schema of a data warehouse. The data of the schema is stored
in the files customer.tbl, order_details.tbl, order.tbl, product.tbl and
salesperson.tbl, all of which are available in a “Resources” folder of Assignment 3 on
Moodle.
Note that each file has a header with information about the meanings of data in each column.
A header is not a data component of each file. Remove the headers before transferring the
files to HDFS
For each of the following queries on the data warehouse, implement and execute a script in
Pig Latin to return the correct result:
(1) Find the number of orders whose ship-city is London.
(2) Find the number of products that were not ordered in 1996.
(3) Find the order value (i.e., unit price multiplied by quantity of products per order) for
order IDs 10270 to 10279.
(4) Sort the salespersons by the total order value of orders they handled in a descending
order, and find the employee ID, first name and last name of the top three salespersons.
(5) Find the number of orders for products (i.e., product names) “Ikura” or “Tofu”.
Deliverables
Produce a report solution1.pdf which clearly documents the Pig scripts and output. Submit
the report.

SALESPERSON
employee-id ID
last-nme
first-name
title
birth-date
hire-date
notes
CUSTOMER
customer-id ID
customer-code
company-name
contact-name
contact-title
city
region
postal-code
country
phone
fax
PRODUCT
product-id ID
product-name
unit-price
units-in-stock
units-on-order
discontinuted
ORDER
order-id ID
order-date
ship-via
ship-city
ship-region
ship-postal-code
ship-country

ORDER-DETAIL
unit-price
quantity

discount

Task 2. Spark and HBase (4 marks)
The file 1902 is a weather record dataset collected from one station in U.S. in 1902. Each
record is a line in the ASCII format. The following shows one sample line with some of the
salient fields annotated.
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation year, month and date

We are the Best!

course-preview

275 words per page

You essay will be 275 words per page. Tell your writer how many words you need, or the pages.


12 pt Times New Roman

Unless otherwise stated, we use 12pt Arial/Times New Roman as the font for your paper.


Double line spacing

Your essay will have double spaced text. View our sample essays.


Any citation style

APA, MLA, Chicago/Turabian, Harvard, our writers are experts at formatting.


We Accept

Secure Payment
Image 3