CSCI312 Big Data Management
School of Computing & Information Technology Session: 2, April 2019
University of Wollongong Lecturer: Janusz R. Getta
CSCI312 Big Data Management
Published on 27 April 2020
The objectives of Assignment 2 include conceptual modelling of a data warehouse,
logical design of a data warehouse and implementation of a data warehouse as a
collection of internal tables and external tables in Hive, querying a data cube,
implementation of HBase table and querying and manipulating data in HBase table,
This assignment is due on Saturday, 16 May 2019, 7:00pm (sharp) Singaporean Time
This assignment is worth 30% of the total evaluation in the subject.
Only electronic submission through Moodle at:
will be accepted. All email submissions will be deleted and mark 0 (“zero”) will be
immediately granted for Assignment 2. A submission procedure is explained at the end of
Assignment 2 specification.
A policy regarding late submissions is included in the subject outline.
Only one submission of Assignment 2 is allowed and only one submission per student is
A submission marked by Moodle as “late” is always treated as a late submission no
matter how many seconds it is late.
A submission that contains an incorrect file attached is treated as a correct submission
with all consequences coming from the evaluation of the file attached.
All files left on Moodle in a state “Draft(not submitted)” will not be evaluated.
A submission of compressed files (zipped, gzipped, rared, tared, 7-zipped, lhzed, … etc)
is not allowed. The compressed files will not be evaluated.
The second assignment is an individual assignment and it is expected that all its tasks
will be solved individually without any cooperation with the other students. However,
it is allowed to declare in the submission comments that a particular component or task of
this assignment has been implemented in cooperation with another student. In such a case
evaluation of a task or component may be shared with another student. In all other cases
plagiarism will result in a FAIL grade being recorded for entire assignment. If you have
any doubts, questions, etc. please consult your lecturer or tutor during laboratory/tutorial
classes or over e-mail.
Task 1 (8 marks)
Conceptual modelling of a data warehouse
An objective of this task is to create a conceptual schema of a sample data warehouse
domain described below. Read and analyse the following specification of a data
A group of hospitals would like to create a data warehouse to store information about
the medical examinations performed on the patients and later on to analyse the
contents of a data warehouse. It is expected that the planned data warehouse will
contain historical information collected over a long period of time.
A data warehouse supposed to contain information about the medical examinations,
patients, medical staff conducting medical examinations and including nurses and
doctors, hospitals, cities and countries the patients come from and hospitals are located
at, and locations of medical examinations.
The patients are described by a medical insurance number, first name, last name, data of
birth and address. A medical insurance number identifies each patient. An address
consists of building number, street name, city, and country.
The medical examinations are performed in the hospitals. A hospital is identified by its
unique name, and described by city and country it is located at. A medical examination is
performed in a medical examination room. A room number, building level, and building
number describe a medical examination room.
A medical staff member is described be an employee number, first name, last name, and
data of birth. A medical staff member belongs to one of the hospitals. A medical staff
member is either a nurse or a doctor. Nurses are additionally described by qualifications
and doctors are described by a specialization.
A medical examination may involve an external medical expert whose role is to provide
an independent opinion. An external medical expert is described by a title, first name, las
name and a name of organization he/she belongs to.
A medical examination is described by date and time when performed and the
measurement of the parameters like weight, blood pressure, cholesterol levels, and the
others (please insert at least two other measurements of your choice).
A data warehouse must be designed such it would be possible to easily implement the
following classes of applications.
(1) Find the total number of medical examinations performed per year, per month and
per day, per patient, per nurse, per doctor, per external expert who participated in
the examinations, per hospital, per city, per country, etc. For example, it should be
to find the total number of medical examinations performed in each month of 2017
in each hospital.
(2) Find the aggregations of the measures obtained from the medical examinations per
year, per month and per day, per patient, per nurse, per doctor, per external expert
who participated in the examinations, per hospital, per city, per country, etc. For
example, it should be possible to find an average blood pressure of all patients older
than 70 per last 12 months.
To create a conceptual schema of a sample data warehouse domain follow the steps listed
Step 1 Find a fact entity, find the measures describing a fact entity.
Step 2 Find the dimensions.
Step 3 Find the hierarchies over the dimensions.
Step 4 Find the descriptions (attributes) of all entity types.
Step 5 Draw a conceptual schema.
To draw a conceptual schema, use a graphical notation explained to you in a presentation
11 Conceptual Data Warehouse Design. To draw your diagram, you must use UMLet
diagram drawing tool and apply a “Conceptual modelling” notation, Selection of a
drawing notation is available in the right upper corner of the main menu of UMLet
diagram drawing tool.
A file solution1.pdf with a drawing of a conceptual schema.
Task 2 (4 marks)
Logical modelling of a data warehouse
Consider the following conceptual schema of a data warehouse.
Perform a step of logical design to transform a conceptual schema given above into a
logical schema (star schema). Use UMLet diagram drawing tool and apply a “Logical
modelling” notation to draw a logical schema. Selection of a drawing notation is
available in the right upper corner of the main menu of UMLet. Save a diagram of logical
schema in a file solution2.uxf and export it to a file solution2.pdf.
A file solution2.pdf with a drawing of a logical schema.
Task 3 (6 marks)
Implementation of a data warehouse as a collection of external tables in Hive
Consider the following two-dimensional data cube.
The data cube contains information about that parts that can be shipped by the suppliers.
Download and unzip a file task3.zip. You should obtain a folder task3 with the
following files: part.tbl, supplier.tbl, partsupp.tbl.
Use an editor to examine the contents of *.tbl files. Note, that the contents of the files
can be loaded into the relational tables obtained from the transformation of the twodimensional data cube given above into the relational table PART, SUPPLIER, and
Transfer the files into HDFS.
Implement HQL script solution3.hql that creates the external tables obtained from
a step of logical design performed earlier. The external tables must overlap on the files
transferred to HDFS in the previous step. Note, that a header in each *.tbl file must be
removed before creating the external tables.
Include into solution3.hql script SELECT statements that any 5 rows from each
one of the external tables implemented in the previous step and the total number of rows
included in each table.
When ready, use a command line interface beeline to process a script
solution3.hql and to save a report from processing in a file solution3.rpt.
A file solution3.rpt with a report from processing of HQL script
Task 4 (4 marks)
Querying data cube
Download and unzip a file task4.zip. You should obtain a folder task4 with the
following files: dbcreate.hql, dbdrop.hql, partsupp.tbl, lineitem.tbl,
A file orders.tbl contains information about the orders submitted by the customers.
A file lineitem.tbl contains information about the items included in the orders. A
file partsupp.tbl contains information about the items and suppliers of items
included in the orders.
Open Terminal window and use cd command to navigate to a folder with the just
unzipped files. Start Hive Server 2 in the terminal window (remember to start Hadoop
first). When ready process a script file dbcreate.hql to create the internal relational
tables and to load data into the tables. You can use either beeline or SQL Developer.
A script dbdrop.hql can be used to drop the tables.
The relational tables PARTSUPP, LINEITEM, ORDERS implement a simple twodimensional data cube. The relational tables PARTSUPP and ORDERS implement the
dimensions of parts supplied by suppliers and orders. A relational table LINEITEM
implements a fact entity of a data cube.
Implement the following queries. In each case, try to find the most efficient
(1) For the order clerks (O_CLERK) Clerk#000000522, Clerk#000000154, find
the total number of ordered parts per order date (O_ORDERDATE), per supplier
(L_SUPPKEY), per order date and supplier (O_ORDEDATE, L_SUPPKEY), and the
total number of ordered parts.
(2) For the parts with the keys (L_PARTKEY) 7, 8,9 find the largest discount applied
(L_DISCOUNT) per part key (L_PARTKEY) and per part key and supplier key
(L_PARTKEY, L_SUPPKEY) and the largest discount applied at all.
(3) Find the smallest quantity (L_QUANTITY) per order year (O_ORDERDATE), and
order clerk (O_CLERK).
(4) Find the smallest tax (L_TAX) applied per supplier key (L_SUPPKEY).
(5) Find an average quantity (L_QUANTITY) and the largest quantity (L_QUANTITY)
per part key (L_PARTKEY), and per part key, and order year (L_PARTKEY,
O_ORDERDATE). Consider only parts with the keys 5, 6, 7, 8, and 9.
When ready, save your SELECT statements in a file solution4.hql. Then, process a
script file solution4.hql and save the results in a report solution4.rpt.
A file solution4.rpt that contains a report from processing of SELECT statements.
Task 5 (4 marks)
Implementation of HBase table
Implement as a single HBase table a database that contains information described by the
following conceptual schema.
(1) Create HBase script solution5.hb with HBase shell commands that create
HBase table and load sample data into the table. Load into the table information
about at least two parts and two suppliers and such that one supplier supplies one
part and the other supplier supplies 2 parts.
When ready use HBase shell to process a script file solution5.hb and to save a
report from processing in a file solution5.rpt.
A file solution5.rpt that contains a report from processing of solution5.hb
script with the statements that create HBase table and load sample data.
Task 6 (4 marks)
Querying and manipulating data in HBase table
Consider a conceptual schema given below. The schema represents a simple database
domain where students submit assignments and each submission consists of several files
and it is related to one subject.
Download a file task6.hb with HBase shell commands and use HBase shell to process
it. Processing of task6.hb creates HBase table task6 and loads some data into it.
Use HBase shell to implement the following queries and data manipulations on the
HBase table created in the previous step. Save the queries and data manipulations in a file
(1) Find all information about a student number 007, list one version per cell.
(2) Find all information about a submission of assignment 1 performed by a student 007
in a subject 312, list one version per cell.
(3) Find the first and the last names of all students, list one version per cell.
(4) Find all information about a student whose last name is Bond, list two versions per
(5) Delete a column family FILES.
(6) List information about all submissions performed by the students.
(7) Increase the total number of versions in each cell of a column family SUBJECT.
(8) Delete HBase table task6.
When ready, start HBase shell and process a script file solution6.hb with Hbase
command shell. When processing is completed copy the contents of Command window
with a listing from processing of the script and paste the results into a file
solution6.rpt. Save the file. When ready submit a file solution6.rpt.
A file solution6.rpt with a listing from processing of a script file
Submission of Assignment 2
Note, that you have only one submission. So, make it absolutely sure that you submit
the correct files with the correct contents. No other submission is possible !
Submit the files solution1.pdf, solution2.pdf, solution3.rpt,
solution4.rpt, solution5.rpt, and solution6.rpt through Moodle in
the following way:
(1) Access Moodle at http://moodle.uowplatform.edu.au/
(2) To login use a Login link located in the right upper corner the Web page or in
the middle of the bottom of the Web page
(3) When logged select a site ISIT312 (SP220) Big Data Management
(4) Scroll down to a section SUBMISSIONS
(5) Click at In this place you can submit the outcomes of your
work on the tasks included in Assignment 2 link.
(6) Click at a button Add Submission
(7) Move a file solution1.pdf into an area You can drag and drop
files here to add them. You can also use a link Add…
(8) Repeat step (7) for the remaining files solution2.pdf, solution3.rpt,
solution4.rpt, solution5.rpt, and solution6.rpt
(9) Click at a button Save changes
(10)Click at a button Submit assignment
(11)Click at the checkbox with a text attached: By checking this box, I
confirm that this submission is my own work, … in order to
confirm authorship of your submission.
(12)Click at a button Continue
End of specification