• The marks of project 4 and the assignment are released.

    Posted by Xin Cao Monday 26 June 2017, 01:17:15 AM.

    Dear All,

    We still have one week before the final marks are submitted to the school, so do not worry if you have any questions about your marks. You can email me and we will double-check your submission.

    Most of you have done excellently in the assignment and the projects. If you can pass the exam, you will absolutely pass the course.

    Good luck with tomorrow's exam. Do your best~



  • Several Updates

    Posted by Xin Cao Thursday 08 June 2017, 07:56:14 PM.

    Dear All,

    1. The answers to all lab problems are released.

    2. The answers to the questions in the last assignment are released.

    3. The marks of project 3 are released. If you have any questions, please contact me as soon as possible.

    4. You can bring a UNSW-approved calculator to the exam.

    5. Today is the last day that you can provide your valuable feedback on the course in the myExperience system. I am very grateful to those who have already done this. Please complete the survey if you have some time tonight. I really hope that you can help me improve the teaching of this course. Thank you very much in advance for your time.



  • FAQs of project 4

    Posted by Xin Cao Friday 02 June 2017, 02:53:36 AM, last modified Tuesday 06 June 2017, 11:47:10 PM.

    Dear All,

    I've received a lot of emails recently regarding project 4. I have summarized the questions below:

    1. About stage 1 on sorting the element IDs.

    As I said in the class, you can skip stage 1 and assume that the element IDs are already sorted in the testing files.

    However, if you are able to finish stage 1 and PASS the test cases, you will get a 25% bonus (5 marks).

    2. How to cache the result of stage 1 in memory.

    The next stage requires reading the sorted lists of elements. If you create a public variable to store this information, your code may work on your local machine, but will fail when running on AWS.

    In Hadoop MapReduce, you can use the class DistributedCache to cache read-only files so that they are accessible from every machine in the cluster. You can refer to this link for some examples:
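    Independently of the linked examples, here is a minimal plain-Java sketch of the step that follows caching: turning the stage 1 output into an in-memory lookup inside each task instead of relying on a shared variable. The class SortedIdLoader, its method names, and the file layout (one element ID per line, already in stage 1 order) are assumptions made for illustration; they are not part of Hadoop or of the project specification. In a Hadoop job, loadRanks would typically be called from the setup method of the mapper or reducer, on the local copy of the cached file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of Hadoop or of the project specification.
class SortedIdLoader {

    // Assumes one element ID per line, listed in stage 1 order.
    // Maps each ID to its position (rank) in that order; blank lines are skipped.
    static Map<String, Integer> ranksFromLines(List<String> lines) {
        Map<String, Integer> rankOf = new HashMap<>();
        int rank = 0;
        for (String line : lines) {
            String id = line.trim();
            if (id.isEmpty()) {
                continue;
            }
            rankOf.put(id, rank++);
        }
        return rankOf;
    }

    // Reads the (locally available) sorted ID file and builds the rank lookup.
    static Map<String, Integer> loadRanks(Path sortedIdFile) throws IOException {
        return ranksFromLines(Files.readAllLines(sortedIdFile, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the cached stage 1 output.
        Path tmp = Files.createTempFile("sorted-ids", ".txt");
        Files.write(tmp, Arrays.asList("7", "3", "12"), StandardCharsets.UTF_8);
        Map<String, Integer> rankOf = loadRanks(tmp);
        System.out.println("rank of 3 = " + rankOf.get("3"));
        Files.delete(tmp);
    }
}
```

    Because every task builds its own copy of the lookup from the file, nothing depends on state held in the driver program, which is the reason a shared variable works locally but not on a cluster.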

    3. About the number of reducers.

    For the stages of computing similar pairs and removing duplicates, please use the number of reducers as specified in the parameter. If you decide to implement the first stage, you only need to use one reducer for this step.

    4. About the global ordering of the result.

    This project does not require a global ordering. It means that you only need to guarantee that the pairs are sorted within each output file. However, you can write a smart partitioner to achieve global ordering.

    Please note that the way you distribute the computation greatly affects efficiency. If you can distribute the work evenly, your program will benefit more from parallelization. Load balancing is one of the most important research topics in distributed computation. If you have enough time, you can try different partitioning methods and compare their performance.
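    As an illustration of the partitioner idea in item 4, here is a minimal sketch of range partitioning, written in plain Java so that it runs without a cluster. The class RangePartitionSketch, the method partitionFor, and the assumption that keys are integers in the range [0, maxKey] are all hypothetical; in a Hadoop job the same arithmetic would sit inside the getPartition method of a Partitioner subclass. Larger keys never go to a lower-numbered reducer, so reading the output files in reducer order gives a globally sorted result.

```java
// Hypothetical sketch: plain Java, no Hadoop dependency.
class RangePartitionSketch {

    // Maps a key in [0, maxKey] to a reducer index in [0, numReducers - 1]
    // using equal-width key ranges, so the mapping is non-decreasing in key.
    static int partitionFor(int key, int maxKey, int numReducers) {
        if (key < 0 || key > maxKey || numReducers <= 0) {
            throw new IllegalArgumentException("key or reducer count out of range");
        }
        // long arithmetic avoids int overflow for large keys
        return (int) ((long) key * numReducers / ((long) maxKey + 1));
    }

    public static void main(String[] args) {
        int maxKey = 99;      // assumed largest key in the data
        int numReducers = 4;  // assumed value of the reducer parameter
        for (int key : new int[] {0, 24, 25, 50, 99}) {
            System.out.println(key + " -> reducer " + partitionFor(key, maxKey, numReducers));
        }
    }
}
```

    Equal-width ranges are balanced only when the keys are spread evenly. With skewed keys, some reducers receive far more work than others, and the range boundaries would need to be chosen differently, for example from a sample of the keys.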

    5. Screenshots of running on AWS.

    The screenshots are only used as proof of the runtime of your program on different clusters. Please compress the screenshots and submit them with your program and the runtime figure. I've already updated the project specification.

    Good luck to you all~

    Kind regards,




COMP9313 17s1 (Big Data Management) is powered by WebCMS3
CRICOS Provider No. 00098G