
  • The last notice

    Posted by Xin Cao Monday 12 November 2018, 07:48:25 PM, last modified Monday 12 November 2018, 11:09:45 PM.

    Dear All,

    1. The solution to the last assignment has been released. The tutors are still marking it, and I will try to release the marks by tomorrow.

    2. The marks for Project 3 will be released later tonight. We consider both the local runtime and the AWS elapsed time when marking efficiency. Note that your code may run fast and still receive a low efficiency mark. This happens when the algorithm does not scale to large data sets. For example, some of you broadcast the entire data set, which is not practical.

    Most of you missed one optimisation: avoid generating duplicate pairs by computing the similarity of each pair on only one of its common elements. The idea is that, for a pair of records sharing several elements, you do the computation only on the least frequent shared element. On every other shared element you can safely discard the pair without computing its similarity again. This can greatly improve efficiency.
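
    A minimal sketch of this idea on plain Scala collections (not Spark) is given below. It is an illustration under stated assumptions, not the marking solution: the similarity measure is assumed to be Jaccard, ties in frequency are assumed to be broken by element value, and the names DedupPairs, rankElements and similarPairs are hypothetical.

```scala
object DedupPairs {
  // Rank elements so that rarer ones come first.
  // Assumption: ties in frequency are broken by element value.
  def rankElements(records: Seq[(Int, Seq[Int])]): Map[Int, Int] = {
    val freq = records.flatMap(_._2.distinct).groupBy(identity).map { case (e, xs) => (e, xs.size) }
    freq.toSeq.sortBy { case (e, f) => (f, e) }.map(_._1).zipWithIndex.toMap
  }

  // Assumption: the similarity measure is Jaccard.
  def jaccard(a: Set[Int], b: Set[Int]): Double =
    a.intersect(b).size.toDouble / a.union(b).size

  def similarPairs(records: Seq[(Int, Seq[Int])], threshold: Double): Seq[(Int, Int, Double)] = {
    val rank = rankElements(records)
    // Each record: id, elements ordered rarest-first, and the same elements as a set.
    val prepared = records.map { case (id, es) => (id, es.distinct.sortBy(rank), es.toSet) }
    // Inverted index: element -> records that contain it.
    val index = prepared.flatMap(r => r._2.map(e => (e, r))).groupBy(_._1)
    val pairs = index.toSeq.flatMap { case (e, postings) =>
      val recs = postings.map(_._2).toVector
      for {
        i <- recs.indices
        j <- (i + 1) until recs.size
        a = recs(i)
        b = recs(j)
        // Keep the pair only under its rarest shared element;
        // under every other shared element it is skipped.
        if a._2.find(b._3.contains).contains(e)
      } yield (math.min(a._1, b._1), math.max(a._1, b._1), jaccard(a._3, b._3))
    }
    pairs.filter(_._3 >= threshold).sortBy(p => (p._1, p._2))
  }
}
```

    For example, with records 0 -> {1, 2, 3}, 1 -> {2, 3, 4} and 2 -> {3, 5}, records 0 and 1 share elements 2 and 3, but their similarity is computed only under element 2, which is the rarer of the two. Under element 3 the pair is skipped, so it appears exactly once in the output.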

    The two additional test cases used in marking have also been released. Please check your code against them before emailing us.

    3. The exam is closed book, and you are NOT allowed to bring anything with you except a calculator. This course does NOT have a supplementary exam. You may not have enough time to answer all questions, so manage your time wisely.

    Good luck!



  • Update of the Assignment

    Posted by Xin Cao Saturday 27 October 2018, 05:43:09 PM.

    Dear All,

    In question 2, please change h1(n) = 3n-1 to h1(n) = 5n-1.

    The previous hash function causes collisions, which makes the problem complicated.

    If you have already submitted, please update your answer and submit again.

    I am sorry for the inconvenience.



  • FAQ of Project 3

    Posted by Xin Cao Tuesday 23 October 2018, 10:09:28 PM.

    Dear All,

    Below are some points I need to clarify:

    1. The number of nodes in a cluster means the total number of nodes, including both the master and core nodes. This is to reduce your cost on this project.

    2. When recording the run time, use the elapsed time from the step tab, which excludes the time spent setting up and terminating the cluster.

    3. After rounding the similarities to six decimal places, please remove the trailing zeros. You can use the following code to do this: BigDecimal(similarity).setScale(6, BigDecimal.RoundingMode.HALF_UP).toDouble

    4. Remove "setMaster" from your code when submitting your solution. We will configure this in spark-submit during marking.

    5. It is not guaranteed that the elements are sorted by frequency in the data sets. A new test data set is released for you to check the correctness of your solution.
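
    The rounding in point 3 can be sketched as a small helper. The object name Rounding and the function name round6 are hypothetical; the expression inside is the one given above, using scala.math.BigDecimal. Converting the rounded value back to a Double is what drops the trailing zeros when it is printed.

```scala
object Rounding {
  // Round to six decimal places (half up), then convert back to Double
  // so that trailing zeros disappear when the value is printed.
  def round6(similarity: Double): Double =
    BigDecimal(similarity).setScale(6, BigDecimal.RoundingMode.HALF_UP).toDouble
}
```

    With this helper, Rounding.round6(0.5000004) prints as 0.5 rather than 0.500000, and Rounding.round6(2.0 / 3) gives 0.666667.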

    The marks for project 2 have been released. The test cases used in marking have also been released. If you have any questions about your mark, please first test your code using these test data sets. If your results are correct, then contact me or the tutors.

    The last assignment has also been released. The submission deadline is next Sunday.

    Finally, please remember to provide your valuable comments in myExperience when you have time. Thank you!



  • Test data set of project2

    Posted by Xin Cao Friday 28 September 2018, 12:34:07 PM.

    Dear All,

    I received several emails requesting a test case yesterday.

    One test case for problem 1 has been released. Please use it to check the correctness of your code.

    The marks for project 1 have been released as well. If you have any questions, please contact me or the course admin as soon as possible. We can check your submission again.



  • Updates of Project 2

    Posted by Xin Cao Sunday 16 September 2018, 10:40:20 PM.

    Dear All,

    As said in the lecture, the submission deadline has been extended to Sep 30. You have one more week to work on it.

    I have also fixed some errors in the project specification, as follows:

    1. Sort the results, in the format (w_i, w_j, rf), first by w_i in ascending order, then by rf in descending order. If the relative frequencies are also the same, sort by w_j in ascending order.

    2. Given the example graph, there are three cycles, and the output should be 3.
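
    The sort order in point 1 can be sketched on an in-memory sequence as follows. This only illustrates the comparison rule; the names SortPairs and sortResults are hypothetical, and how the ordering is achieved inside your own job is up to your design.

```scala
object SortPairs {
  // Order (w_i, w_j, rf) triples: w_i ascending, then rf descending,
  // then w_j ascending when both w_i and rf are equal.
  def sortResults(rows: Seq[(String, String, Double)]): Seq[(String, String, Double)] =
    rows.sortWith { (a, b) =>
      if (a._1 != b._1) a._1 < b._1
      else if (a._3 != b._3) a._3 > b._3
      else a._2 < b._2
    }
}
```

    For instance, the rows (a, z, 0.25), (a, y, 0.5) and (a, c, 0.25) come out as (a, y, 0.5), (a, c, 0.25), (a, z, 0.25): the higher relative frequency first, then the tie resolved by w_j.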



  • Test data set of project1

    Posted by Xin Cao Monday 03 September 2018, 11:39:32 AM.

    Dear All,

    One test data set has been released for you to check the correctness of your program. Please follow the steps below:

    1. Start HDFS and YARN

    2. Make sure your Java version is 7 (check the configuration file ~/.bashrc)

    3. Download the data set package and the script into a folder

    4. Copy to this folder as well.

    5. Make the script executable, i.e., chmod +x

    6. Run the script and wait for the result

    7. Compare your result (part-r-00000) and the sample output (result) by "diff"

    Previously, you were required to set the number of reducers to 3. Now, please make this number a parameter of your program. That is, your program should receive three parameters: the input folder, the output folder, and the number of reducers.
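
    A minimal sketch of reading these three parameters is shown below, written in Scala; the same logic carries over to a driver in another JVM language. The names JobArgs, Config and parse are hypothetical, and requiring a positive reducer count is an assumption. In a Hadoop driver the parsed value would typically be passed to Job.setNumReduceTasks.

```scala
object JobArgs {
  // Hypothetical container for the three command-line parameters.
  final case class Config(input: String, output: String, numReducers: Int)

  // Expect exactly: <input folder> <output folder> <number of reducers>.
  // Assumption: the number of reducers must be a positive integer.
  def parse(args: Array[String]): Either[String, Config] = args match {
    case Array(in, out, n) =>
      scala.util.Try(n.toInt).toOption.filter(_ > 0) match {
        case Some(k) => Right(Config(in, out, k))
        case None    => Left(s"number of reducers must be a positive integer, got '$n'")
      }
    case _ => Left("usage: <input folder> <output folder> <number of reducers>")
  }
}
```

    Returning an Either keeps the error message separate from the happy path, so the caller can print the usage line and exit when the arguments are wrong.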

    If you have already submitted, please update your code and submit again. I am sorry for the inconvenience caused.


  • Project 1 has been released

    Posted by Xin Cao Saturday 18 August 2018, 10:24:03 PM.

    Dear All,

    You have three weeks for the project. The deadline is Sunday, Sep 09.

    I suggest you finish next week's lab first and then begin to work on the project.

    The efficiency of your code will be considered during the marking.



  • Release date of project 1

    Posted by Xin Cao Tuesday 14 August 2018, 12:46:54 PM.

    Dear All,

    Project 1 will be released after tomorrow's class, since it requires some knowledge from tomorrow's lecture. You will still have 3 weeks to work on this project.



  • Monday evening's lab (M18A)

    Posted by Xin Cao Sunday 05 August 2018, 05:51:58 PM.

    Dear Students in M18A,

    I am planning to cancel this lab, since it conflicts with the deep learning course. There are only 8 students currently in this lab, and many of you also need to attend COMP9444.

    There are still some vacancies in the other labs. Please attend a lab in another time slot.




COMP9313 18s2 (Big Data Management) is powered by WebCMS3
CRICOS Provider No. 00098G