Dear All,
1. The solution to the last assignment has been released. I will try to release the marks by tomorrow; the tutors are still working on this.
2. The marks for Project 3 will be released later tonight. We consider both the local runtime and the AWS elapsed time when marking efficiency. Note that your code can be fast and still receive a low efficiency mark: this happens when your algorithm does not scale to large data sets. For example, some of you broadcast the entire data set, which is not practical.
Most of you missed this optimisation: avoid generating duplicate pairs, and compute the similarity of each pair on only one common element. The idea is that, for a pair of records sharing several elements, you do the computation only on the least frequent shared element. On every other shared element, you can then safely discard the pair without computing its similarity again. This can greatly improve efficiency.
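To make the idea concrete, here is a minimal sketch in plain Scala. It is only an illustration, not the reference solution: the record layout, the freq map, and the function names are assumptions I am introducing here.

    // Hypothetical helper: freq maps each element to its global frequency.
    // Ties in frequency are broken by element id so the choice is deterministic.
    def leastFrequentCommon(r1: Seq[Int], r2: Seq[Int],
                            freq: Map[Int, Long]): Option[Int] = {
      val common = r1.toSet intersect r2.toSet
      if (common.isEmpty) None
      else Some(common.minBy(x => (freq(x), x)))
    }

    // A candidate pair (r1, r2) generated from the inverted list of element e
    // is kept only if e is the least frequent element the two records share.
    // The same pair produced under any other shared element is discarded,
    // so each pair's similarity is computed exactly once.
    def shouldCompute(e: Int, r1: Seq[Int], r2: Seq[Int],
                      freq: Map[Int, Long]): Boolean =
      leastFrequentCommon(r1, r2, freq).contains(e)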
The two additional test cases used in the marking have also been released. Please check your code against them before emailing us.
3. The exam is closed book, and you are NOT allowed to bring anything with you except a calculator. This course does NOT have a supplementary exam. You may not have enough time to answer all the questions, so manage your time wisely.
Good luck!
Regards,
Xin
Dear All,
In Question 2, please change h1(n) = 3n - 1 to h1(n) = 5n - 1.
The previous hash function causes collisions and makes the problem unnecessarily complicated.
If you have already submitted, please update your answer and submit again.
I am sorry for the inconvenience.
Regards,
Xin
Dear All,
Below are some points I need to clarify:
1. The number of nodes in a cluster means the total number of nodes, including both the master and core nodes. This is to reduce your cost on this project.
2. When recording the run time, use the elapsed time shown on the Steps tab, which excludes the time spent setting up and terminating the cluster.
3. After rounding the similarities to six decimal places, please remove the trailing zeros. You can use the following code to do this (note the conversion back to Double, which drops trailing zeros when printed; see the first example after this list): BigDecimal(similarity).setScale(6, BigDecimal.RoundingMode.HALF_UP).toDouble
4. Remove "setMaster" from your code when submitting your solution. We will configure the master in spark-submit during marking (see the second example after this list).
5. It is not guaranteed that the elements in the data sets are sorted by frequency. A new test data set has been released for you to check the correctness of your solution.
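Regarding point 3, a self-contained example of the rounding call (the input value 0.1234567 is made up for illustration):

    val similarity = 0.1234567
    val rounded = BigDecimal(similarity)
      .setScale(6, BigDecimal.RoundingMode.HALF_UP)
      .toDouble
    println(rounded)  // 0.123457; a value such as 0.5 prints as 0.5, not 0.500000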
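Regarding point 4, a sketch of what the context setup should look like; the application name "Project3" is only a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    // Note: no .setMaster(...) here. The master is supplied on the command
    // line via spark-submit --master during marking.
    val conf = new SparkConf().setAppName("Project3")
    val sc = new SparkContext(conf)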
The marks for Project 2 have been released, and the test cases used in marking have also been released. If you have any questions about your mark, please first test your code using these data sets. If your results are correct, then contact me or the tutors.
The last assignment has also been released. The submission deadline is next Sunday.
Finally, please remember to provide your valuable comments in myExperience when you have time. Thank you!
Regards,
Xin