Dear All,
1. The solution to the last assignment has been released. The tutors are still marking it, and I will try to release the marks by tomorrow.
2. The marks for Project 3 will be released later tonight. We consider both the local runtime and the AWS elapsed time when marking efficiency. Note that your code may be fast and still receive a low efficiency mark. This happens when your algorithm does not scale to large data sets. For example, some of you broadcast the entire data set, which is not practical.
Most of you missed one optimisation: avoid generating duplicate pairs by computing the similarity of each pair on only one of its common elements. The idea is that, for a pair of records sharing several elements, you do the computation only on the most infrequent shared element. On every other shared element you can safely discard the pair instead of computing its similarity again. This can greatly improve efficiency.
The two additional test cases used in marking have also been released. Please check your code against them before emailing us.
3. The exam is closed book, and you are NOT allowed to bring anything with you except a calculator. This course does NOT have a supplementary exam. You may not have enough time to answer all the questions, so manage your time wisely.
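To illustrate the optimisation described in point 2, here is a minimal single-machine sketch in Python. It is not the reference solution and not Spark code. The record layout, the function names, and the use of Jaccard similarity are all assumptions made for the example. Each pair is assigned to its least frequent shared element (ties broken by the element value) and is skipped on every other shared element.

```python
from collections import Counter, defaultdict
from itertools import combinations


def first_common(xs, ys, rank):
    """Return the lowest-ranked element present in both rank-sorted lists."""
    i = j = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            return xs[i]
        if rank(xs[i]) < rank(ys[j]):
            i += 1
        else:
            j += 1
    return None


def similar_pairs(records):
    """records: dict of record id -> set of elements (hypothetical layout)."""
    # Global element frequencies; ties are broken by the element itself so
    # that every pair agrees on one "most infrequent" shared element.
    freq = Counter(e for elems in records.values() for e in elems)

    def rank(e):
        return (freq[e], e)

    # Each record as a list ordered from least to most frequent element.
    ordered = {rid: sorted(elems, key=rank) for rid, elems in records.items()}

    # Inverted index: element -> ids of the records containing it.
    index = defaultdict(list)
    for rid, elems in records.items():
        for e in elems:
            index[e].append(rid)

    results = []
    for e, rids in index.items():
        for a, b in combinations(sorted(rids), 2):
            # Only the least frequent shared element handles this pair;
            # on any other shared element the pair is discarded.
            if first_common(ordered[a], ordered[b], rank) != e:
                continue
            common = records[a] & records[b]
            union = records[a] | records[b]
            results.append((a, b, len(common) / len(union)))  # Jaccard (assumed)
    return results
```

Because every pair is emitted exactly once, no separate de-duplication step is needed afterwards.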
Good luck!
Regards,
Xin
Dear All,
In question 2, please change h1(n) = 3n-1 to h1(n)=5n-1.
The previous hash function causes collisions, which makes the problem complicated.
If you have already submitted, please update your answer and submit again.
I am sorry for the inconvenience.
Regards,
Xin
Dear All,
Below are some points I need to clarify:
1. The number of nodes in a cluster means the total number of nodes, including both the master and the core nodes. This is to reduce your cost for this project.
2. When recording the run time, use the elapsed time shown in the step tab, excluding the time spent setting up and terminating the cluster.
3. After rounding the similarities to six decimal places, please remove the trailing zeros. You can use the following code to do this: BigDecimal(similarity).setScale(6, BigDecimal.RoundingMode.HALF_UP).toDouble
4. Remove "setMaster" from your code when submitting your solution. We will configure this in spark-submit during marking.
5. It is not guaranteed that the elements are sorted by frequency in the data sets. A new test data set is released for you to check the correctness of your solution.
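As a rough illustration of point 3, here is a Python analogue of that rounding step. It is a sketch only: the function name is made up for the example, and your submission should use the Scala expression given above. It rounds half up to six decimal places and converts back to a float, which is what drops the trailing zeros when the value is printed.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_similarity(similarity: float) -> float:
    # Start from the float's shortest decimal text, round half up to six
    # decimal places, then convert back to a float so that a value such
    # as 0.250000 prints as 0.25.
    rounded = Decimal(str(similarity)).quantize(
        Decimal("0.000001"), rounding=ROUND_HALF_UP
    )
    return float(rounded)
```

For example, `round_similarity(2 / 3)` gives `0.666667`, and `round_similarity(0.25)` prints as `0.25` rather than `0.250000`.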
The marks for Project 2 have been released, together with the test cases used in marking. If you have any questions about your mark, please first test your code using these test data sets. If your results are correct, then contact me or the tutors.
The last assignment is also released. The submission deadline is next Sunday.
Finally, please remember to provide your valuable comments in myExperience when you have time. Thank you!
Regards,
Xin
Dear All,
Yesterday I received several emails requesting a test case.
One test case for problem 1 has been released. Please use it to check the correctness of your code.
The marks for Project 1 have been released as well. If you have any questions, please contact me or the course admin as soon as possible. We can check your submission again.
Regards,
Xin
Dear All,
As mentioned in the lecture, the submission deadline has been extended to Sep 30. You have one more week to work on it.
I have also fixed some errors in the project specification, as follows:
1. Sort the results, in the format (w_i, w_j, rf), first by w_i in ascending order, then by rf in descending order. If the relative frequencies are also the same, sort by w_j in ascending order.
2. Given the example graph, there are three cycles, and the output should be 3.
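The sort order in point 1 can be sketched in a few lines of Python. This only illustrates the ordering and is not the project code. The function name and the in-memory list of tuples are assumptions made for the example.

```python
def sort_results(results):
    """Sort (w_i, w_j, rf) tuples by w_i ascending, rf descending, w_j ascending."""
    # Negating rf turns the ascending sort into a descending one for that
    # field, while w_i and w_j keep their natural ascending order.
    return sorted(results, key=lambda r: (r[0], -r[2], r[1]))


if __name__ == "__main__":
    sample = [("a", "c", 0.5), ("a", "b", 0.5), ("b", "a", 1.0), ("a", "d", 0.75)]
    for row in sort_results(sample):
        print(row)
```

In the sample, ("a", "d", 0.75) comes first because it has the highest rf for "a", and ("a", "b", 0.5) precedes ("a", "c", 0.5) because equal rf values fall back to w_j.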
Regards,
Xin
Dear All,
One test data set has been released for you to check the correctness of your program. Please follow the steps below:
1. Start HDFS and YARN.
2. Make sure your Java version is 7 (check the configuration file ~/.bashrc).
3. Download the data set package and the script into a folder.
4. Copy InvertedIndex.zip to this folder as well.
5. Make the script executable, i.e., chmod +x run.sh.
6. Run the script and wait for the result.
7. Compare your result (part-r-00000) with the sample output (result) using "diff".
Previously, you were required to set the number of reducers to 3. Now, please make this number a parameter of your program. That is, your program should receive three parameters: the input folder, the output folder, and the number of reducers.
If you have already submitted, please update your code and submit again. I am sorry for the inconvenience.
Regards,
Xin
Dear All,
You have three weeks for the project. The deadline is Sunday, Sep 09.
I suggest you finish next week's lab first and then begin to work on the project.
The efficiency of your code will be considered during marking.
Regards,
Xin
Dear All,
Project 1 will be released after tomorrow's class, since it requires some knowledge from tomorrow's lecture. You will still have 3 weeks to work on this project.
Regards,
Xin
Dear Students in M18A,
I am planning to cancel this lab, since it clashes with the deep learning course. There are currently only 8 students in this lab, and many of you also need to attend COMP9444.
There are still some vacancies in the other labs. Please attend a lab in another time slot.
Regards,
Xin