  • Warning on AWS!

    Posted by Xin Cao Saturday 28 October 2017, 07:40:08 PM.

    Dear All,

    Please follow the lab instructions to work on AWS.

    Test everything on your local machine first. Run your program on AWS only once you believe that it generates correct results and is efficient enough. 100 dollars should be enough for you to finish Project 4.

    Remember to terminate the cluster after you finish your jobs!!! Otherwise you may need to pay the extra costs incurred.

    BTW, please provide some feedback in myExperience. Any suggestions or comments would be greatly appreciated.



  • FAQs on Project 4

    Posted by Xin Cao Thursday 26 October 2017, 05:18:04 PM.

    Dear All,

    1. The screenshot you submit should show the cluster information and the runtime, as in the example figure (not reproduced here).

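    2. You CANNOT create a global variable to share information across different mappers and reducers. That may work on your local machine, but it will fail on AWS, where the tasks run in separate processes on different nodes. You need to use the Configuration object to pass the information from the main function to the mappers/reducers: for example, call conf.set(name, value) in main before creating the job, and read the value back with context.getConfiguration().get(name) in the setup method of the mapper or reducer.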

    3. You CANNOT use LSH to do this project, since that can only obtain approximate results. This project requires exact set similarity join results. Please follow the slides and the paper as introduced during the lecture.

    4. As mentioned in the lecture, please briefly describe your optimization techniques in a file named "Optimization.pdf". I've already updated the project specification.
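
    To illustrate what "exact" means in item 3, here is a minimal brute-force sketch in plain Java. It assumes Jaccard similarity, which this post does not actually name, and the class and method names (ExactJaccard, jaccard, selfJoin) are made up for illustration. It compares every pair of records, so it is only a correctness baseline for small local inputs and not the optimized MapReduce approach from the slides and the paper.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ExactJaccard {
    // Exact Jaccard similarity |A ∩ B| / |A ∪ B| (assumed similarity function).
    static double jaccard(Set<Integer> a, Set<Integer> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0; // convention chosen here for two empty sets
        }
        Set<Integer> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // Brute-force self-join: returns index pairs (i, j), i < j, whose
    // similarity is at least the threshold. No pair is missed, unlike LSH.
    static List<int[]> selfJoin(List<Set<Integer>> records, double threshold) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < records.size(); i++) {
            for (int j = i + 1; j < records.size(); j++) {
                if (jaccard(records.get(i), records.get(j)) >= threshold) {
                    result.add(new int[] {i, j});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Set<Integer>> records = new ArrayList<>();
        records.add(new HashSet<>(Arrays.asList(1, 2, 3, 4)));
        records.add(new HashSet<>(Arrays.asList(2, 3, 4, 5)));
        records.add(new HashSet<>(Arrays.asList(7, 8)));
        for (int[] pair : selfJoin(records, 0.5)) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

    A baseline like this can be useful for checking the output of your real implementation on a small sample before you run anything on AWS.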

    BTW, you can check your marks for Project 3 now. Contact me as soon as possible if you have any questions.



  • The last assignment released.

    Posted by Xin Cao Sunday 22 October 2017, 07:33:08 PM.

    Dear All,

    The last assignment has been released, and you have two weeks to work on it.

    This is not a coding project: you need to provide your answers in a PDF file. All the questions are from previous years' exams.

    BTW, the marks for Project 2 have been released. If you have any questions about your mark, please contact me asap.

    We will have the last lecture tomorrow.



  • Plagiarism detected in Project 2

    Posted by Xin Cao Tuesday 17 October 2017, 03:17:14 PM, last modified Tuesday 17 October 2017, 04:31:31 PM.

    Dear All,

    As I mentioned yesterday, plagiarism was detected in Project 2.

    It is all right if you copied some code from the internet. I've already provided example code for your reference.

    Here "plagiarism" means that you copied other students' code or sent your code to others. We've found some very similar submissions.

    You need to come to my office or send me an email no later than tomorrow if you cheated in Project 2. We have different punishment levels. I have to resolve this issue soon and release the marks this week.

    Kind regards,


  • Project 4 Released

    Posted by Xin Cao Monday 09 October 2017, 01:05:55 PM.

    Dear All,

    Project 4 requires some knowledge from today's lecture. I will explain more about this project during the lecture.

    You have three weeks to work on this project.

    This project is the most difficult one, and also the most time-consuming. There is no new lab this week, so the lab time is for you to work on this project. The tutor will help you if you have questions or run into problems.



  • Project 3 sample input and output released

    Posted by Xin Cao Friday 06 October 2017, 05:06:10 PM.

    Dear All,

    My apologies for the late release of the sample input and output.

    Please contact me if you run into any problems or have any questions.

    Remember that the deadline is this Sunday.

    The marks of project 2 will be published by next weekend.



  • Project 3 Submission Deadline Extended

    Posted by Xin Cao Sunday 24 September 2017, 07:36:42 PM.

    Dear All,

    As next week is the recess week and some of you may have travel plans, I am extending the submission deadline of Project 3 by one week, to 09:59:59 pm on 8th Oct. I hope this helps you finish the project.

    You need to finish lab 7 before working on Project 3. You should also finish the problems in lab 8, which will help you become more familiar with Scala and the Spark transformation and action operations.

    In the first problem of Project 3, when the result key/value pairs have the same value, rank them according to the alphabetical order of the keys. I've updated the project specification file.
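
    The tie-breaking rule above can be sketched as a two-level comparator. This is only an illustration in plain Java (the project itself uses Scala and Spark), the class and method names are made up, and sorting the values in descending order is an assumption; follow the project specification for the actual direction.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TieBreakRanking {
    // Ranks pairs by value (descending, an assumption) and breaks ties
    // by the alphabetical order of the keys.
    static List<Map.Entry<String, Integer>> rank(Map<String, Integer> pairs) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(pairs.entrySet());
        Comparator<Map.Entry<String, Integer>> byValueDesc =
            Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder());
        Comparator<Map.Entry<String, Integer>> byKeyAsc =
            Map.Entry.<String, Integer>comparingByKey();
        entries.sort(byValueDesc.thenComparing(byKeyAsc));
        return entries;
    }

    public static void main(String[] args) {
        Map<String, Integer> pairs = new HashMap<>();
        pairs.put("spark", 3);
        pairs.put("hadoop", 3); // same value as "spark", so the key decides
        pairs.put("scala", 5);
        for (Map.Entry<String, Integer> e : rank(pairs)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

    In Spark the same idea applies: sort on a composite key made of the value and then the key, rather than on the value alone.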

    The marks of Project 1 have been released, together with the data sets used in marking. If you have any questions please contact me, and I will ask the tutor to check your submission again.

    The Monday after the break week is a public holiday, and there is no lecture that day. We will need to have a lecture in week 13 in order to finish the teaching of all chapters.



  • Project 2 FAQs

    Posted by Xin Cao Saturday 16 September 2017, 12:14:35 AM.

    Dear All,

    1. You will NOT lose marks due to the precision problem. In the project specification, "double precision" means that you should use "double" to store the distances.

    2. Please delete the intermediate folders and files generated during the iterations. Some of you may encounter the "Wrong FS" error. If so, obtain the file system handle explicitly from the HDFS URI:

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(new URI("hdfs://localhost:9000"), conf);

    You can then use fs to delete the folders, for example with fs.delete(path, true), where path is an org.apache.hadoop.fs.Path and true requests recursive deletion.

    3. "\t" display problem. The same "\t" character may be displayed with different widths in your output file. That is caused by your text editor, not by your data. Your data format is correct as long as you use only "\t" as the separator.

    Remember that the submission deadline is this Sunday.



  • Today's lecture is canceled

    Posted by Xin Cao Monday 11 September 2017, 08:45:38 AM.

    Dear All,

    I am very sorry for the late notice.

    I am not able to talk due to a severe cough, and thus I have to cancel today's lecture. I thought I would be better by today, and I didn't expect the cough to last this long.

    Chapter 8 (streaming data mining) will be introduced next week. This week's lab is not affected. Please try to attend, because it is very relevant to your third project.

    My apologies again. Thanks for your understanding.

    Kind regards,


  • Project 2 Submission Deadline Extended

    Posted by Xin Cao Wednesday 06 September 2017, 05:52:03 PM.

    Hi All,

    The submission deadline of project 2 is extended to 09:59:59 pm on 17 Sep 2017. You have one more week to work on it.

    More sample input and output will be provided later this week.

    Consequently, the third project will be released next Friday.

    Kind regards,


  • Project 2 Released

    Posted by Xin Cao Saturday 26 August 2017, 07:32:09 PM.

    Dear All,

    Project 2 is released now. You have two weeks to do this project.

    The solutions to the problems in Lab3 are published as well. Please feel free to contact me if you have any questions.



  • Virtual Machine Image Download

    Posted by Xin Cao Wednesday 26 July 2017, 09:33:06 PM.

    Dear All,

    The image can be downloaded at:

    1. Download and Install VirtualBox

    2. Download the zip file and uncompress it, and rename the file "xubuntu-disk.vmdk" as "xubuntu-disk2.vmdk"

    3. Open VirtualBox, File->Import Appliance

    4. Browse the image folder, select the "*.ovf" file

    5. The image will be imported to your computer, which may take about 10 minutes

    comp9313 is used as both username and password. The hadoop installation path is the same as in the virtual machine on lab computers.

    Hadoop MapReduce and Eclipse+plugin have been installed and configured.

    The video recording of the first lecture is not available yet. I've contacted the IT service center. Hopefully there is no problem...


COMP9313 17s2 (Big Data Management) is powered by WebCMS3
CRICOS Provider No. 00098G