Notices

  • School's decision on exam plagiarism

    Posted by Xin Cao Tuesday 06 September 2022, 08:23:00 AM.

    To those who obtained a WD grade,

    I have just been notified by the school of how the exam plagiarism cases will be handled.

    For students who are in the final term at UNSW, you will be eligible to sit the supplementary exam this Thursday. However, your mark will be capped at 50 even if you can pass the course.

    For the other students, you will receive an allegation of plagiarism, and you can provide your response within 5 business days.

    I am confirming the list of students in their last term at UNSW with the Nucleus. These students will receive a confirmation email from me no later than tomorrow.

    Regards,

    Xin

  • Supplementary Exam Date and Time

    Posted by Xin Cao Friday 02 September 2022, 09:28:03 AM, last modified Friday 02 September 2022, 01:15:14 PM.

    The supplementary exam will take place next Thursday (8th September), and the time is again from 1pm to 5pm.

    If you are sitting the supplementary exam, I will email the exam paper to you.

    I will also send an email one day before the exam date to confirm your attendance.

    I have sent an email to all students who will sit the supplementary exam. Currently, there are only 10 students on the list. If you haven't received the email, you are not eligible for the supplementary exam. If your current mark is WD, you still need to wait for the school's decision.

    Regards,

    Xin

  • Serious Plagiarism Behaviors detected in the Final Exam

    Posted by Xin Cao Monday 29 August 2022, 10:27:54 AM, last modified Friday 02 September 2022, 06:44:12 PM.

    Dear All,

    We finished the marking of the final exam over the weekend.

    Marking took the teaching team and me much longer than expected because we found many similar exam scripts. Some of them are almost identical on all 6 exam questions. It is very disappointing to see such results.

    All such exam scripts have already been reported to the school. This is the first time I have detected such serious plagiarism in the final exam of COMP9313. We cannot invigilate the online exam, but that does not mean we can do nothing during the marking.

    "WD" will be displayed as the marks for these exam scripts, and the school will make a decision later after checking them.

    Before emailing me to dispute your exam mark, please first consider whether you conducted yourself properly in the exam.

    Regards,

    Xin


  • Marks of Project 3 Released

    Posted by Xin Cao Sunday 28 August 2022, 01:35:35 AM.

    Dear All,

    We have just released the marks of project 3.

    As mentioned previously, the efficiency mark is based on ranking. However, we also check your code and make sure that you've used the correct optimization techniques. For example, it is not acceptable to broadcast the entire dataset, since such a method does not work on large-scale datasets. So, it is possible that you have a low efficiency mark even though you feel that your program runs very fast.

    We need to finalize the marks on Monday. Therefore, if you have any doubts about your mark, please contact us (the corresponding marking tutor) on Sunday. We will not be able to update your marks after Monday.

    Regards,

    Xin

  • Errata and FAQs of the exam paper

    Posted by Xin Cao Tuesday 23 August 2022, 01:31:01 PM, last modified Tuesday 23 August 2022, 05:37:13 PM.

    1. In Question 2.(b), the correct sample output should be:

    2. In Question 6, it should be "four users u1, u2, u3 and u4, and five movies m1, m2, m3, m4, and m5", not three movies.

    3. In Question 3.(b), it seems that the description is not clear enough. I use an example as below to explain the question:

    Assume the list is (1, 2), (3, 4), (3, 5), (1, 99), (3, 110).

    The min value for key 1 is 2, the max value for key 1 is 99, and the gap is 97.

    The min value for key 3 is 4, the max value for key 3 is 110, and the gap is 106.

    Thus, 1 will be filtered out and 3 should be included in the result.

    4. In Question 6, when selecting the top-k similar users/items, use all the possible users/items in the dataset.

    5. In Question 3.(a), it does not matter if you change all characters to lower case or not. There are no marks on this point.

    6. The second Question 3.(b) should have question number 3.(c).

    7. When submitting your pdf file to Moodle, please put your zID at the beginning of the file name. If the file is large, please submit it earlier just in case there is some network issue.

    8. The submission due time is approaching. Because this is an online exam, file submission may take some time, and some of you may have network issues, so you are given 20 more minutes. Please use the last 20 minutes to check your answers and to prepare the file you need to submit.

    9. If your submission is a few seconds late, that is fine. No need to send me an email to ask about this.
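
    The example given for Question 3.(b) in item 3 above can be sketched in a few lines of plain Python. This is only an illustration of the min/max gap computation on the list from that item, not the expected exam answer. The threshold value of 100 is an assumption chosen so that key 1 (gap 97) is filtered out and key 3 (gap 106) is kept; the actual threshold is stated in the exam paper, not in this notice. The function names are hypothetical.

```python
def gap_per_key(pairs):
    """Map each key to (max value - min value) over all its pairs."""
    lo, hi = {}, {}
    for key, value in pairs:
        lo[key] = min(value, lo.get(key, value))
        hi[key] = max(value, hi.get(key, value))
    return {key: hi[key] - lo[key] for key in lo}


def keys_with_large_gap(pairs, threshold):
    """Return, in sorted order, the keys whose gap exceeds the threshold."""
    gaps = gap_per_key(pairs)
    return sorted(key for key, gap in gaps.items() if gap > threshold)


# The list from item 3 above.
pairs = [(1, 2), (3, 4), (3, 5), (1, 99), (3, 110)]

# ASSUMPTION: 100 is a made-up threshold, not the one in the exam paper.
ASSUMED_THRESHOLD = 100

print(gap_per_key(pairs))                             # gaps: key 1 -> 97, key 3 -> 106
print(keys_with_large_gap(pairs, ASSUMED_THRESHOLD))  # only key 3 remains
```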

  • Solutions to Lab 6 problems using DataFrame APIs

    Posted by Xin Cao Monday 22 August 2022, 08:10:50 PM.

    Dear All,

    A reminder that the final exam will take place from 1pm to 5pm tomorrow.

    The solutions to Lab 6 problems using DataFrame APIs are provided now. Since the DataFrame APIs were not introduced in detail due to my sickness, I have reduced the weight and difficulty of the DataFrame question. You do not need to worry much about that. Please read the provided solution code to understand the differences between the RDD APIs and the DataFrame APIs.

    You can either handwrite your answer or type it on your computer for the final exam. Finally, you just need to submit one pdf file to Moodle including all your answers to the exam questions.

    During the exam, if you have any questions, please send me an email. I will reply to you as soon as possible.

    Regards,

    Xin

  • Exam time (Aug 23: 1pm - 5pm)

    Posted by Xin Cao Thursday 18 August 2022, 12:28:08 PM, last modified Friday 19 August 2022, 09:31:59 PM.

    Dear All,

    Just a reminder that the final exam time is from 1pm to 5pm in the afternoon of 23rd Aug. The exam paper will be released on WebCMS3 at 1pm, and you need to submit a pdf file containing your answers in Moodle before 5pm.

    Please begin to prepare for the final exam soon after you submit project 3.

    The last lecture provides all the information about the exam and some previous years' exam questions. Please read the slides and watch the recording of that lecture if you haven't done this.

    The guest lecture link can be found here: https://au-lti.bbcollab.com/recording/fabfbf2365064710aaed83c0e6913a5c . You can watch it if you are interested in the topic.

    The solution for project 2 has been released. Please compare your method with the given solution.

    Remember that Google Dataproc is optional in project 3. Please focus on your local running time before the submission. You can keep working on Dataproc after the deadline or after the exam. We are still happy to provide more assistance.

    I forgot to mention one thing: the number of pairs on the entire ABC news dataset with 0.8 as threshold is 1,294,913. Please check if your program can obtain the same result.
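
    For a quick sanity check on a small sample before running on the full dataset, a brute-force pair counter such as the sketch below may help. It is not the project solution and it does not scale. It assumes Jaccard similarity over token sets and counts a pair when the similarity is at least the threshold; both are assumptions, so please follow the project specification for the actual similarity definition. The function names and the three sample headlines are made up for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two token collections (ASSUMED measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def count_similar_pairs(docs, threshold):
    """Count unordered pairs with similarity >= threshold (brute force, O(n^2))."""
    return sum(1 for x, y in combinations(docs, 2) if jaccard(x, y) >= threshold)


# Made-up sample headlines, NOT taken from the ABC news dataset.
headlines = [
    "council approves new water plan",
    "council approves new water plan today",
    "storm closes coastal roads",
]
tokens = [headline.split() for headline in headlines]

# The first two headlines share 5 of 6 distinct tokens (about 0.83),
# so exactly one pair passes the 0.8 threshold.
print(count_similar_pairs(tokens, 0.8))
```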

    Regards,

    Xin

  • Deadline of Project 3 Extended

    Posted by Xin Cao Wednesday 10 August 2022, 12:26:58 PM.

    Dear All,

    I have received several emails requesting an extension on project 3. I know that most of you have final exams next week. Considering this, I would like to extend the deadline of project 3 to midnight of next Friday. You will have 5 more days to work on it. In addition, using Google Dataproc is optional now (it seems that most of you did not read the previous notice in WebCMS3). This could also save you some time. However, it is still strongly recommended that you test your solution on three clusters in Dataproc.

    Since the deadline is extended, we will also have one more consultation next Wednesday at 11 am. If you have more questions regarding the project and the final exam, please join the session.

    It seems that some of you have run into problems using the DataFrame APIs. I will release the solutions to the problems in Lab 6 using DataFrame APIs later this week. I hope they will help you complete project 3. Note that it is totally fine to use only the RDD APIs in this project. If you feel that you are not familiar with DataFrame, you can stick to RDD, and you will not lose marks for this. The marks of project 2 will be released this week as well.

    Finally, please kindly provide some constructive feedback on this course and my teaching. Thank you very much.

    Regards,

    Xin

  • Reminder: Guest lecture at 2pm today.

    Posted by Xin Cao Thursday 04 August 2022, 11:42:13 AM.

    Dear All,

    We finished all the lectures yesterday. We will still have one more consultation session at 11am next Wednesday.

    This is just a reminder that we will have a guest lecture at 2pm today. The lecturer will introduce big data management techniques in Amazon. Please join the session if you are interested: https://au.bbcollab.com/guest/f14def40113a476195aadb18a771bd6e .

    Two more test cases on project 3 have been released to help you debug and test your solution. If you haven't worked on the project, please start as soon as possible. Please watch the recording of the session on project 3 in Moodle first.

    The school has replied to me that we cannot get support on Google Dataproc this term. Thus, the Google Dataproc requirement is optional in project 3. If you cannot access Google Cloud, you only need to submit two files: SimilarNews.scala and Optimization.pdf. We will mark your submissions based on the local running time. However, it is strongly suggested that you test your solution on Google Dataproc and see how to improve the performance of distributed computation using a real cluster.

    Regards,

    Xin

  • Regarding Google Dataproc Account

    Posted by Xin Cao Monday 01 August 2022, 03:39:42 PM.

    Dear All,

    It seems that most students currently in China cannot register with Google and get the $300 credit for using Dataproc services.

    Some onshore students also reported to me that they have used Google Cloud before and cannot get the credit again.

    I have contacted the school to see if it is possible to get some financial support. I am still waiting for the reply.

    Please work on project 3 locally first. If the problem cannot be resolved, we will remove the Dataproc requirement from project 3 for the students who cannot access Dataproc, and will mainly check the efficiency based on the local running time to guarantee fairness.

    BTW, I have invited a friend from Amazon in Sydney to deliver a guest lecture about big data management in Amazon from 2pm to 3pm this Thursday. Please join the lecture if you are interested via this link: https://au.bbcollab.com/guest/f14def40113a476195aadb18a771bd6e .

    Regards,

    Xin

  • One more session on explaining project 3

    Posted by Xin Cao Wednesday 27 July 2022, 11:09:33 AM.

    Dear All,

    I am really sorry that I forgot to click "Recording" in Collaborate for today's lecture. Please watch last year's recording to learn about LSH (only the first hour): https://au-lti.bbcollab.com/recording/384f63fa421c45808c2ed9ca713c775a .

    I also explained the project today. I think this is very important to most of you. Therefore, I will do one more session on project 3 tomorrow at 11 am. If you would like to know more hints about the project, or you have some questions regarding the project, please join the session tomorrow.

    Please let me know if you cannot get the free credits in Google Dataproc by this week. I will see what might be a solution for you.

    Regards,

    Xin

  • More test cases for project 2 released

    Posted by Xin Cao Wednesday 20 July 2022, 02:31:35 AM.

    Dear All,

    We have released more test cases for project 2. Please use them to debug your code. The commands for running your code are also given with the test cases. Please note that project 2 requires you to use the RDD API for both problems. For the second problem, you must use the pregel operator.

    The solutions to the problems in Lab 7 have also been released. Please compare your methods with the provided solutions.

    Remember that we will have a consultation session on Wednesday.

    Regards,

    Xin

  • Project 1 marks released

    Posted by Xin Cao Tuesday 19 July 2022, 05:25:50 AM, last modified Tuesday 19 July 2022, 05:28:25 AM.

    Dear All,

    As you may already notice, the marks of project 1 have been released in Moodle. Besides the marks, the tutors also provided some feedback to you. If you have any doubt about your mark, please contact the tutor who marked your submission and cc the email to me. We will check the problem for you as soon as possible.

    Please note that we run your code on Hadoop using the command provided in the project specification. Before contacting the tutors, please make sure that you can obtain the correct results by running in Hadoop. In addition, if you have been granted an extension, but received a late submission penalty, please also contact us. We will update your mark accordingly.

    I am really sorry that I still cannot teach for two hours this week. My throat still hurts a lot, even though I've taken the medicine given by my GP. However, I can now speak in a low voice. I think I will be able to do the consultation on Wednesday this week. I will explain the solution to project 1, and also briefly talk about project 2 in that session. See you on Wednesday, if you would like to attend the consultation session.

    Since project 3 is about finding similar pairs, I will cover this topic next week. Project 3 will be released this Sunday. This week's topic has been changed to streaming data mining. For this week's lectures, please watch the videos at the following links:

    https://au-lti.bbcollab.com/recording/e0db87c24f98463392aa9010d04289f7

    https://au-lti.bbcollab.com/recording/64d7f88fcb0f41248d7b70501792aa87

    It is unfortunate that I am the only person in CSE who can teach the big data course. The school cannot find an alternative person who can help during the period when I am sick... My apologies again for all the inconvenience caused.

    Regards,

    Xin

  • Lab 7 released earlier

    Posted by Xin Cao Wednesday 13 July 2022, 11:06:15 PM.

    Dear All,

    To complete Project 2, I suggest that you first work through the problems in Labs 5 to 7.

    The solutions to the problems in lab 6 have been released. If you haven't worked on lab 6, please try to solve the problems on your own first, and then compare with the solutions.

    Lab 7, which covers GraphX, has also been released. The solutions will be released next Wednesday.

    In addition, if you would like to work in an IDE for Spark programming, you can refer to the guide at: https://webcms3.cse.unsw.edu.au/COMP9313/22T2/resources/76399 . You can download the Scala-IDE (based on Eclipse) to code and debug your program.

    Try your best to submit your project solutions before the deadline (next Sunday). If you have difficulties, please submit a special consideration request and contact me.

    Regards,

    Xin

  • Still cannot teach due to my sickness

    Posted by Xin Cao Tuesday 12 July 2022, 12:06:34 AM.

    Dear All,

    I am really sorry that I still cannot teach this week due to my sickness. Besides coughing, my throat hurts a lot, which leaves me unable to talk. This week's QA session will also be canceled. I will answer your questions in WebCMS3.

    Again, please watch the previous years' videos to learn NoSQL, HBase, and Hive:

    https://au-lti.bbcollab.com/recording/d30ac99557e6410592633afc792c201c

    https://au-lti.bbcollab.com/recording/c6b9369e36024f6d89d0e441a5fbe08e

    Note that this week's topic is optional and will not be covered in the final exam. If you do not have enough time, you can focus on the second project first. More test cases for the two problems will be provided within this week.

    Regards,

    Xin

  • The Due Date of Project 2 Extended

    Posted by Xin Cao Thursday 07 July 2022, 11:21:43 PM.

    Dear All,

    Considering that you have not practiced enough on Spark RDD programming, you are given one more week on Project 2. The new deadline is 24/07/2022.

    You need to use RDD with Scala to solve two problems. Lab 5 has been released already. Please follow the instructions to install Spark and practice RDD in the spark-shell. Labs 6 (more on RDD programming) and 7 (GraphX) will be released soon as well, which will help you work on Project 2.

    Regards,

    Xin

  • Java implementation of Project1

    Posted by Xin Cao Sunday 03 July 2022, 09:07:36 PM, last modified Sunday 03 July 2022, 10:54:58 PM.

    Dear All,

    I hope that you all have completed the first project.

    I just received an email reporting that the code can be run by Eclipse, but after being exported as a jar file, the jar cannot be run on Hadoop.

    Please double-check if you also have this problem. Please make sure that your code can be compiled as a jar and then can obtain correct results on Hadoop. Besides exporting the jar using Eclipse, you can also do it using commands.

    Assuming that your Java files are in a folder called "Project1", you can check this with the steps below:

    1. cd Project1

    2. mkdir -p temp

    3. hadoop com.sun.tools.javac.Main *.java -d temp

    4. jar cf proj1.jar -C temp ./

    5. hdfs dfs -rm -r output

    6. hadoop jar proj1.jar comp9313.proj1.Project1 input output 19 2

    In addition, if you use Java, you do not need to add quotation marks to the results. In Python, MRJob adds the quotation marks automatically. The java version result on the test data has been released as well, which is basically the python result without the quotation marks. Note that there is no ";" at the end of each line.
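
    If you want to compare a Python (MRJob) output file against the Java version result, the quotation marks can be removed with a small script like the sketch below. It relies on MRJob's default output protocol, which JSON-encodes the key and the value and separates them with a tab. The function name and the sample line are hypothetical; the real output format is defined in the project specification.

```python
import json


def strip_mrjob_quotes(line):
    """Turn one MRJob output line into the unquoted form.

    Each tab-separated field that is a JSON string is replaced by its
    decoded text; any other field (e.g. a number) is left untouched.
    """
    plain = []
    for field in line.rstrip("\n").split("\t"):
        try:
            decoded = json.loads(field)
        except ValueError:
            decoded = field  # not valid JSON, keep the field as it is
        plain.append(decoded if isinstance(decoded, str) else field)
    return "\t".join(plain)


# HYPOTHETICAL sample line, only to show the effect of the conversion.
sample = '"2020"\t"council,3"\n'
print(strip_mrjob_quotes(sample))
```

    Applying this to every line of the MRJob output should give a file that can be compared directly (for example with diff) against the released Java result.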

    Regards,

    Xin

  • Please watch videos for this week's lectures

    Posted by Xin Cao Tuesday 28 June 2022, 12:45:43 AM.

    Dear All,

    I have been sick for the past few days, and I am still suffering from fever, sore throat, and continuous coughing. I am sorry that I am not able to teach this week, since I can hardly talk right now.

    Please watch the previous years' videos to learn Spark DataFrame, Spark SQL, and GraphX:

    https://au-lti.bbcollab.com/recording/6786e0c7914741c3a61227e499da2ec2

    https://au-lti.bbcollab.com/recording/be197718af0e44808e68bb55a26f1369

    This week's QA session will also be canceled. If you have any questions, please post them in the forum, and I will answer you there.

    My apologies again for the inconvenience caused. I will talk more about Spark DataFrame after the recess week.

    I will release one more test case for project 1 soon. Project 2 will be released this Sunday. It requires you to use Spark RDD and GraphX to solve two problems, respectively.

    Regards,

    Xin

  • This Sunday is the last day of dropping courses

    Posted by Xin Cao Friday 24 June 2022, 02:14:50 AM.

    Dear All,

    I hope that you have enjoyed the course so far.

    Just a reminder that after 11:59 on Sunday 26 June, you'll be charged a fee for enrolling in this course.

    If you feel that you have difficulties using Hadoop, doing the lab problems, working on the first project, or you do not quite like the course contents, you can consider dropping the course within this week. If you feel that you may not have enough time in this term, you can consider taking this course in the future. The third project will require you to spend enough effort to get a good mark.

    If you would like to stay in the course, I would appreciate it a lot if you could provide some feedback to me either via email or in the forum (you can do this anonymously in WebCMS3). I will try my best to improve the teaching of this course in the remaining weeks.

    Regards,

    Xin

  • Project 1 Released

    Posted by Xin Cao Monday 20 June 2022, 12:06:58 AM.

    Dear All,

    Project 1 has been released already, and you have two weeks to work on it. You can choose either Java or Python (MRJob).

    Labs 3 and 4 have been released as well, which can greatly help you complete the project. I strongly suggest that you first try the lab problems and then work on the project. As one of you suggested, the solutions will be released by the weekend of the same week.

    I have still been receiving emails about installing and configuring Hadoop. If you are having difficulties making Hadoop work on your computer, please contact us as soon as possible. Let's solve your problem during the lab and QA sessions next week.

    Regards,

    Xin


  • Hadoop Installation & Configuration and Project 1

    Posted by Xin Cao Friday 17 June 2022, 12:27:53 AM.

    Dear All,

    If you haven't successfully installed and configured Hadoop, please contact the course admin and me as soon as possible. I am sorry for the problem caused by the Mac M1 chip, which was unexpected. If the VM is lagging, please try to reduce the memory allocated to the VM from 8G to 4G.

    We are going to release the first project soon. I need to make sure that everyone already has the working environment and is well prepared for the project now.

    As already mentioned in the lecture, Project 1's due date has been moved to Week 5, and you will still have two full weeks to work on it.

    I will also release Labs 3 and 4 soon, which can greatly help you complete the first project.

    Regards,

    Xin


  • A new VM image with pre-installed Hadoop

    Posted by Xin Cao Thursday 09 June 2022, 07:47:31 PM.

    Dear All,

    I hope that you have successfully installed and configured Hadoop by following the instructions in Lab 1.

    For Windows users, I would suggest you use the virtual Ubuntu OS for doing the labs and projects.

    For Linux users, you just need to install and configure Hadoop on your own computer.

    For Mac users, if your computer does not use the M1 chip, you can still use the virtual machine, either VirtualBox or VMware. However, if your computer uses the M1 chip, you will have to install and configure Hadoop on Mac OS directly. I've just borrowed one such laptop, and I will see if there is any problem with the current lab document. I am sorry for the problem caused by this, which was unexpected. The VM image worked perfectly on Mac in previous years.

    If you still cannot get Hadoop to run on Mac OS, one alternative might be to use UTM. You can try to download UTM and install Ubuntu in the virtual machine, and then install Hadoop by following Lab 1.

    A new VM image with pre-installed Hadoop can be downloaded from the two links below. Note that YARN is not configured. If you need to run YARN (e.g., required by MRJob), please follow Lab 1 to configure and then start it.

    https://mega.nz/file/amxgwK6A#bXwJX2ZydXaYFjEN1Qgl9iZ8SeRzVz15r8Z2_qj5eQw

    https://drive.google.com/file/d/1eqtd1TGXa7R0EdGAg6zvpfM6r68uveb_/view?usp=sharing

    If you have already successfully completed Lab 1, you can ignore this message.

    Let's try to solve everything related to Hadoop installation and configuration within this week. Lab 2 will be released later this week.

    Regards,

    Xin

  • VM image and Lab 1

    Posted by Xin Cao Friday 03 June 2022, 02:03:15 AM.

    Dear All,

    The next week's lab document has already been uploaded to WebCMS3.

    The VM image can be downloaded at:

    https://mega.nz/file/SqIz1Jpb#Ay5ioC4EkiQgZVuVYDUL6hfO2LiBvsJxjTXX1qAnxrg

    https://drive.google.com/file/d/1ymUkS422jiNnEKU2witPb2fIL8wf6eME/view

    Please download the image this week, and get ready for next week's lab. Please let me know if you have any problem downloading the image, especially if you are currently offshore. Another option is to download the Ubuntu OS from the official website, and then install Ubuntu 22.04 in VirtualBox yourself. Or, you can install and configure Hadoop on your own laptop, if you already have a Linux OS.

    For students who use a Mac with an M1 chip, it seems that neither VirtualBox nor VMware works. You will need to install and configure Hadoop on your own laptop. Please try to follow the lab instructions to see if you can successfully do it by next week. Contact us if you run into difficulties.

    Regards,

    Xin

  • Welcome to COMP9313

    Posted by Xin Cao Sunday 29 May 2022, 10:28:57 PM.

    Dear COMP9313 Students,

    Welcome to COMP9313! I hope that you will enjoy the course in this term.

    The outline of the course is now available. Please see "Course Outline" in the left panel.

    Due to COVID-19, the course will still be delivered online.

    Our first lecture is on Tuesday in Week-1. Please log into Moodle to access the lecture link. In case you cannot find it, you can also access through https://au.bbcollab.com/guest/cb6346ec1a4a4837821fbed6d514f43d .

    I look forward to seeing you all soon.

    Regards,

    Xin



COMP9313 22T2 (Big Data Management) is powered by WebCMS3
CRICOS Provider No. 00098G