Note: this message is only for students who will attend the supplementary exam (i.e., your current mark is WC).
The supplementary exam will take place next Friday (13th Jan), and the time is again from 1pm to 5pm.
I will send an email one day before the exam date to confirm your attendance. The exam paper will be emailed to you before the exam starts.
Dear All,
Some students told me that the mark received is LE.
I do not know what caused the problem. I will check with the school. The final mark is consistent with what you can see in Moodle now.
Update : You should have received your mark now. If it is not consistent with that in Moodle, please let me know. I have received several emails in the past few days. I am sorry for not replying to you yet because I was on a fever this Wednesday and cannot work much since then. I am better today, and I will process all the unread emails by tomorrow.
Regards,
Xin
Dear All,
The marks of project 3 have been released by the course admin. Please check if you have any questions regarding your mark and contact your tutor asap.
Regards,
Xin
My suggestion is to work on the mapreduce and spark problems later. For mapreduce questions, do not write Python codes. Otherwise, it will take you too much time.
1. In Question 1(b), some words were missing. It should be all the previous operations are very fast. Please see the updated version of the exam paper.
2. The code template of Question 3 has been made available on Moodle now. Please download it.
3. In Question 2(a), the EmployeeID is the primary key, and there could exist duplicate names.
4. In Question 6, please use \beta = 0.8 (jump probability = 1- \beta).
5. In Question 3(a), in the output, "student1: course3,2" should be "student1: course3,1".
The deadline is approaching. Considering the problem with WebCMS3 today, I would like to extend the submission deadline to 5:20. You still have half an hour left to check your answers, generate the pdf files, and submit your zip file to Moodle.
A few seconds' late does not matter. Please do not send me emails regarding this matter.
Dear All,
This is a final reminder of tomorrow's online exam, which will be held from 1pm to 5pm.
You will need to submit three files in Moodle: one python file for the RDD problem, one Python file for the DataFrame problem, and one pdf file including all your answers to the other questions. For MapReduce questions, you only need to provide the pseudo-code. For Spark questions, although the source code is required, it is still acceptable to have some minor errors (such as format issues or typos). The code template and the sample input/output will be given. You can either type your answers in a file on the computer or handwrite your answers on paper and then scan it. Please compress your pdf and the two python scripts into a zip file and then upload them in Moodle. Please get ready at least 5 minutes before the deadline to make sure that you have enough time to submit your files.
The exam paper will be released in WebCMS3. However, it seems that WebCMS3 is extremely slow recently. I even cannot upload the solution to project 2 in WebCMS3 these days. Therefore, I will also release the exam paper at 1pm on my homepage: http://www.cse.unsw.edu.au/~z3515164/exam.pdf . We will also use Moodle as the discussion forum during the exam.
You can now download the solution to project 2 at http://www.cse.unsw.edu.au/~z3515164/proj2_solution.zip . Note that the DataFrame question is much easier than project 2, considering the time limit of the final exam.
Regards,
Xin
Dear All,
It seems that the marking script used by the tutors has some minor problems, and I've let the tutors double-check their work. If you have emailed the tutor (or me), please be patient and wait for our reply. I've got about more than 50 unread emails. I will check all of them and track the tutors' progress in processing your requests over the weekend. Since we can update the marks until the middle of December, please be patient if your request has not been processed yet. I totally understand that every mark means a lot to you.
We already finished all the lectures and labs in T3. I hope that you've enjoyed the course so far. As I mentioned at the beginning of the term, this course is kind of an introduction to big data. I hope it can help you start learning more in-depth techniques in the future. It would be greatly appreciated if you could provide your feedback on the course in myExperience.
We will have one online consultation next week. If you have some questions regarding the final exam, please consider joining the consultation session. Wish you all the best in your current study and future career.
Regards,
Xin
1. Some students asked if it can be assumed that the elements are always integers. Initially, it is planned that the elements could also be of string type. However, I didn't notice that one tutor mentioned that you can assume that the elements are always integers. Considering the time left, I think I need to change the plan. You can simply convert the elements to integers by the int() function. Note that this makes the problem much easier.
2. The result of the large dataset has been released.
3. Google Dataproc is optional, but I strongly recommend you try on it for project 3, if you have obtained the credits from Google.
4. One more medium-sized dataset has been released for you to test your solution.
5. If you encounter the problem of “java.lang.OutOfMemoryError: Java heap space”, it means that your solution consumes too much memory, and you need to further optimize your method. You can also try to enlarge the driver and worker nodes' memory in the spark-submit command (see lecture slides).
Dear All,
The marks of project 1 have been released in Moodle. Please check your mark and the feedback. The solution will be released tomorrow.
If you feel that your submission is not marked correctly, please contact the tutor who marked your submission first. You can cc the email to me. We could update your mark since it is possible that the tutors made some mistakes during the marking.
Please note that this project aims to assess if you understand how to use MapReduce to solve big data problems efficiently. Correctness is not the only marking criterion. It is expected that you have two steps. The first step computes the probabilities using the order-inversion technique, and you need to use a special key in your solution. The second step utilizes the secondary sort technique to perform the sorting. You also need to implement a combiner (an in-mapper combiner is also OK) to improve the performance.
If you did not follow this approach, you may lose some marks, since your solution cannot be run on large datasets in a real distributed computing cluster (e.g., you use Python to sort the output). The 5 marks on the algorithm design are about these points.
If you have solved the lab problems or understand the lab solutions, it is not a difficult task to complete the project. It is great to see that many of you got full marks on the project.
Regards,
Xin
1. The deadline is midnight next Monday. The course admin has updated the info in Moodle already.
2. You can read the files from either HDFS or the local file system. Using the local files is more convenient, but you need to use the prefix "file:///...". Spark uses HDFS by default if the path does not have a prefix.
3. A test case has been provided. The previous one contains an error. Please download the new version to check the correctness.
4. Please make sure that your result has the same format as required in the project specification. For DataFrame, you can map the columns into a single string column and then save the data as a text file.
5. In the DataFrame solution, you can use pyspark.sql.functions.log10 to compute the term weights.
6. Please do not use numpy or pandas, since we aim to assess your understanding of the RDD/DataFrame APIs.
7. It is your own job to decide what is the best way to deal with the stop words.
8. You can use coalesce(1) to merge the data into a single partition and then save the data to disk.
9. In the DataFrame solution, please do not use the spark.sql() function to pass the SQL statement to Spark directly. The DataFrame APIs can be found here: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html . It is also allowed to use some Spark SQL functions to operate the columns: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html
10. It does not matter if you have a new line at the end of the output file or not. It will not affect the correctness of your solution.
Dear All,
From this week, I will provide another 1-hour offline consultation in my office, K17-201D.
The time is from 10 am to 11 am every Friday.
Please send me an email before you come to the offline consultation since I need to open the lab door for you.
I hope this offline consultation could provide more help to your study.
Regards,
Xin
As requested by some students, we will have a 1-hour consultation today, from 11:00 to 12:00.
Please join if you have questions regarding labs 5 and 6 and the project.
Dear All,
Week 6 is recess week, and we have no lab or lecture this week.
Lab 6 is released already, which aims to let you practice programming with Spark RDD APIs. Lab 7 will be released later this week as well, in which you will practice Spark DataFrame APIs.
Project 2 will be released 1 day later than expected. We are still preparing the test case for you to check the correctness. You will need to provide two solutions for the same problem using both RDD and DataFrame APIs. The deadline will be extended correspondingly, and you will still have two weeks to work on it.
Regards,
Xin
Project 1 will be due tomorrow. Please try to submit it on time. A few FAQs are as below:
1. You do not need to round the results. Use the provided test case to check the correctness of your solution.
2. You do not need to achieve a global order when using multiple reducers. With multiple reducers, you will have multiple output files in HDFS. You just need to guarantee the order within each file.
3. Remove the setting of "mapreduce.job.reduces" in JOBCONF, and so we can test your code with multiple reducers.
4. Some students told me that they can only obtain the correct results by moving "SORT_VALUES=True" out of the steps definition. I am not sure what caused the problem. You can do this if it can solve your problem.
5. Please check if you have the access to Moodle now. If not, please contact cse.teaching@unsw.edu.au as soon as possible.
6. To make JOBCONF work, you need to run your code on Hadoop. MRJob cannot sort and partition the mapper output when running locally.
7. When you meet an error, please always first check the logs in $HADOOP_HOME/logs/userlogs. Watch the lecture recording for more detailed steps.
Dear All,
It seems that some students are still struggling with installing and configuring Hadoop. Please contact me or your tutor to solve the problem asap.
I received several requests on extending the deadline. The new due date now is set to midnight of next Friday, and you will have five more days to work on the project.
The solution to lab 4 is now released. Please try to complete all the lab problems first, and then begin to work on the project.
One more test case prepared by the tutors is also given. Please contact us if you feel there is anything wrong with the test case.
Please make sure that your code can be run by the given command in the project specification. Please remove the setting of "mapreduce.job.reduces" in JOBCONF, and so we can test your code with multiple reducers. You just need to guarantee that the order within each reducer is correct.
Project 2 will be released after the deadline of the first project.
Regards,
Xin
Dear All,
Due to the new public holiday, we will have no lecture this Thursday.
This affects our lab a little bit. Please ignore the in-mapper combiner part in lab 2.
We will learn the MapReduce design patterns in the next week, and you can then come back to practice this part.
We will release the first project by the end of this week. Thus, it is important that you already have a working environment for MapReduce programming. If you still cannot install and configure Hadoop, please consult a tutor in this week's lab.
In case some newly enrolled students still do not have the access to Moodle, the link to tomorrow's lecture is given here: https://au.bbcollab.com/guest/4fcadc2ec8234ed0bf67fb4082261dfc
Regards,
Xin
T14A https://au.bbcollab.com/guest/ad4c766800ee48cd9fa915ac8a2ded67
T14B https://au.bbcollab.com/guest/0ce4ffd4e97c475b85c742ea52b306de
T16A https://au.bbcollab.com/guest/d9e7c98c67da4068803821d74d33eb66
T16B https://au.bbcollab.com/guest/3d431ecc2bfb4ee4b79dbfebfe7b9313
W09A https://au.bbcollab.com/guest/2fdaa7269929473f93b50093b923c589
W11A https://au.bbcollab.com/guest/d3f5a8bbb7ac4876aea125799215a181
W14A https://au.bbcollab.com/guest/a4a776d7c4114133a059b549f69fa958
W14B https://au.bbcollab.com/guest/f276523daf0946129917a529e6da3164
W16A https://au.bbcollab.com/guest/3f84bd7571bf433da9ff709de8fa2bee
Dear All,
Our first lab will start on Tuesday in Week 1. If your OS is Windows, you will need to use the virtual machine. The VM image can be downloaded at:
https://mega.nz/file/SqIz1Jpb#Ay5ioC4EkiQgZVuVYDUL6hfO2LiBvsJxjTXX1qAnxrg
https://drive.google.com/file/d/1ymUkS422jiNnEKU2witPb2fIL8wf6eME/view
Please download the image asap and let me know if there is any problem with downloading the image, especially the students offshore currently. Another option is that you can download the Ubuntu OS and VirtualBox from the official website, and then install Ubuntu 22.04 in VirtualBox by yourself.
Please watch out if you have enough memory in your host OS. You can decrease the memory allocated to the VM from 8G to 4G. Otherwise, you may encounter some errors.
For students in China now, please download Hadoop from this link: https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/common/hadoop-3.3.2/hadoop-3.3.2.tar.gz , which would be faster.
If your OS is Linux or Mac, you can install and configure Hadoop on your own laptop directly by following the lab instructions.
Regards,
Xin
Dear COMP9313 Students,
Welcome to COMP9313! Please see " Course Outline " in the left panel .
Due to COVID-19, the course will still be delivered online in T3. I hope that you will enjoy the course this term.
Our first lecture is on Tuesday in Week 1. Please log into Moodle to access the lecture link. In case you cannot find it, you can also access it through
https://au.bbcollab.com/guest/1453932904d745909916fcbae69a3a06
(the first) and
https://au.bbcollab.com/guest/318fc9e5ecf041d18894a13b7a4bd5a2
(the second).
Note that the labs start in Week 1 this term, which is different from previous offerings. You will need to set up the Hadoop working environment either using a virtual machine or on your own laptop in the first lab.
I look forward to seeing you all soon.
Regards,
Xin