Don't forget to fill out the SES survey
There should be a link on your course's Moodle page
1. I saw some requests for more test cases in the forum, so one more test dataset is provided. If you use Java, a run script is also provided; it packages your Java files and runs your jar file on Hadoop (please change the number of documents and the number of reducers accordingly).
2. Please use the pre-installed Hadoop at ~/hadoop. Lab 1 only aimed to show you how to install and configure Hadoop. Please delete the ~/workdir folder after you complete Lab 1, and remove the corresponding configurations from ~/.bashrc.
3. Before you run mrjob code on Hadoop, please start both HDFS and YARN. Please check that you have configured YARN correctly by following the instructions in Lab 1, which involve two files: $HADOOP_CONF_DIR/mapred-site.xml and $HADOOP_CONF_DIR/yarn-site.xml.
4. The "\t" means the tab character, not the two-character string "\t". Because a tab may be displayed as wide as 4 or 8 spaces, the same text may look different in the editor and in the terminal.
5. Please try your best to debug your code before asking questions. You can first test your code locally, and then run it on Hadoop. Note that it is quite possible for your code to generate correct results locally but fail on Hadoop. When that happens, the cause is most likely how the keys are partitioned across reducers. In mrjob, you can first test your mapper and then test your reducer. To test the mapper, you can write a simple reducer that writes the key-value pairs it receives directly to the output. By doing so, you will be able to see whether your mapper sends the key-value pairs to the reducers as expected. Once the mapper is OK, you can proceed to test your reducer.
6. Variables defined in mapper_init/reducer_init and variables defined in mapper/reducer have different scopes. A variable defined in mapper/reducer can only be seen within that single function call for the current input. A variable defined in mapper_init/reducer_init (for example, as an attribute on self) can be seen by all mapper/reducer function calls within the same mapper/reducer task.
7. It is strongly recommended that you complete the two problems in Lab 4 first, and then work on the project. Otherwise, you are likely to run into many problems while working on the project.
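To illustrate the scope difference described in point 6, here is a minimal sketch. It is plain Python and does not import mrjob: the class WordCountSketch only borrows the mrjob method names, and run_mapper_task is a hypothetical driver that imitates the order in which the methods are called within one mapper task (mapper_init once, mapper once per input line, mapper_final once at the end).

```python
class WordCountSketch:
    """Plain-Python stand-in for an MRJob subclass (assumption: no mrjob here)."""

    def mapper_init(self):
        # Called once per mapper task. State stored on self survives
        # across every mapper() call made by this task.
        self.lines_seen = 0

    def mapper(self, _, line):
        # 'words' is a local variable: it is re-created on every call
        # and is gone as soon as the call for this input line finishes.
        words = line.split()
        self.lines_seen += 1
        for word in words:
            yield word, 1

    def mapper_final(self):
        # Called once after the last input line of this task.
        yield "__lines__", self.lines_seen


def run_mapper_task(job, lines):
    """Hypothetical driver: mimics the call order within one mapper task."""
    job.mapper_init()
    output = []
    for line in lines:
        output.extend(job.mapper(None, line))
    output.extend(job.mapper_final())
    return output


pairs = run_mapper_task(WordCountSketch(), ["a b", "b c"])
print(pairs)
```

The counter kept on self reaches 2 because it is shared by both mapper calls, whereas words never carries over from one call to the next. On Hadoop each mapper task has its own copy of such state, so it is not shared between different tasks.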
You still have one weekend plus one week to work on the first project.
1. Lab 4 has been released. It will help you write code for Project 1, especially on how to use the partitioner and comparator classes in mrjob (if you use Java, the lab gives you some practice in defining a custom partitioner and defining an order for your keys).
If you do not know how to start on the project yet, please first complete the problems in Lab 4, and then you will have a better idea of how to solve the project problem.
2. You are allowed to pass the number of documents as an argument in the Python version of the project. To make it fair, if you use Java, you are also allowed to do so. I have updated the project description for the Java version. Please download the new document.
3. I made a mistake on slide 21 of Chapter 3.1 about how to use the partitioner class in mrjob. I have updated that slide, so please download the new version as well.
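As a rough illustration of why the partitioner matters once there is more than one reducer, the sketch below simulates partitioning in plain Python. It is not the mrjob or Hadoop API: the composite "term,docid" key format, the partition function, and the use of zlib.crc32 as the hash are all assumptions made for this example only.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers, first_field_only):
    # Hypothetical stand-in for a partitioner: hash either the whole
    # composite key or only its first field, then pick a reducer.
    part = key.split(",")[0] if first_field_only else key
    return zlib.crc32(part.encode("utf-8")) % num_reducers


def reducers_per_term(keys, num_reducers, first_field_only):
    # For each term, count how many different reducers receive its records.
    seen = defaultdict(set)
    for key in keys:
        term = key.split(",")[0]
        seen[term].add(partition(key, num_reducers, first_field_only))
    return {term: len(reducers) for term, reducers in seen.items()}


# Assumed composite keys of the form "term,docid".
keys = [f"{term},{doc}" for term in ("apple", "banana") for doc in range(20)]

whole_key = reducers_per_term(keys, 3, first_field_only=False)
first_field = reducers_per_term(keys, 3, first_field_only=True)
print("hash on whole key:  ", whole_key)
print("hash on first field:", first_field)
```

When only the first field is hashed, every record of a term lands on exactly one reducer. When the whole key is hashed, the records of one term will typically be spread over several reducers, so a reducer that assumes it sees all records for a term can produce wrong results. With a single reducer, which is effectively the situation in a local test, the problem cannot show up.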
Because the previous consultation time clashes with some students' lab sessions, the consultation time will change from 4pm-5pm on Thursday to 12pm-1pm on Tuesday (after the lecture), starting next week.
The project will be released next Monday, and you will have two weeks to work on it. I suggest you first complete the lab programming activities and then work on the project.
Please always remember to download a new version of the lecture slides after each lecture, since I may make some edits (such as correcting typos or adding a few new slides).