Cleaning
Is it necessary or advisable to “clean” your data? Use text processing tools (e.g., Perl, Python, sed, or awk) to do so. (The script
sed.strip used for the Austen corpus may serve as inspiration.) Reflect on what information you will lose in cleaning, and discuss whether and under what conditions this loss is tolerable.
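
As an illustration only, one possible cleaning step could look like the Python sketch below. It is not a reimplementation of sed.strip, whose contents are not shown here; the function name clean_line and the specific choices (lowercasing, replacing punctuation with spaces, collapsing whitespace) are assumptions made for the example.

```python
import re

def clean_line(line: str) -> str:
    """Hypothetical cleaning step: lowercase, drop punctuation, collapse whitespace."""
    # Lowercasing merges "The" and "the", but also "Bath" (the town) and "bath".
    line = line.lower()
    # Replace every character that is not an ASCII letter, digit, apostrophe,
    # or whitespace with a space. This also removes accented letters and
    # typographic quotes, which may or may not be acceptable for your corpus.
    line = re.sub(r"[^a-z0-9'\s]", " ", line)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(line.split())

if __name__ == "__main__":
    print(clean_line('Mr. Darcy -- said,  "No!"'))
```

Note what this sketch throws away: case, sentence boundaries, quotation marks, and any distinction carried by punctuation. Whether that is tolerable depends on what the model is later used for.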
Create a vocabulary from the cleaned training data.
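
A minimal sketch of building such a vocabulary in Python is given below. It assumes the cleaned training data is available as an iterable of whitespace-tokenized lines; the function name build_vocabulary and the min_count parameter are hypothetical and not part of the assignment.

```python
from collections import Counter

def build_vocabulary(lines, min_count=1):
    """Count whitespace-separated tokens and keep those seen at least min_count times."""
    counts = Counter()
    for line in lines:
        # str.split() with no arguments splits on any run of whitespace.
        counts.update(line.split())
    # Return a plain dict mapping each kept word to its frequency.
    return {word: n for word, n in counts.items() if n >= min_count}

if __name__ == "__main__":
    cleaned = ["the cat sat", "the dog"]
    vocab = build_vocabulary(cleaned)
    # Print words from most to least frequent.
    for word, n in sorted(vocab.items(), key=lambda item: -item[1]):
        print(word, n)
```

Raising min_count drops rare words from the vocabulary, which is one common way to decide which tokens are later treated as unknown.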