Exercises NLS II, Dr.-Ing. Michael Piotrowski, M.A.

Cleaning


Is it necessary or advisable to “clean” your data? If so, use text processing tools (e.g., Perl, Python, sed, or awk) to do this. (The script sed.strip used for the Austen corpus may serve as inspiration.) Reflect on what information you will lose by cleaning, and discuss whether and under what conditions this loss is tolerable.
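As a starting point, one possible cleaning step could look like the following Python sketch. It is not a reimplementation of sed.strip, whose contents are not shown here; the function names clean_line and clean_text and the particular choices (lowercasing, replacing punctuation with spaces, dropping empty lines) are assumptions made for illustration only.

```python
import re


def clean_line(line: str) -> str:
    """Normalize one line of text (assumed cleaning policy, not sed.strip)."""
    line = line.lower()                    # loses case (e.g. proper nouns, sentence starts)
    line = re.sub(r"[^\w\s']", " ", line)  # loses punctuation, keeps apostrophes
    line = line.replace("_", " ")          # \w matches "_", so remove it separately
    line = re.sub(r"\s+", " ", line)       # squeeze runs of whitespace
    return line.strip()


def clean_text(text: str) -> str:
    """Clean every line and drop lines that end up empty."""
    cleaned = (clean_line(line) for line in text.splitlines())
    return "\n".join(line for line in cleaned if line)


if __name__ == "__main__":
    print(clean_line('"Oh!  Mr. Darcy," said SHE.'))
```

Each commented step discards something: case distinctions, punctuation (and with it sentence boundaries and abbreviations such as “Mr.”), and the original layout. Whether that is acceptable depends on what the data will be used for.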

Create a vocabulary from the cleaned training data.
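A vocabulary can be built by counting whitespace-separated tokens. The following is a minimal sketch, assuming the cleaned training data is available as a single string; the names build_vocabulary and sorted_vocabulary and the min_count parameter are hypothetical and not part of the exercise.

```python
from collections import Counter


def build_vocabulary(cleaned_text: str) -> Counter:
    """Count every whitespace-separated token in the cleaned text."""
    return Counter(cleaned_text.split())


def sorted_vocabulary(vocab: Counter, min_count: int = 1) -> list:
    """Return (word, count) pairs, most frequent first, ties in alphabetical order.

    Words occurring fewer than min_count times are left out (assumed option).
    """
    kept = [(word, count) for word, count in vocab.items() if count >= min_count]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    vocab = build_vocabulary("the cat saw the dog\nthe dog ran")
    for word, count in sorted_vocabulary(vocab):
        print(f"{word}\t{count}")
```

Sorting by frequency makes it easy to inspect the most common entries and to spot tokens that the cleaning step should have handled differently.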


