Mostly generated by a script from Wikipedia data (only the typical positive ratio is slightly modified). This is a first test before adding my generation script to the main tree.
Models are language-specific (there can be several models for the same charset, each for a different language). Let's use a clear naming scheme.