Automated validation of Swordsmith Index data

This is more of a note to myself: I was doing some research and found a few links which may be useful in the future. When I started the Swordsmith Index project 4 years ago, I had to use automated tools to cleanse and verify smith records, everything Kanji-related in particular; otherwise I would have had to go through thousands of records manually. The results were very positive back in 2008-2009: my tools helped to identify numerous errors and gaps in the data (e.g. smith Kanji names), which have long since been fixed. I didn't apply a similar procedure to the signatures because the data was too rough, and I also needed access to a good Kanji database and a Japanese dictionary. But most of all, my research in this direction stopped because a lot of my input data was lost in a burglary: it was on the hard drive of my old, otherwise worthless laptop, which was taken.

Since then I have switched to other activities, with full verification of the whole Index being the most important one. As that is planned to be completed by the end of the year, I have started looking at automated tools again. Unfortunately I won't be able to publish any results for now as I'm just starting a new job, but I might have some time over Christmas. One of the studies analyses the distribution of different Kanji across provinces (it can be done across nengō as well) and shows (no surprises here) that some characters were preferred in particular provinces. Another study (which requires the tools mentioned below) is only in the planning stage and is meant to address different signature patterns, which may help with signature verification, automated extraction of geographical locations, and building some sort of comprehensive manual of signature reading.
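
To illustrate the first study, here is a minimal sketch of a per-province Kanji count in Python. It is not the actual tool used for the Index: the record layout (province, smith name), the function names and the four sample records are assumptions made up for the example, and only the basic CJK Unified Ideographs block is treated as Kanji.

```python
from collections import Counter, defaultdict


def is_kanji(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as Kanji;
    # extension blocks and variant forms are ignored in this sketch.
    return "\u4e00" <= ch <= "\u9fff"


def kanji_by_province(records):
    """Count how often each Kanji occurs in smith names, grouped by province.

    `records` is a hypothetical iterable of (province, smith_name) pairs.
    """
    counts = defaultdict(Counter)
    for province, name in records:
        counts[province].update(ch for ch in name if is_kanji(ch))
    return counts


# Illustrative sample only, not taken from the Index data.
sample_records = [
    ("Mino", "兼元"),
    ("Mino", "兼定"),
    ("Bizen", "祐定"),
    ("Bizen", "長光"),
]

distribution = kanji_by_province(sample_records)
for province, counter in sorted(distribution.items()):
    # Show the most frequent characters for each province.
    print(province, counter.most_common(3))
```

The same function would work for nengō if the first element of each pair were the era name instead of the province.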

The dictionaries I meant to use were:

KANJIDIC2 for Kanji
JMdict/EDICT for Japanese/English vocabulary
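
As a rough idea of how KANJIDIC2 could be used for validation, the sketch below reads character entries with Python's standard xml.etree.ElementTree module and flags characters in a smith name that have no entry. The XML fragment is hand-written to imitate the KANJIDIC2 layout (character, literal, reading_meaning, rmgroup, reading) and is not copied from the real file; the function names load_readings and unknown_characters are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the KANJIDIC2 layout (an assumption for
# this example); the real file is far larger and has many more fields.
SAMPLE_XML = """<kanjidic2>
  <character>
    <literal>兼</literal>
    <reading_meaning><rmgroup>
      <reading r_type="ja_on">ケン</reading>
      <reading r_type="ja_kun">か.ねる</reading>
    </rmgroup></reading_meaning>
  </character>
  <character>
    <literal>元</literal>
    <reading_meaning><rmgroup>
      <reading r_type="ja_on">ゲン</reading>
      <reading r_type="ja_kun">もと</reading>
    </rmgroup></reading_meaning>
  </character>
</kanjidic2>"""


def load_readings(xml_text):
    """Map each Kanji literal to the list of its on/kun readings."""
    root = ET.fromstring(xml_text)
    readings = {}
    for character in root.iter("character"):
        literal = character.findtext("literal")
        readings[literal] = [
            r.text
            for r in character.iter("reading")
            if r.get("r_type") in ("ja_on", "ja_kun")
        ]
    return readings


def unknown_characters(name, readings):
    """Return the characters of `name` that have no dictionary entry."""
    return [ch for ch in name if ch not in readings]


readings = load_readings(SAMPLE_XML)
print(readings["兼"])
print(unknown_characters("兼X", readings))
```

For the full dictionary file, ET.parse (or ET.iterparse to keep memory use down) on the downloaded XML would replace ET.fromstring.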

Libraries:

JBLite (part of J-Ben)
JBParse
JMdict-Parser
nihongo.py

I'll keep you informed of any further developments. Until then, I have 1750 records left to verify :-)

Regards,
Stan