Standardizing the collation format
After cleaning, the data is passed
through a flexible process (a series of Python scripts) that can carry
out one, all, or any combination of the treatments below.
-
Converting roman numeral volume numbers
The results of automating this update need to be checked carefully,
as scanning errors can cause problems. For example, volume iv is frequently
mis-scanned as lv and then translated automatically into volume 55. The
character ‘l’ may be intended as either the letter
l (roman numeral for 50) or the number 1, or it may be a scanning error
for ‘t.’ (tabula). We run this update on complete publication
datasets, which then undergo checking, as blind conversions would undoubtedly
add to the errors and inconsistencies already present
in the data.
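A minimal sketch of how such a conversion might flag doubtful values for checking rather than converting blindly. The function names and the rule of flagging every numeral that contains ‘l’ are assumptions made for illustration, not a description of the actual scripts.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
# Strict pattern for well-formed numerals from 1 to 3999.
VALID_ROMAN = re.compile(
    r"^m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})$"
)


def roman_to_int(text):
    """Return the integer value of a well-formed roman numeral, else None."""
    numeral = text.strip().rstrip(".").lower()
    if not numeral or not VALID_ROMAN.match(numeral):
        return None
    total = 0
    for char, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[char]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if following != " " and ROMAN_VALUES[following] > value:
            total -= value
        else:
            total += value
    return total


def convert_volume(text):
    """Convert a volume numeral; flag it for review if invalid or containing 'l'."""
    value = roman_to_int(text)
    # Assumption: any 'l' is suspect, since it may stand for l, 1 or a misread 't.'.
    needs_review = value is None or "l" in text.lower()
    return value, needs_review
```

A mis-scanned ‘lv’ is still a well-formed numeral, so it converts without error; only the review flag marks it as a value a person should look at.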
-
Converting roman numeral part numbers and formatting correctly
In ‘xiii. II. 381’, for example, the part number (II.) has
to be distinguished from the volume number (xiii.) and the page number (381).
Often the year of publication is present in the collation field too.
The original punctuation has to be removed and the whole string reformatted
as ‘13(2): 381’.
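The reformatting of this one example can be sketched as follows. The regular expression and the function names are hypothetical, and the sketch handles only the three-part ‘volume. part. page’ shape; it does not validate the numerals or deal with a year in the same field.

```python
import re

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# Hypothetical pattern: "<volume>. <part>. <page>", volume and part in roman numerals.
COLLATION = re.compile(
    r"^\s*([ivxlcdm]+)\.\s*([ivxlcdm]+)\.\s*(\d+)\s*$", re.IGNORECASE
)


def roman_to_int(numeral):
    """Sum a roman numeral, subtracting any value that precedes a larger one."""
    values = [ROMAN_VALUES[char] for char in numeral.lower()]
    total = 0
    for index, value in enumerate(values):
        if index + 1 < len(values) and values[index + 1] > value:
            total -= value
        else:
            total += value
    return total


def reformat_collation(collation):
    """Return 'volume(part): page', or None when the string does not fit the pattern."""
    match = COLLATION.match(collation)
    if match is None:
        return None
    volume, part, page = match.groups()
    return f"{roman_to_int(volume)}({roman_to_int(part)}): {page}"
```

Returning None for anything that does not fit leaves non-standard collations untouched, so they can be handled separately.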
-
Moving date from collation field to year field
Publication date is identified in the collation and removed to the
publication year field, provided it is not acting as volume number.
-
Copying publication date from collation field to year field
Where the publication year is acting as the volume number, it is not
moved out of the collation; it is copied into the publication year field
only if the actual year of publication is the same.
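The move and copy rules can be sketched together as below. Several things here are assumptions for illustration: the range of plausible years, the test that a year at the very start of the collation is acting as the volume number, and the `known_year` argument standing in for whatever source gives the actual year of publication.

```python
import re

# Assumption: a plausible publication year is a four-digit number from 1700 to 2099.
YEAR = re.compile(r"\b(1[7-9]\d\d|20\d\d)\b")


def extract_year(collation, known_year=None):
    """Return (new_collation, year) following the move and copy rules.

    known_year is a hypothetical argument holding the actual publication year
    when it is known from another source.
    """
    text = collation.strip()
    match = YEAR.search(text)
    if match is None:
        return text, None
    year = match.group(1)
    # Assumption: a year at the very start of the collation acts as the volume number.
    if match.start() == 0:
        # Copy only when the actual publication year is known to be the same.
        if known_year is not None and str(known_year) == year:
            return text, year
        return text, None
    # Otherwise move the year out of the collation and tidy leftover punctuation.
    remainder = text[:match.start()] + text[match.end():]
    remainder = re.sub(r"\(\s*\)", "", remainder)
    remainder = re.sub(r"\s{2,}", " ", remainder).strip(" ,;.")
    return remainder, year
```

A four-digit page number would also match the year pattern, so output from a rule like this would still need checking.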
-
Reordering the collation
Reordering may be necessary if the collation is non-standard in its format.
The collations of the legacy data in IPNI vary widely in their
construction, and there is no such thing as a standard format. Each dataset
we have dealt with has had peculiarities of its own (e.g. ‘Lit.’
in Linnaea, ‘Anhang’ in Repert. Spec. Nov. Regni Veg. Beih.).
There are also many records that include remarks such as ‘in syn.’
in the collation field, or that contain a double reference citation separated
by a semicolon. The latter all need to be checked carefully to determine
which is the valid reference.
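Records of these two kinds could be picked out for manual checking with something like the sketch below. The function name and the list of remark patterns are hypothetical; ‘in syn.’ is the only remark named here, so it is the only one included.

```python
import re

# Hypothetical list of remark markers; 'in syn.' is the only one named in the text.
REMARK_PATTERNS = [re.compile(r"\bin\s+syn\.", re.IGNORECASE)]


def review_reasons(collation):
    """Return the reasons, if any, why a collation needs manual checking."""
    reasons = []
    if any(pattern.search(collation) for pattern in REMARK_PATTERNS):
        reasons.append("remark")
    # A semicolon separates the two halves of a double reference citation.
    if ";" in collation:
        reasons.append("double citation")
    return reasons
```

The sketch only reports which records to look at; deciding which half of a double citation is the valid reference is left to the person checking.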
See how this work
is progressing.