INTERNET AUTOPSY DATABASE IADB BUILD SNOMED TRANSLATIONS
FILES REQUIRED: IADBROU.RCS, IADBSNOM.TXT, LHENNILX.TXT,
IADBNIFN.TXT, IADBNISF.TXT.
EXECUTION SEQUENCE: IADBSNOV, IADBBASV, IADBCOPY,
IADBSYNV, IADBOVLV, IADBOUTV.
IADBROU.RCS: INTERNET AUTOPSY DATABASE MUMPS ROUTINES.
IADBSNOM.TXT: PRIMARY SNOMED CODES AND DESCRIPTIONS.
IADBSNOV.TXT: PRIMARY SNOMED VOCABULARY.
IADBBASV.TXT: BASELINE ENGLISH VOCABULARY, CONTAINING SNOV.
LHENNILX.TXT: JAPANESE OVERLAY VOCABULARY.
IADBNIFN.TXT: JAPANESE KATAKANA FONT.
IADBNISF.TXT: ENGLISH-JAPANESE SUFFIX TRANSLATIONS.
THE FOLLOWING VOCABULARIES ARE CONSTRUCTED:
SNOV: WORDS PRESENT IN PRIMARY SNOMED CODES USED IN IADB.
BASV: BASELINE ENGLISH VOCABULARY, MUST CONTAIN SNOV.
SYNV: SYNTHETIC VOCABULARY, BUILT FROM SUFFIX FILES.
OVLV: OVERLAY VOCABULARY, FROM OTHER RANDOM SOURCES.
OUTV: FOREIGN OUTPUT VOCABULARY, TRANSLATED FROM IADBSNOM.TXT.
PRNV: PROPER NAME VOCABULARY.
The INTERNET AUTOPSY DATABASE (IADB) is a collection
of over 49,000 autopsy facesheets (one-page summaries),
translated into several languages,
and offered in the public domain on the Internet.
In the 1998 IAD version, there are 9,235 primary
SNOMED descriptions, which reduce to 5,892 distinct words.
At approximately 14 keystrokes per foreign word
(not including the original English source word),
these 5,892 words require roughly 80,000 keystrokes.
An average secretary can type 8,000 keystrokes per hour.
Therefore, the key-entry component ALONE of adding a new language
to the IAD is approximately 10 hours. As a further shortcut,
many foreign words can be constructed algorithmically,
as for example: ischemia ==> Ischaemie (German).
There are three types of vocabulary for each foreign language:
common, idiosyncratic vocabulary (articles, prepositions, pronouns,
common verbs and modifiers);
common medical vocabulary (body parts, signs, symptoms);
and synthetic vocabulary, constructed from suffixes and prefixes.
The IAD translator first constructs a synthetic vocabulary,
then overlays this vocabulary with idiosyncratic and medical
vocabularies, as well as corrections for errors in the
synthetic vocabulary. The routines are called as follows:
IADBSNOM.TXT is the file of SNOMED descriptions
which actually appear in the IAD autopsy facesheets.
IADBSNOV collects the SNOMED vocabulary from file IADBSNOM.TXT.
IADBBASV collects a larger English vocabulary
from various sources. The main purpose of IADBBASV
is to create and refine file IADBNISF.TXT.
IADBCOPY copies SNOV vocabulary into BASV.
IADBSYNV creates synthetic translations
from the suffix file, IADBNISF.TXT.
IADBOVLV overlays ordinary vocabulary
and corrections of synthetic translations
into the final vocabulary.
IADBOUTV downloads the combined synthetic and overlay
vocabularies into a file formatted suitably for publication
on the Internet.
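The first steps of this sequence (collecting SNOV, then copying it into BASV) can be sketched as follows. This is a minimal Python illustration, not the MUMPS routines themselves. It assumes that each SNOMED description is one line of text with the code already removed; the real layout of IADBSNOM.TXT is not described here, and the function names and sample lines are hypothetical.

```python
import re

def collect_vocabulary(descriptions):
    """Return the sorted list of distinct lower-case words in the descriptions."""
    words = set()
    for line in descriptions:
        # Keep runs of plain Roman letters; digits and punctuation separate words.
        words.update(re.findall(r"[a-z]+", line.lower()))
    return sorted(words)

def copy_into(baseline, snov):
    """Merge SNOV into the baseline vocabulary, so that BASV contains SNOV."""
    return sorted(set(baseline) | set(snov))

# Hypothetical sample lines standing in for IADBSNOM.TXT.
sample = ["Acute myocardial infarction", "Infarction of kidney", "Acute pneumonia"]
snov = collect_vocabulary(sample)
basv = copy_into(["heart", "kidney", "lung"], snov)

print(snov)
print(set(snov) <= set(basv))
```

Repeated words such as "acute" and "infarction" are counted once, which is the reduction from descriptions to distinct words described above.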
TRANSOFT FOREIGN LANGUAGE TRANSLATIONS.
INTRODUCTION.
TRANSOFT is a software system used by the
Internet Autopsy Database (IAD) to create a
word-for-word translation (technically,
a GLOSSARY) for all the autopsy facesheets
(i.e., autopsy summaries) in the IAD.
Literature references are posted on the IAD website.
TRANSOFT translations are word-for-word,
non-grammatical and non-context-sensitive.
Grammatical translations are projected for
future editions of the IAD. The present
translations are intended solely as an aid
to non-English speakers using the IAD,
and as a step toward internationalizing
the contributorship and readership of the IAD,
in the same spirit as the translations
of the Systematized Nomenclature of Human
and Veterinary Medicine (SNOMED International),
currently in progress under the sponsorship of
the College of American Pathologists.
FUTURE OF TRANSOFT IN THE IAD
Many of the proposed features of the IAD
are not yet implemented, but will be put in
place gradually, as time and resources permit.
The purpose of the following discussion
is to plan a system of sufficient
generality and scope that improvements
can be incorporated seamlessly
into the existing IAD structure.
The principal tasks of constructing
a TRANSOFT glossary include: Romanization,
sorting, and vocabulary building.
CONTRIBUTIONS TO THE IAD GLOSSARY.
All additions to the IAD glossary from
international contributors are welcome.
The materials should be posted on a World Wide Web
site which clearly indicates that
the materials are cost-free and may be distributed
worldwide without restriction. At the present time,
there are two sources on the World Wide Web
which have substantially shaped the contents
and philosophy of the IAD glossary. They are:
(1) The SYSTRAN/ALTAVISTA cost-free translator
for immediate translations of short texts
between English and a choice of German, French,
Spanish, Portuguese, and Italian, at:
http://www.altavista.digital.com/
(2) The European Community medical glossary,
containing approximately 2,000 medical terms
in English (EN), German (DE), French (FR),
Spanish (ES), Portuguese (PO), Italian (IT),
Dutch (NE), and Danish (DA), at:
http://www...eugloss/=rivest/EN
FOREIGN LANGUAGES IN THE IAD.
Foreign languages supported by the IAD
are designated by the first two letters of the
name of the language IN THAT LANGUAGE.
Thus, for example, ENglish, FRancais (=French),
DEutsch (=German), etc. To avoid confusion
with Portuguese, the Polish language is
designated as PLska.
Foreign languages that do not use the Roman alphabet
are designated by their Roman equivalents, according to
Romanizations described below. Thus,
for example, ELlenikos (=Greek), ROsskiy (=Russian).
The language designations are:
DEutsch ESpanol FRancais ITaliano NOrsk NEderlands
ELlenikos SUomi SVensk DAnsk POrtugues PLska
TUrkce MAgyar ROsskiy UKrainskiy ZHong NIhon
TRANSOFT ROMANIZATIONS.
In TRANSOFT, all languages are required
to have an unambiguous mapping into
the plain Roman alphabet. On the other hand,
insofar as possible, the output forms
of foreign languages on the IAD
should most closely resemble the
manner in which these languages
are displayed on the Internet
by native-language Internet contributors,
i.e., complete with diacritical marks
and non-Roman alphabets. For most foreign
languages that are now heavily used
on the Internet, there is an emerging
de facto standard for display.
THE PLAIN ROMAN ALPHABET.
The plain Roman alphabet is the
twenty-six letters of the Roman
alphabet as used in English,
without diacritical marks,
and sorted in the traditional sequence
for English, i.e., a,b,c,d,....
As described below, the plain Roman alphabet
is augmented with the forward slash (ASCII 47),
to indicate various diacritical marks, because
the forward slash is easy to reach
on a standard keyboard and is NOT
an escape character in the UNIX shell, Perl,
JavaScript, or Java, which are
commonly used on the Internet.
REASONS FOR REQUIRING THE PLAIN ROMAN ALPHABET.
There are four major reasons
for the plain Roman alphabet requirement:
(1) The majority of inexpensive, easy-to-use
word processor and sorting software systems
(including XyWrite, Aurora, and MUMPS,
which are used by TRANSOFT) are available
for the plain Roman alphabet.
(2) There is a simple, universal standard
for sorting (collating, alphabetizing)
all words written in the plain Roman alphabet,
sorting by one letter at a time, so that
these words can be located reliably
in a list by humans or by computer software.
Interestingly, even the Roman alphabet
augmented by diacritical marks cannot necessarily
be sorted one letter at a time. For example,
there is NO one-letter-at-a-time collating
sequence which can reproduce the official collating
sequence specified by the Academie Francaise
for the French language.
(3) UNIX, the operating system
of the Internet, uses the Roman alphabet.
Electronic mail and World Wide Web pages
(written in HTML) often require the plain
Roman alphabet. Even 8-bit characters
in the augmented Roman alphabet
can result in transmission errors and
security breaches.
(4) Touch-typing, which is necessary
for rapid creation and updating of dictionaries,
is most efficiently done in the plain Roman alphabet.
TRANSOFT COLLATING SEQUENCES.
English has twenty-six letters
with no diacritical marks, and a
standard sequence for these letters,
namely, a, b, c, .... The absence
of diacritical marks is a requirement
for a dictionary to be sorted
one letter at a time. This means that
all words in the dictionary are
primarily sorted by the first letter
in each word. Then within each
first-letter group, all words
are secondarily sorted by the second
letter in each word. Then within
each first-and-second-letter group,
all words are tertiarily sorted
by the third letter in each word, etc.
This simple sorting method is not
possible in languages with
diacritical marks, such as German,
French, or Spanish. The ability
to rapidly sort a list of words
is critical in efficient
dictionary management.
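The one-letter-at-a-time sort described above can be sketched in Python. This is an illustration only; the function name and the sample words are hypothetical. Each word is reduced to the list of its letter positions in the alphabet, and the lists are compared position by position. The second call shows what goes wrong when a letter with a diacritical mark is sorted by its raw character code: it lands after z, whereas a German dictionary files it next to the unmarked letter.

```python
PLAIN_ROMAN = "abcdefghijklmnopqrstuvwxyz"

def letter_at_a_time_key(word, alphabet=PLAIN_ROMAN):
    """Sort key: the position of each letter in the alphabet, in order."""
    rank = {letter: position for position, letter in enumerate(alphabet)}
    return [rank[letter] for letter in word]

plain_words = ["zebra", "apple", "apply", "ant"]
plain_sorted = sorted(plain_words, key=letter_at_a_time_key)
print(plain_sorted)

# Sorting by raw character code puts a-umlaut (U+00E4) after z.
naive_sorted = sorted(["zebra", "\u00e4pfel", "apfel"])
print(naive_sorted)
```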
There is an interesting
paradox, illustrated by the
a-umlaut character as it is
understood in German (a-umlaut
or a-diaeresis) versus in Swedish
(a-tremak). The characters appear
exactly the same on a computer monitor
or on the printed page, but in German
the umlaut is regarded as a detachable
element from the base letter a,
i.e., as a diacritical mark,
whereas in Swedish, a-tremak is
a unified concept, in which the
tremak is no more detachable from
the a than, say, the horizontal stroke
is detachable from t. The a-tremak
letter appears after z in a Swedish
dictionary. As a result, a German
dictionary CANNOT be collated by a single-letter
sort, whereas a Swedish dictionary CAN.
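The contrast can be sketched in Python. The Swedish key below is a single-letter sort over an alphabet of 29 letters, with the three extra letters placed after z. The German key is an assumption made for illustration (one common dictionary convention, not necessarily the rule TRANSOFT follows): it compares the words with the marks removed first, and uses the marked spelling only to break ties, so it cannot be expressed as a single pass over one alphabet.

```python
import unicodedata

# a-z followed by a-ring, a-diaeresis, o-diaeresis.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_key(word):
    """Single-letter sort key over the 29-letter Swedish alphabet."""
    rank = {letter: position for position, letter in enumerate(SWEDISH_ALPHABET)}
    return [rank[letter] for letter in unicodedata.normalize("NFC", word)]

def strip_marks(word):
    """Remove combining marks, leaving the base letters."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def german_key(word):
    """Two-level key: base letters first, marked spelling as tie-breaker."""
    return (strip_marks(word), word)

swedish_sorted = sorted(["\u00f6l", "zebra", "\u00e4pple", "apa", "\u00e5r"], key=swedish_key)
german_sorted = sorted(["zebra", "\u00e4pfel", "apfel"], key=german_key)
print(swedish_sorted)
print(german_sorted)
```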
CLASSES OF ROMANIZATIONS.
All modern, written non-Roman-alphabet languages
in widespread use already have Romanizations, but
these Romanizations are often internally
inconsistent or difficult to use.
The four classes of writing systems
for Romanization are:
(1) English, which has twenty-six letters,
with no diacritical marks,
and a standard sequence for these letters,
namely, a, b, c, ....
(2) Augmented Roman alphabets, i.e.,
Roman alphabets with diacritical marks,
such as accents, umlauts, cedillas, etc.,
including German, French,
Portuguese, Polish, Turkish, Vietnamese.
(3) Short non-Roman alphabets, such as
Greek, Cyrillic, Hebrew, Arabic, Armenian,
Devanagari, Amharic, Thai, Myanmar, etc.
(4) Chinese/Japanese/Korean (traditional),
which require thousands of characters,
or ideographs, for appropriate display.
TRANSOFT has a philosophy and methods for managing,
sorting, and displaying all these classes of
writing systems.
AUGMENTED ROMAN ALPHABETS.
All the augmented Roman alphabets
are represented in TRANSOFT as the
plain Roman alphabet, augmented by
the forward slash (ASCII 47).
The forward slash is employed
because it is easy to reach on a
standard computer keyboard and is NOT
an escape character in the UNIX shell, Perl,
JavaScript, or Java, which are
commonly used on the Internet. In foreign languages
in which each plain Roman
letter has at most one diacritical
mark, this letter is represented
in TRANSOFT by a single forward slash.
For example, in Spanish:
a/=a-acute, e/=e-acute, n/=n-tilde, etc.
In Italian:
a/=a-grave, e/=e-grave, etc.
In Turkish:
c/=c-cedilla, g/=g-breve, i/=dotless-i, etc.
However, in a language such as French,
for example, the letter e has three distinct
diacritical marks (ignoring e-diaeresis).
These are represented in TRANSOFT as:
e/=e-acute, e//=e-grave, e///=e-circumflex.
Similarly, in Polish:
z/=z-acute, z//=z-dot.
Diacritical marks within the Roman font are represented as:
German: ae/ oe/ ue/ Ae/ Oe/ Ue/ ss/
French: a/ e/ (acute); a// e// (grave)
French: a/// e/// (circumflex); c/ (cedilla)
Spanish: a/ e/ i/ o/ u/ (acute)
Portuguese: a/ e/ (acute) a// e// (tilde) c/ (cedilla)
Italian: a/ e/ i/ o/ u/ (grave)
Latin: a/ e/ i/ o/ u/ (macron)
Turkish: c/ i/ g/ o/ s/ u/
Finnish: ae/ oe/
Swedish: aa/ ae/ oe/
Danish: aa/ ae/ oe/
Norwegian: aa/ ae/ oe/
Polish: a/ c/ e/ l/ n/ o/ s/ z/ (z-acute) z// (z-dot)
Hungarian: a/ e/ i/ o/ o// (diaeresis) o/// (double acute) u/ u// u///
Greek: h=eta j=theta x=xi u=upsilon c=chi y=psi w=omega
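The slash notation can be converted back to display characters with a small per-language table. The following Python sketch is illustrative only: the tables cover just a few of the entries listed above, the function name is hypothetical, and the sample words are examples chosen for this sketch. Longer tokens are tried first, so that e/// is read as one unit and not as e/ followed by two stray slashes.

```python
import re

# Partial tables, copied from the list above (language code -> token -> letter).
TABLES = {
    "ES": {"a/": "á", "e/": "é", "i/": "í", "o/": "ó", "u/": "ú", "n/": "ñ"},
    "FR": {"e/": "é", "e//": "è", "e///": "ê", "c/": "ç"},
    "DE": {"ae/": "ä", "oe/": "ö", "ue/": "ü", "ss/": "ß"},
}

def decode(text, language):
    """Replace TRANSOFT slash tokens by the display letters of one language."""
    table = TABLES[language]
    # Longest tokens first, so that "e///" wins over "e//" and "e/".
    tokens = sorted(table, key=len, reverse=True)
    pattern = "|".join(re.escape(token) for token in tokens)
    return re.sub(pattern, lambda match: table[match.group(0)], text)

print(decode("espan/ol", "ES"))
print(decode("fene///tre", "FR"))
print(decode("Hae/matologie", "DE"))
```

The same token can stand for different marks in different languages (a/ is acute in Spanish but grave in Italian), which is why the table is selected by language code before decoding.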
TRANSOFT MAPS FOR SHORT NON-ROMAN ALPHABETS.
Short non-Roman alphabets often have
reasonably intuitive mappings
into the Roman alphabet.
TRANSOFT Single-letter Romanization for Greek:
Greek: a=alpha b=beta g=gamma d=delta h=eta
Greek: z=zeta j=theta i=iota k=kappa l=lambda
Greek: m=mu n=nu o=omicron p=pi r=rho s=sigma
Greek: t=tau u=upsilon x=xi c=chi f=phi
Greek: y=psi w=omega
Mnemonics for the non-intuitive
features for this Romanization are:
Upper-case eta is H.
Lower-case psi resembles y; both at the end of the alphabet.
Lower-case omega resembles w; both at the end of the alphabet.
J is in approximately the same position in the alphabet as theta.
C is the first letter of chi.
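Because each Greek letter maps to exactly one Roman letter, the Romanization can be applied and reversed with a plain character-for-character translation. The Python sketch below is a minimal illustration covering only the lower-case letters listed in the table above; the function names and sample words are hypothetical, and accents, capitals, and final sigma are not handled.

```python
# Lower-case letters in the order of the table above, with their Roman equivalents.
GREEK = "αβγδηζθικλμνοπρστυξχφψω"
ROMAN = "abgdhzjiklmnoprstuxcfyw"

TO_ROMAN = str.maketrans(GREEK, ROMAN)
TO_GREEK = str.maketrans(ROMAN, GREEK)

def to_roman(word):
    """Romanize a lower-case Greek word, one letter at a time."""
    return word.translate(TO_ROMAN)

def to_greek(word):
    """Restore the Greek spelling from the single-letter Romanization."""
    return word.translate(TO_GREEK)

print(to_roman("καρδια"))
print(to_roman("ψυχη"))
print(to_greek("jwrax"))
```

Since the mapping is one-to-one, a word list sorted in its Romanized form can always be converted back to Greek for display without loss.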
TRANSOFT Single-letter Romanization for Hebrew:
Hebrew: a=aleph b=beth g=gimel d=daleth h=hey
Hebrew: v=vav z=zayin c=chet j=tet y=yod k=kaf
Hebrew: l=lameth m=mem n=nun s=samech u=ayin
Hebrew: p=pey f=tsade q=qof r=resh x=shin w=tav
Mnemonics for the non-intuitive
features for this Romanization are:
U is a vowel in the middle of the alphabet, like ayin.
J is a consonant near the start of the alphabet, like tet.
X is a sibilant at the end of the alphabet, like shin.
W is at the end of the alphabet, like tav.
F is a consonant not otherwise used, for tsade.
Cyrillic cannot be mapped into single letters
of the plain Roman alphabet, since there are
33 Cyrillic letters and only 26 plain Roman letters.
However, since there are only 20 Cyrillic consonants
and 21 plain Roman consonants, TRANSOFT
employs the following consonant mapping:
TRANSOFT Single-letter Romanization for Cyrillic consonants:
b=be v=ve g=ge d=de j=zhe z=ze k=ka l=el m=em
n=en p=pe r=er s=es t=te f=ef c=kha y=tse q=che h=sha x=shcha
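The counting argument above can be checked mechanically. The Python sketch below is an illustration only: it pairs the 20 Cyrillic consonants with the Roman letters from the table, confirms that no Roman consonant is used twice, and reports which Roman consonant is left over. The mapping of the Cyrillic vowels and signs is not given here, so the function leaves any other character unchanged; its name is hypothetical.

```python
# Consonants in the order of the table above.
CYRILLIC_CONSONANTS = "бвгджзклмнпрстфхцчшщ"
ROMAN_EQUIVALENTS = "bvgdjzklmnprstfcyqhx"
ROMAN_CONSONANTS = set("bcdfghjklmnpqrstvwxyz")

CONSONANT_MAP = str.maketrans(CYRILLIC_CONSONANTS, ROMAN_EQUIVALENTS)

def romanize_consonants(text):
    """Romanize Cyrillic consonants; all other characters pass through."""
    return text.translate(CONSONANT_MAP)

unused = ROMAN_CONSONANTS - set(ROMAN_EQUIVALENTS)
print(len(set(ROMAN_EQUIVALENTS)), sorted(unused))
print(romanize_consonants("бвг"))
```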
CHINESE/JAPANESE/KOREAN (TRADITIONAL) ALPHABETS.
The Chinese, Japanese, and traditional Korean
(i.e., South Korean) character sets require
thousands of characters, or ideographs,
for appropriate display. The complete set
of these numerous characters cannot readily
be mapped into the Roman alphabet.
For Japanese, there are two syllabaries,
namely Hiragana and Katakana,
which TRANSOFT maps into a two-letter
Romanization, as follows:
aa ii uu ee oo
ka ki ku ke ko
sa si su se so
ta ti tu te to....
The more popular Romanizations use a combination
of 1-letter, 2-letter, and 3-letter representations,
which more closely approximate the sound values
of these syllables, but which are less convenient
for software to analyze:
a i u e o
ka ki ku ke ko
sa shi su se so
ta chi tsu te to....
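The convenience of the fixed two-letter form is that a Romanized word can be cut into syllables by position alone, with no lookahead. The Python sketch below illustrates this for the four rows shown above, using Hiragana for the output; the table construction, the function name, and the sample words are assumptions made for this example.

```python
# The four rows shown above: consonant (empty for the vowel row) -> kana.
KANA_ROWS = {"": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと"}
VOWELS = "aiueo"

TABLE = {}
for consonant, row in KANA_ROWS.items():
    for vowel, kana in zip(VOWELS, row):
        # Bare vowels are doubled (aa, ii, ...) so every code has two letters.
        TABLE[(consonant or vowel) + vowel] = kana

def to_kana(romanized):
    """Decode a two-letter Romanization by cutting it into fixed-width pairs."""
    if len(romanized) % 2:
        raise ValueError("two-letter Romanization must have even length")
    pairs = [romanized[i:i + 2] for i in range(0, len(romanized), 2)]
    return "".join(TABLE[pair] for pair in pairs)

print(to_kana("susi"))
print(to_kana("kaoo"))
print(to_kana("tuki"))
```

With the variable-width form (shi, chi, tsu), the decoder would instead have to decide at each position whether one, two, or three letters belong to the current syllable.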
For Korean, there is the Hangul alphabet,
which TRANSOFT maps into a one-letter
Romanization, as follows:
For Chinese and for the Chinese-based parts
of the Japanese and traditional Korean character
sets, the ideographs may be separated into components,
or RADICALs, which have short English names,
given in the file IADBIDEO.TXT, and managed
by the MUMPS program, IADBIDEO. The most popular
Internet character encodings are:
Japanese: Shift JIS (Shift Japanese Industrial Standard, SJIS).
Korean:
Traditional Chinese: Big5 (used in Taiwan, Macao, etc.).
Simplified Chinese: GB (used in Mainland China).
CONSTRUCTION OF AN IAD VOCABULARY.
The IAD vocabulary contains four classes of words
for translation:
(1) Common words, including articles, prepositions,
auxiliary verbs, pronouns, conjunctions,
and common verbs. Every language has only a few
thousand such words, which can be key-entered
in a single day of focused labor. Most elementary
textbooks for a foreign language will contain these words.
(2) Common medical terms, such as body parts
and common diseases, signs, and symptoms.
Many foreign language travel guides, such as the
Berlitz series, include a chapter
('visit to the doctor') containing most of these words.
(3) Synthetic words, in which a foreign
word may be synthesized by a computer algorithm
from the corresponding English word, such
as Haematologie (German) from hemato+logy (English).
The late Friedrich Wingert pioneered many of these
methods for the SNOMED translation into German,
and once claimed that 75% of SNOMED words can
be translated in this manner.
The MUMPS program IADBSYNV performs this synthesis.
(4) Overlay words, i.e., words which do
not fit into groups (1), (2), or (3), or which
give an incorrect result when synthesized.
The MUMPS program IADBOVLV performs this overlay.
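Classes (3) and (4) can be sketched together in Python. This is a hypothetical illustration of the idea and not the logic of IADBSYNV or IADBOVLV: the rule lists, the function names, and the overlay entries are all invented for this example, and the actual contents and format of the suffix file are not shown here. The rules are chosen so that the two German examples given in the text come out as stated.

```python
# Hypothetical English -> German rules, applied to lower-case English words.
SUFFIX_RULES = [("logy", "logie"), ("emia", "aemie")]
PREFIX_RULES = [("hemato", "haemato")]

def synthesize(word):
    """Return a synthetic translation, or None if no suffix rule applies."""
    for english, foreign in SUFFIX_RULES:
        if word.endswith(english):
            candidate = word[: -len(english)] + foreign
            break
    else:
        return None
    for english, foreign in PREFIX_RULES:
        if candidate.startswith(english):
            candidate = foreign + candidate[len(english):]
    return candidate.capitalize()  # German nouns are capitalized

def build_vocabulary(words, overlay):
    """Synthesize what the rules allow, then let overlay entries take precedence."""
    vocabulary = {}
    for word in words:
        translation = synthesize(word)
        if translation is not None:
            vocabulary[word] = translation
    vocabulary.update({word: overlay[word] for word in words if word in overlay})
    return vocabulary

overlay = {"heart": "Herz", "academia": "Akademie"}
vocabulary = build_vocabulary(["hematology", "ischemia", "heart", "academia"], overlay)
print(vocabulary)
```

The word "academia" happens to end in "emia", so the suffix rule produces a wrong form; the overlay entry replaces it, which is the correction role of class (4).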