---
layout: default
title: Preparsed UCD
parent: Design Docs
---

<!--
© 2016 and later: Unicode, Inc. and others.
License & terms of use: http://www.unicode.org/copyright.html
-->

# Preparsed UCD

## What

A text file with preparsed UCD ([Unicode Character
Database](http://www.unicode.org/ucd/)) data.

*   Preparser script:
    [tools/unicode/py/**preparseucd.py**](https://github.com/unicode-org/icu/blob/master/tools/unicode/py/preparseucd.py)
*   ppucd.txt output:
    [icu4c/source/data/unidata/**ppucd.txt**](https://github.com/unicode-org/icu/blob/master/icu4c/source/data/unidata/ppucd.txt)
    ([raw text
    version](https://raw.githubusercontent.com/unicode-org/icu/master/icu4c/source/data/unidata/ppucd.txt))
*   Parser for ppucd.txt:
    [icu4c/source/tools/toolutil/**ppucd.h**](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.h)
    &
    [.cpp](https://github.com/unicode-org/icu/blob/master/icu4c/source/tools/toolutil/ppucd.cpp)
*   genprops tool rewritten to use that:
    [tools/unicode/c/**genprops**](https://github.com/unicode-org/icu/tree/master/tools/unicode/c/genprops)

## Syntax

```
# Preparsed UCD generated by ICU preparseucd.py
```

Comments are whole lines starting with #. There are no inline comments.

```
ucd;10.0.0
```

Data lines start with a type keyword. Data fields are semicolon-separated. The
number of fields per line is highly variable.

The ucd line should be the first data line. It provides the Unicode version
number.
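
As a rough illustration of how little code this line format needs, here is a minimal Python sketch. The function name `read_data_lines` is hypothetical (it is not taken from preparseucd.py or ppucd.cpp), and skipping empty lines is an assumption, since the syntax above does not mention them.

```python
def read_data_lines(lines):
    """Yield (line_type, fields) for each data line, skipping comments."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Only whole-line comments exist, so a leading '#' is a sufficient test.
        # Skipping empty lines is an assumption, not part of the described syntax.
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        # The first field is the type keyword; the rest vary by line type.
        yield fields[0], fields[1:]


sample = [
    "# Preparsed UCD generated by ICU preparseucd.py\n",
    "ucd;10.0.0\n",
    "property;Binary;Alpha;Alphabetic\n",
]
parsed = list(read_data_lines(sample))
# The ucd line should come first and carries the Unicode version.
unicode_version = parsed[0][1][0] if parsed[0][0] == "ucd" else None
```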

```
property;Binary;Alpha;Alphabetic
property;Enumerated;bc;Bidi_Class
```

Property lines define properties with a type and two or more aliases.

```
binary;N;No;F;False
binary;Y;Yes;T;True
value;bc;ON;Other_Neutral
```

Property value lines define the values of enumerated and catalog properties,
with the property short name and two or more aliases for each value.

There is only one shared definition of the values and aliases for binary
properties.
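
The property, binary, and value lines could be collected into alias tables roughly as follows. This is a sketch, not the actual parser: `build_alias_tables` and its return values are hypothetical names, and it assumes that the first alias on each line is the short name, as in the samples above.

```python
def build_alias_tables(lines):
    """Collect alias tables from property, binary, and value lines.

    Assumes the first alias on each line is the short name, as in the samples.
    """
    prop_types = {}      # property short name -> type, e.g. "Binary"
    prop_aliases = {}    # any property alias -> property short name
    binary_aliases = {}  # any alias of a binary value -> short value ("N"/"Y")
    value_aliases = {}   # property short name -> {value alias -> short value}
    for line in lines:
        fields = line.rstrip("\r\n").split(";")
        kind = fields[0]
        if kind == "property":
            names = fields[2:]
            prop_types[names[0]] = fields[1]
            for name in names:
                prop_aliases[name] = names[0]
        elif kind == "binary":
            # One shared definition for the values of all binary properties.
            names = fields[1:]
            for name in names:
                binary_aliases[name] = names[0]
        elif kind == "value":
            names = fields[2:]
            table = value_aliases.setdefault(fields[1], {})
            for name in names:
                table[name] = names[0]
    return prop_types, prop_aliases, binary_aliases, value_aliases


sample = [
    "property;Binary;Alpha;Alphabetic",
    "property;Enumerated;bc;Bidi_Class",
    "binary;N;No;F;False",
    "binary;Y;Yes;T;True",
    "value;bc;ON;Other_Neutral",
]
prop_types, prop_aliases, binary_aliases, value_aliases = build_alias_tables(sample)
```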

```
defaults;0000..10FFFF;age=NA;bc=L;blk=NB;bpt=n;cf=<code point>;dm=<code point>;dt=None;ea=N;FC_NFKC=<code point>;gc=Cn;GCB=XX;gcm=Cn;hst=NA;InPC=NA;InSC=Other;jg=No_Joining_Group;jt=U;lb=XX;lc=<code point>;NFC_QC=Y;NFD_QC=Y;NFKC_CF=<code point>;NFKC_QC=Y;NFKD_QC=Y;nt=None;SB=XX;sc=Zzzz;scf=<code point>;scx=<script>;slc=<code point>;stc=<code point>;suc=<code point>;tc=<code point>;uc=<code point>;vo=R;WB=XX
```

After the version, property, and property value lines, and before other data
lines, the defaults line defines default values for all code points
(corresponding to @missing data in the UCD). Any properties not mentioned here
default to null values according to their type, such as False or the empty
string.

The general syntax of this line is the same as for the following data lines:

1.  Line type keyword.
2.  Code point or start..end range (inclusive end).
3.  Zero or more property values.
    *   Binary values are given by their property name alone if True ("Alpha"),
        or with a minus sign prepended if False ("-Alpha").
    *   Other values are given as "pname=value" pairs, where pname is the
        property name.
    *   In the ppucd.txt file, short names of properties and values are used,
        but parsers should be prepared to accept any of the aliases according to
        the earlier sections of the file.
    *   In the ppucd.txt file, properties are listed in sorted order, but this
        is not required by the syntax.
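
The three parts of such a line can be picked apart with a few string operations. The following Python sketch uses hypothetical helper names (`parse_range`, `parse_data_line`) and does not resolve aliases. It stores binary properties as `True`/`False` and all other values as strings. The third sample line is made up for illustration and is not copied from ppucd.txt.

```python
def parse_range(text):
    """Parse 'XXXX' or 'XXXX..YYYY' into an inclusive (start, end) pair."""
    start, sep, end = text.partition("..")
    first = int(start, 16)
    return first, (int(end, 16) if sep else first)


def parse_data_line(line):
    """Split a defaults/block/cp/unassigned line into its three parts."""
    fields = line.split(";")
    line_type = fields[0]
    rng = parse_range(fields[1])
    props = {}
    for field in fields[2:]:
        name, sep, value = field.partition("=")
        if sep:
            props[name] = value       # "pname=value"
        elif name.startswith("-"):
            props[name[1:]] = False   # "-Alpha": binary property is False
        else:
            props[name] = True        # "Alpha": binary property is True
    return line_type, rng, props


cp_line = parse_data_line("cp;20001;nt=Nu;nv=7")
range_line = parse_data_line("unassigned;2A6D7..2A6DF;ea=W;lb=ID;vo=U")
# Illustrative line only, not copied from ppucd.txt.
binary_line = parse_data_line("cp;0041;Alpha;-Lower;gc=Lu")
```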

```
block;20000..2A6DF;age=3.1;Alpha;blk=CJK_Ext_B;ea=W;gc=Lo;Gr_Base;IDC;Ideo;IDS;lb=ID;SB=LE;sc=Hani;UIdeo;vo=U;XIDC;XIDS
# 20000..2A6D6 CJK Unified Ideographs Extension B
algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-
cp;20001;nt=Nu;nv=7
cp;20064;nt=Nu;nv=4
unassigned;2A6D7..2A6DF;ea=W;lb=ID;vo=U
# No block
unassigned;2A6E0..2A6FF;ea=W;lb=ID;vo=U
algnamesrange;AC00..D7A3;hangul
```

Block lines specify a Unicode Block and provide an opportunity for compact data
lines for ranges inside the block, by listing common property values once for
the whole block. Block properties override the defaults for cp and unassigned
lines with code point ranges inside the block. The file syntax and parser do not
require the presence of block lines.

cp lines provide the data for a code point or range. They override the
default+block properties. Properties that are not mentioned fall back to the
block, then to the defaults.

unassigned lines (new in ICU 60 for Unicode 10) provide the data for an
unassigned code point or range (gc=Cn). They override only the default
properties: properties that are not mentioned fall back to the defaults, not to
the block. The one exception is the blk=Block property, which still applies to
an unassigned line if its range is inside a block.

A range is considered inside a block if it is fully inside the range of the last
defined block. Otherwise it is considered outside a block and falls back only to
the defaults. This is the case even if the range is inside an earlier block, to
simplify parsing & processing (such data lines should be avoided).

A range inside the block for which there is no data line inherits all of the
default+block properties (see Han blocks). Note that this is very different from
the behavior of an unassigned line, in particular since such blocks typically
default to gc!=Cn.
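
The fallback rules above can be summarized in a short Python sketch. It assumes the lines have already been parsed into property dictionaries. The function name `resolve` and the data layout are hypothetical, and the property values in the example are abbreviated from the samples above.

```python
def resolve(line_type, rng, props, defaults, block):
    """Return the effective properties for one cp or unassigned line.

    defaults: property dict from the defaults line.
    block: None, or ((start, end), props) for the last block line seen.
    """
    start, end = rng
    # "Inside" means fully inside the range of the last defined block.
    inside = block is not None and block[0][0] <= start and end <= block[0][1]
    effective = dict(defaults)
    if inside:
        block_props = block[1]
        if line_type == "cp":
            # cp lines fall back to the block, then to the defaults.
            effective.update(block_props)
        elif "blk" in block_props:
            # unassigned lines inherit only blk from the block.
            effective["blk"] = block_props["blk"]
    effective.update(props)
    return effective


defaults = {"blk": "NB", "ea": "N", "gc": "Cn", "lb": "XX", "vo": "R"}
block = ((0x20000, 0x2A6DF),
         {"Alpha": True, "blk": "CJK_Ext_B", "ea": "W", "gc": "Lo", "lb": "ID", "vo": "U"})

in_block_cp = resolve("cp", (0x20001, 0x20001), {"nt": "Nu", "nv": "7"}, defaults, block)
in_block_unassigned = resolve("unassigned", (0x2A6D7, 0x2A6DF),
                              {"ea": "W", "lb": "ID", "vo": "U"}, defaults, block)
no_block_unassigned = resolve("unassigned", (0x2A6E0, 0x2A6FF),
                              {"ea": "W", "lb": "ID", "vo": "U"}, defaults, block)
```

In this example the unassigned range inside the block keeps gc=Cn from the defaults but still reports blk=CJK_Ext_B, while the range outside the block falls back to blk=NB.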

Non-default properties for unassigned ranges inside and outside of blocks are
typically for [complex
defaults](http://www.unicode.org/reports/tr44/#Default_Values_Table) and for
noncharacters.

ppucd.txt data lines are in code point order, although this should not be
strictly required.

Assigned characters normally have their unique na=Name property value. For
Hangul syllables with their algorithmically computed names, the entire range is
covered by the line "algnamesrange;AC00..D7A3;hangul". For ranges of ideographic
characters, a line like "algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-"
provides a Name prefix which is to be followed by the code point (in hex like
%04lX).
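
A consumer could turn algnamesrange lines into names along these lines. This is a sketch under assumptions: the helper names are hypothetical, and the Jamo short names are hard-coded here from the standard Hangul syllable name algorithm, since they are not carried on the algnamesrange line itself.

```python
# Jamo short names for the standard Hangul syllable name algorithm
# (hard-coded for this sketch; not part of the algnamesrange line).
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
          "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
          "K", "T", "P", "H"]


def parse_algnamesrange(line):
    """Return (start, end, kind, prefix) for one algnamesrange line."""
    fields = line.split(";")
    start, _, end = fields[1].partition("..")
    prefix = fields[3] if len(fields) > 3 else ""
    return int(start, 16), int(end or start, 16), fields[2], prefix


def algorithmic_name(cp, ranges):
    """Compute the name for cp, or None if no algorithmic range covers it."""
    for start, end, kind, prefix in ranges:
        if not start <= cp <= end:
            continue
        if kind == "han":
            # Prefix followed by the code point in uppercase hex, 4+ digits.
            return "%s%04X" % (prefix, cp)
        if kind == "hangul":
            l_index, rest = divmod(cp - 0xAC00, len(JAMO_V) * len(JAMO_T))
            v_index, t_index = divmod(rest, len(JAMO_T))
            return ("HANGUL SYLLABLE " + JAMO_L[l_index]
                    + JAMO_V[v_index] + JAMO_T[t_index])
    return None


ranges = [
    parse_algnamesrange("algnamesrange;20000..2A6D6;han;CJK UNIFIED IDEOGRAPH-"),
    parse_algnamesrange("algnamesrange;AC00..D7A3;hangul"),
]
han_name = algorithmic_name(0x20001, ranges)
hangul_name = algorithmic_name(0xAC00, ranges)
```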

## Why not UCD .txt files?

See [UAX #44 "Unicode Character Database"](http://www.unicode.org/reports/tr44/)

Nontrivial parsing:

*   The UCD has grown from a couple of semicolon-delimited files plus an
    informative "Property dump" (early PropList.txt) to a collection of dozens
    of files with a variety of (now more regular) formats.
*   Related properties are scattered over several files.
*   Full information for Numeric_Value and Numeric_Type requires parsing two
    files.
*   Default values are "hidden" in comments.
*   The UCD folder structure (which file where) has changed over time.
*   UCD filenames change during each Unicode beta period. (A detailed version
    number is inserted into each filename.)
*   Many files are bloated with comments that show the General Category and name
    of each character or range start/end; if the data were combined into a
    single file, then all properties for a character or range would be listed
    together, without need for such comments.
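
To illustrate the "hidden" defaults: a UCD .txt parser has to look inside comment lines for the `@missing` convention described in UAX #44. The sketch below is only an approximation of that convention. The helper name `parse_missing` is hypothetical, and the sample strings are illustrative rather than copied from a specific UCD file.

```python
import re

# Matches comment lines shaped like "# @missing: 0000..10FFFF; value" or
# "# @missing: 0000..10FFFF; Property; value".
MISSING_RE = re.compile(
    r"#\s*@missing:\s*([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+)\s*;\s*(.*)")


def parse_missing(line):
    """Return ((start, end), fields) for an @missing comment, else None."""
    match = MISSING_RE.match(line)
    if match is None:
        return None
    start = int(match.group(1), 16)
    end = int(match.group(2), 16)
    fields = [field.strip() for field in match.group(3).split(";")]
    return (start, end), fields


single = parse_missing("# @missing: 0000..10FFFF; Left_To_Right")
with_property = parse_missing("# @missing: 0000..10FFFF; Bidi_Class; Left_To_Right")
ordinary_comment = parse_missing("# Some other comment")
```

In ppucd.txt the same information sits on the single defaults line, so no comment parsing is needed.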

Nontrivial patching: Adding characters (e.g., PUA or proposed/draft) requires
adding data in many of the UCD files.

ICU already preprocesses some of the UCD .txt files. We strip comments from some
files (because they are huge) and in some files merge adjacent same-property
code points into ranges.
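
Merging adjacent same-property code points into ranges is a simple single pass. The following Python sketch shows the idea. It is not the code ICU used, `merge_ranges` is a hypothetical name, and the input is assumed to be sorted by code point.

```python
def merge_ranges(entries):
    """Merge adjacent code points with equal values into inclusive ranges.

    entries: iterable of (code_point, value), sorted by code point.
    Returns a list of (start, end, value) tuples.
    """
    merged = []
    for cp, value in entries:
        if merged and merged[-1][1] + 1 == cp and merged[-1][2] == value:
            # Extend the previous range by one code point.
            start = merged[-1][0]
            merged[-1] = (start, cp, value)
        else:
            merged.append((cp, cp, value))
    return merged


ranges = merge_ranges([
    (0x41, "L"), (0x42, "L"), (0x43, "L"),  # adjacent, same value
    (0x45, "L"),                            # gap at 0x44 starts a new range
    (0x46, "R"),                            # adjacent but different value
])
```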

Some changes are manual, such as updating and adding ranges of algorithmic
character names.

Then we run several tools, most of them twice, to parse different sets of .txt
files and write several output files. We use several Python and shell scripts,
and a "log" (unidata/changes.txt) with details of what was changed and run in
each Unicode version upgrade.

Markus has done ICU Unicode updates since about 2002. Someone else might have a
hard time picking this up for maintenance and future Unicode version updates.

## Why not UCD XML files?

See [UAX #42 "Unicode Character Database in
XML"](http://www.unicode.org/reports/tr42/)

Good: The UCD XML file format stores all properties in a single file with a
relatively simple structure, with property values as XML attributes.

Issues:

*   **Missing data** which is needed for ICU
    *   Name_Alias added in UCD 5.0 but missing in UCD XML as of UCD 6.1 beta.
    *   Script_Extensions added in UCD 6.0 but not "blessed" as a Unicode
        property as of UCD 6.1. Useful, used in ICU, but not available in UCD
        XML.
    *   Adopting UCD XML would require either still parsing some UCD .txt
        files or writing another tool to merge more data into the XML.
*   Dependency on third party
    *   Lag time between UCD .txt vs. XML availability during beta.
    *   Unable to fix/update/extend XML generator tools.
    *   For new properties, need to wait for standardization (UAX #42), tool
        update, and XML publication.
    *   Will not support custom/nonstandard data.
*   Could be simpler: Parsing XML is easy in Java, Python, etc. and doable in
    C++ (we have a "poor man's" XML parser), but not as easy as
    `line.split(";")`.
    *   There is no need for complex structure for the UCD.
*   Could be easier to read for humans: Because defaults for all of Unicode are
    not stored in one place, each `<group>` carries them, making it hard to see
    which values are specific to each group. "Fluffy" XML makes for longer text
    lines and more horizontal scrolling.
*   Hard to diff: The XML format can be used in different ways, and Unicode
    publishes different forms of the same data. Also, the precise XML text
    depends on the XML formatting code used.
    *   For diffing, a special tool needs to be run, parse old & new XML data,
        compare values and generate a diff report. Unicode publishes some of
        those too.
*   Some data still requires nontrivial parsing.
    *   For algorithmic character names, the range needs to be determined by
        collecting a contiguous sequence of elements with a shared name pattern.
        There is not even any special notation for the algorithmic names for
        Hangul syllables.
*   Minor: Unnecessary data (for ICU)
    *   Precomputed Hangul syllable names
    *   Irrelevant contributory properties like "Other_Xyz"
    *   Properties not used by ICU
*   Minor, just awkward: Blocks are treated as auxiliary data, rather than as a
    core means to organize and store the data. On the other hand, the "grouped"
    XML files also use them as the basis for the `<group>` elements and associated
    compaction. (The "flat" files don't.)

## Goals

*   Single file with all data relevant for ICU.
*   Very easy to parse and use the data in C/C++ tools.
*   Easily human readable.
*   Easy-to-read diffs from standard diff tools.
*   Compact file format.
*   Conversion tool easy to write, maintain, extend.
*   Convert from UCD .txt files because those are maintained directly by the UTC
    & editorial committee. No waiting for third party to convert the files.
*   Able to extend for new kinds of data.
*   Easy format for manual data fixes/additions (e.g., PUA or proposed/draft).
*   Move much of the parsing from scattered C code into one Python script.

## Details

*   All-Unicode defaults in one place, but only list non-null default values.
    (`blk=No_Block, cf=<code point>, ...`)
*   Line-oriented, always semicolon-separated, with type-of-line in the first
    field.
*   Block properties override defaults. They are used only for the few
    properties whose values are common within the block and differ from the
    defaults.
    *   Effective because blocks represent actual allocation & organization of
        Unicode. Maintained by UTC.
*   Code point/range properties override default+block properties.
*   Algorithmic names stored as ranges with type & shared name prefixes (for
    CJK).
*   No gratuitous white space or syntax characters.
*   Mostly key=value, simpler format for binary properties. Easy to read.
*   Comment lines with headings from NamesList.txt further improve readability.
    (There are few of them, so no significant size bloat.)
*   Simple, stable file generation allows diffing.
    *   E.g., list properties in sorted order of property names.
*   No need to implement/store properties that are not used in ICU. (But format
    & tool are easy to extend.)
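
On the generating side, stable output mostly comes down to emitting fields in a fixed order. A minimal Python sketch follows, with a hypothetical function name `format_data_line`. Sorting property names case-insensitively is an assumption inferred from the order seen in the sample lines above, not a documented rule.

```python
def format_data_line(line_type, start, end, props):
    """Format one data line with properties in a stable, sorted order."""
    if start == end:
        rng = "%04X" % start
    else:
        rng = "%04X..%04X" % (start, end)
    fields = [line_type, rng]
    # Case-insensitive order is an assumption based on the sample lines.
    for name in sorted(props, key=str.lower):
        value = props[name]
        if value is True:
            fields.append(name)            # binary property, True
        elif value is False:
            fields.append("-" + name)      # binary property, False
        else:
            fields.append("%s=%s" % (name, value))
    return ";".join(fields)


cp_line = format_data_line("cp", 0x20001, 0x20001, {"nv": "7", "nt": "Nu"})
block_line = format_data_line("block", 0x20000, 0x2A6DF, {
    "XIDS": True, "XIDC": True, "vo": "U", "UIdeo": True, "sc": "Hani",
    "SB": "LE", "lb": "ID", "IDS": True, "Ideo": True, "IDC": True,
    "Gr_Base": True, "gc": "Lo", "ea": "W", "blk": "CJK_Ext_B",
    "Alpha": True, "age": "3.1",
})
```

Because the order does not depend on how the properties were collected, regenerating the file for a new Unicode version changes only the lines whose data changed, which keeps standard diffs readable.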

## Plan

*   (done) Write Python tool to preparse UCD .txt files and generate one output
    ppucd.txt file.
*   (done) Subsume existing ucdcopy.py.
*   (done) Write toolutil C++ parser for ppucd.txt, add ppucd.txt to the unidata
    folder.
*   (done) Merge genbidi, gencase, gennames, gennorm into genprops
    *   Replace scattered many-.txt parsers with calls to the toolutil ppucd.txt
        parser.
    *   Generate all output files in one genprops invocation.
    *   Update makeprops.sh (delete half of it) & changes.txt.
*   (done) Make preparseucd.py also parse uchar.h & uscript.h and write the
    property names data header file. (was: ~~Change genpname/preparse.pl to read
    ppucd.txt rather than Property\[Value\]Aliases.txt.~~)
*   (done) Consider changing pnames_data.h so that minor changes don't change
    most of the file contents.
*   (done) Write wiki/Markus/ReviewTicket8972 with diff links.
    *   2019-sep-27: The old Trac server is going away. I copied the wiki page
        contents into a comment on
        [ICU-8972](https://unicode-org.atlassian.net/browse/ICU-8972).
*   Move UCD tests from cintltst to intltest, change to use the toolutil
    ppucd.txt parser. ([ticket
    #9041](https://unicode-org.atlassian.net/browse/ICU-9041))
*   Change Java UCD tests to parse & use ppucd.txt. (ticket #9041)
*   (partially done) Change Python preparser to not copy input UCD .txt files
    any more, delete them from unidata & Java. (ticket #9041)

## Other tool improvements

**Bad**: Until **ICU 4.8**, the process was

build & install ICU -> build Unicode tools -> run genpname -> build & install
ICU (now with updated property names) -> build Unicode tools -> run UCD parsers
-> build & install ICU (now also with case properties & normalization etc.) ->
build Unicode tools -> run genuca -> build & install ICU

It should be possible to

1.  merge the Unicode tools into one binary
2.  parameterize the relevant properties code (property name lookup, case & some
    other properties, NFC)
3.  inject newly built data into the common library for the next part of the
    merged Unicode tool's processing.

**ICU 49**:

build & install ICU -> build Unicode tools -> run genprops -> build & install
ICU (now with updated properties) -> build Unicode tools -> run genuca -> build
& install ICU

genprops builds the property (value) names data and injects it into the live
ppucd.txt parser for further processing.

**Goal**:

build & install ICU -> build Unicode tool -> run it -> build & install ICU (now
with all updated Unicode data)

Requires [ticket #9040](https://unicode-org.atlassian.net/browse/ICU-9040),
could be "hard".
