diff --git a/third_party/README.md b/third_party/README.md index e17040ef0..655c826e8 100644 --- a/third_party/README.md +++ b/third_party/README.md @@ -1,8 +1,20 @@ - * [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features) commit: fc1bd6240c2008412ab64dc25045cd872f5e126c ref: https://zhuanlan.zhihu.com/p/55371926 +licence: MIT * [python-pinyin](https://github.com/mozillazg/python-pinyin.git) - commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03 - licence: MIT +commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03 +licence: MIT + +* [zhon](https://github.com/tsroten/zhon) +commit: 09bf543696277f71de502506984661a60d24494c +licence: MIT + +* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git) +commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d +licence: MIT + +* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git) +commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c +licence: MIT diff --git a/third_party/chinese_text_normalization/.gitignore b/third_party/chinese_text_normalization/.gitignore new file mode 100644 index 000000000..f50f06f32 --- /dev/null +++ b/third_party/chinese_text_normalization/.gitignore @@ -0,0 +1,2 @@ +*~ +*.far diff --git a/third_party/chinese_text_normalization/LICENSE b/third_party/chinese_text_normalization/LICENSE new file mode 100644 index 000000000..c6be42fba --- /dev/null +++ b/third_party/chinese_text_normalization/LICENSE @@ -0,0 +1,21 @@ +MIT License + +Copyright (c) 2020 SpeechIO + +Permission is hereby granted, free of charge, to any person obtaining a copy +of this software and associated documentation files (the "Software"), to deal +in the Software without restriction, including without limitation the rights +to use, copy, modify, merge, publish, distribute, sublicense, and/or sell +copies of the Software, and to permit persons to whom the Software is +furnished to do so, subject to the following conditions: + +The above copyright notice and this permission notice shall be included in all +copies or substantial portions of the Software. + +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +SOFTWARE. diff --git a/third_party/chinese_text_normalization/README.md b/third_party/chinese_text_normalization/README.md new file mode 100644 index 000000000..fd5182594 --- /dev/null +++ b/third_party/chinese_text_normalization/README.md @@ -0,0 +1,112 @@ +# Chinese Text Normalization for Speech Processing + +## Problem + +Search for "Text Normalization"(TN) on Google and Github, you can hardly find open-source projects that are "read-to-use" for text normalization tasks. Instead, you find a bunch of NLP toolkits or frameworks that *supports* TN functionality. There is quite some work between "support text normalization" and "do text normalization". + +## Reason + +* TN is language-dependent, more or less. + + Some of TN processing methods are shared across languages, but a good TN module always involves language-specific knowledge and treatments, more or less. + +* TN is task-specific. + + Even for the same language, different applications require quite different TN. + +* TN is "dirty" + + Constructing and maintaining a set of TN rewrite-rules is painful, whatever toolkits and frameworks you choose. Subtle and intrinsic complexities hide inside TN task itself, not in tools or frameworks. + +* mature TN module is an asset + + Since constructing and maintaining TN is hard, it is actually an asset for commercial companies, hence it is unlikely to find a product-level TN in open-source community (correct me if you find any) + +* TN is a less important topic for either academic or commercials. + +## Goal + +This project sets up a ready-to-use TN module for **Chinese**. Since my background is **speech processing**, this project should be able to handle most common TN tasks, in **Chinese ASR** text processing pipelines. + +## Normalizers + +1. supported NSW (Non-Standard-Word) Normalization + + |NSW type|raw|normalized| + |-|-|-| + |cardinal|这块黄金重达324.75克|这块黄金重达三百二十四点七五克| + |date|她出生于86年8月18日,她弟弟出生于1995年3月1日|她出生于八六年八月十八日 她弟弟出生于一九九五年三月一日| + |digit|电影中梁朝伟扮演的陈永仁的编号27149|电影中梁朝伟扮演的陈永仁的编号二七一四九| + |fraction|现场有7/12的观众投出了赞成票|现场有十二分之七的观众投出了赞成票| + |money|随便来几个价格12块5,34.5元,20.1万|随便来几个价格十二块五 三十四点五元 二十点一万| + |percentage|明天有62%的概率降雨|明天有百分之六十二的概率降雨| + |telephone|这是固话0421-33441122
这是手机+86 18544139121|这是固话零四二一三三四四一一二二
这是手机八六一八五四四一三九一二一| + + acknowledgement: the NSW normalization codes are based on [Zhiyang Zhou's work here](https://github.com/Joee1995/chn_text_norm.git) + +1. punctuation removal + + For Chinese, it removes punctuation list collected in [Zhon](https://github.com/tsroten/zhon) project, containing + * non-stop puncs + ``` + '"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏' + ``` + * stop puncs + ``` + '!?。。' + ``` + + For English, it removes Python's `string.punctuation` + +1. multilingual English word upper/lower case conversion + since ASR/TTS lexicons usually unify English entries to uppercase or lowercase, the TN module should adapt with lexicon accordingly. + +## Supported text format + +1. plain text, preferably one sentence per line(most common case in ASR processing). + ``` + 今天早饭吃了没 + 没吃回家吃去吧 + ... + ``` + plain text is default format. + +2. Kaldi's transcription format + ``` + KALDI_KEY_UTT001 今天早饭吃了没 + KALDI_KEY_UTT002 没吃回家吃去吧 + ... + ``` + TN will skip first column key section, normalize latter transcription text + + pass `--has_key` option to switch to kaldi format. + +_note: All input text should be UTF-8 encoded._ + +## Run examples + +* TN (python) + +make sure you have **python3**, python2.X won't work correctly. + +`sh run.sh` in `TN` dir, and compare raw text and normalized text. + +* ITN (thrax) + +make sure you have **thrax** installed, and your PATH should be able to find thrax binaries. + +`sh run.sh` in `ITN` dir. check Makefile for grammar dependency. + +## possible future work + +Since TN is a typical "done is better than perfect" module in context of ASR, and the current state is sufficient for my purpose, I probably won't update this repo frequently. + +there are indeed something that needs to be improved: + +* For TN, NSW normalizers in TN dir are based on regular expression, I've found some unintended matches, those pattern regexps need to be refined for more precise TN coverage. + +* For ITN, extend those thrax rewriting grammars to cover more scenarios. + +* Further more, nowadays commercial systems start to introduce RNN-like models into TN, and a mix of (rule-based & model-based) system is state-of-the-art. More readings about this, look for Richard Sproat and KyleGorman's work at Google. + +END diff --git a/third_party/chinese_text_normalization/python/cn_tn.py b/third_party/chinese_text_normalization/python/cn_tn.py new file mode 100755 index 000000000..bac1c19ea --- /dev/null +++ b/third_party/chinese_text_normalization/python/cn_tn.py @@ -0,0 +1,794 @@ +#!/usr/bin/env python3 +# coding=utf-8 +# Authors: +# 2019.5 Zhiyang Zhou (https://github.com/Joee1995/chn_text_norm.git) +# 2019.9 Jiayu DU +# +# requirements: +# - python 3.X +# notes: python 2.X WILL fail or produce misleading results + +import sys, os, argparse, codecs, string, re + +# ================================================================================ # +# basic constant +# ================================================================================ # +CHINESE_DIGIS = u'零一二三四五六七八九' +BIG_CHINESE_DIGIS_SIMPLIFIED = u'零壹贰叁肆伍陆柒捌玖' +BIG_CHINESE_DIGIS_TRADITIONAL = u'零壹貳參肆伍陸柒捌玖' +SMALLER_BIG_CHINESE_UNITS_SIMPLIFIED = u'十百千万' +SMALLER_BIG_CHINESE_UNITS_TRADITIONAL = u'拾佰仟萬' +LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'亿兆京垓秭穰沟涧正载' +LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'億兆京垓秭穰溝澗正載' +SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED = u'十百千万' +SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL = u'拾佰仟萬' + +ZERO_ALT = u'〇' +ONE_ALT = u'幺' +TWO_ALTS = [u'两', u'兩'] + +POSITIVE = [u'正', u'正'] +NEGATIVE = [u'负', u'負'] +POINT = [u'点', u'點'] +# PLUS = [u'加', u'加'] +# SIL = [u'杠', u'槓'] + +# 中文数字系统类型 +NUMBERING_TYPES = ['low', 'mid', 'high'] + +CURRENCY_NAMES = '(人民币|美元|日元|英镑|欧元|马克|法郎|加拿大元|澳元|港币|先令|芬兰马克|爱尔兰镑|' \ + '里拉|荷兰盾|埃斯库多|比塞塔|印尼盾|林吉特|新西兰元|比索|卢布|新加坡元|韩元|泰铢)' +CURRENCY_UNITS = '((亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)' +COM_QUANTIFIERS = '(匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|' \ + '砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|' \ + '针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|' \ + '毫|厘|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|' \ + '盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|' \ + '纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块)' + +# punctuation information are based on Zhon project (https://github.com/tsroten/zhon.git) +CHINESE_PUNC_STOP = '!?。。' +CHINESE_PUNC_NON_STOP = '"#$%&'()*+,-/:;<=>@[\]^_`{|}~⦅⦆「」、、〃《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏' +CHINESE_PUNC_OTHER = '·〈〉-' +CHINESE_PUNC_LIST = CHINESE_PUNC_STOP + CHINESE_PUNC_NON_STOP + CHINESE_PUNC_OTHER + +# ================================================================================ # +# basic class +# ================================================================================ # +class ChineseChar(object): + """ + 中文字符 + 每个字符对应简体和繁体, + e.g. 简体 = '负', 繁体 = '負' + 转换时可转换为简体或繁体 + """ + + def __init__(self, simplified, traditional): + self.simplified = simplified + self.traditional = traditional + #self.__repr__ = self.__str__ + + def __str__(self): + return self.simplified or self.traditional or None + + def __repr__(self): + return self.__str__() + + +class ChineseNumberUnit(ChineseChar): + """ + 中文数字/数位字符 + 每个字符除繁简体外还有一个额外的大写字符 + e.g. '陆' 和 '陸' + """ + + def __init__(self, power, simplified, traditional, big_s, big_t): + super(ChineseNumberUnit, self).__init__(simplified, traditional) + self.power = power + self.big_s = big_s + self.big_t = big_t + + def __str__(self): + return '10^{}'.format(self.power) + + @classmethod + def create(cls, index, value, numbering_type=NUMBERING_TYPES[1], small_unit=False): + + if small_unit: + return ChineseNumberUnit(power=index + 1, + simplified=value[0], traditional=value[1], big_s=value[1], big_t=value[1]) + elif numbering_type == NUMBERING_TYPES[0]: + return ChineseNumberUnit(power=index + 8, + simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1]) + elif numbering_type == NUMBERING_TYPES[1]: + return ChineseNumberUnit(power=(index + 2) * 4, + simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1]) + elif numbering_type == NUMBERING_TYPES[2]: + return ChineseNumberUnit(power=pow(2, index + 3), + simplified=value[0], traditional=value[1], big_s=value[0], big_t=value[1]) + else: + raise ValueError( + 'Counting type should be in {0} ({1} provided).'.format(NUMBERING_TYPES, numbering_type)) + + +class ChineseNumberDigit(ChineseChar): + """ + 中文数字字符 + """ + + def __init__(self, value, simplified, traditional, big_s, big_t, alt_s=None, alt_t=None): + super(ChineseNumberDigit, self).__init__(simplified, traditional) + self.value = value + self.big_s = big_s + self.big_t = big_t + self.alt_s = alt_s + self.alt_t = alt_t + + def __str__(self): + return str(self.value) + + @classmethod + def create(cls, i, v): + return ChineseNumberDigit(i, v[0], v[1], v[2], v[3]) + + +class ChineseMath(ChineseChar): + """ + 中文数位字符 + """ + + def __init__(self, simplified, traditional, symbol, expression=None): + super(ChineseMath, self).__init__(simplified, traditional) + self.symbol = symbol + self.expression = expression + self.big_s = simplified + self.big_t = traditional + + +CC, CNU, CND, CM = ChineseChar, ChineseNumberUnit, ChineseNumberDigit, ChineseMath + + +class NumberSystem(object): + """ + 中文数字系统 + """ + pass + + +class MathSymbol(object): + """ + 用于中文数字系统的数学符号 (繁/简体), e.g. + positive = ['正', '正'] + negative = ['负', '負'] + point = ['点', '點'] + """ + + def __init__(self, positive, negative, point): + self.positive = positive + self.negative = negative + self.point = point + + def __iter__(self): + for v in self.__dict__.values(): + yield v + + +# class OtherSymbol(object): +# """ +# 其他符号 +# """ +# +# def __init__(self, sil): +# self.sil = sil +# +# def __iter__(self): +# for v in self.__dict__.values(): +# yield v + + +# ================================================================================ # +# basic utils +# ================================================================================ # +def create_system(numbering_type=NUMBERING_TYPES[1]): + """ + 根据数字系统类型返回创建相应的数字系统,默认为 mid + NUMBERING_TYPES = ['low', 'mid', 'high']: 中文数字系统类型 + low: '兆' = '亿' * '十' = $10^{9}$, '京' = '兆' * '十', etc. + mid: '兆' = '亿' * '万' = $10^{12}$, '京' = '兆' * '万', etc. + high: '兆' = '亿' * '亿' = $10^{16}$, '京' = '兆' * '兆', etc. + 返回对应的数字系统 + """ + + # chinese number units of '亿' and larger + all_larger_units = zip( + LARGER_CHINESE_NUMERING_UNITS_SIMPLIFIED, LARGER_CHINESE_NUMERING_UNITS_TRADITIONAL) + larger_units = [CNU.create(i, v, numbering_type, False) + for i, v in enumerate(all_larger_units)] + # chinese number units of '十, 百, 千, 万' + all_smaller_units = zip( + SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED, SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL) + smaller_units = [CNU.create(i, v, small_unit=True) + for i, v in enumerate(all_smaller_units)] + # digis + chinese_digis = zip(CHINESE_DIGIS, CHINESE_DIGIS, + BIG_CHINESE_DIGIS_SIMPLIFIED, BIG_CHINESE_DIGIS_TRADITIONAL) + digits = [CND.create(i, v) for i, v in enumerate(chinese_digis)] + digits[0].alt_s, digits[0].alt_t = ZERO_ALT, ZERO_ALT + digits[1].alt_s, digits[1].alt_t = ONE_ALT, ONE_ALT + digits[2].alt_s, digits[2].alt_t = TWO_ALTS[0], TWO_ALTS[1] + + # symbols + positive_cn = CM(POSITIVE[0], POSITIVE[1], '+', lambda x: x) + negative_cn = CM(NEGATIVE[0], NEGATIVE[1], '-', lambda x: -x) + point_cn = CM(POINT[0], POINT[1], '.', lambda x, + y: float(str(x) + '.' + str(y))) + # sil_cn = CM(SIL[0], SIL[1], '-', lambda x, y: float(str(x) + '-' + str(y))) + system = NumberSystem() + system.units = smaller_units + larger_units + system.digits = digits + system.math = MathSymbol(positive_cn, negative_cn, point_cn) + # system.symbols = OtherSymbol(sil_cn) + return system + + +def chn2num(chinese_string, numbering_type=NUMBERING_TYPES[1]): + + def get_symbol(char, system): + for u in system.units: + if char in [u.traditional, u.simplified, u.big_s, u.big_t]: + return u + for d in system.digits: + if char in [d.traditional, d.simplified, d.big_s, d.big_t, d.alt_s, d.alt_t]: + return d + for m in system.math: + if char in [m.traditional, m.simplified]: + return m + + def string2symbols(chinese_string, system): + int_string, dec_string = chinese_string, '' + for p in [system.math.point.simplified, system.math.point.traditional]: + if p in chinese_string: + int_string, dec_string = chinese_string.split(p) + break + return [get_symbol(c, system) for c in int_string], \ + [get_symbol(c, system) for c in dec_string] + + def correct_symbols(integer_symbols, system): + """ + 一百八 to 一百八十 + 一亿一千三百万 to 一亿 一千万 三百万 + """ + + if integer_symbols and isinstance(integer_symbols[0], CNU): + if integer_symbols[0].power == 1: + integer_symbols = [system.digits[1]] + integer_symbols + + if len(integer_symbols) > 1: + if isinstance(integer_symbols[-1], CND) and isinstance(integer_symbols[-2], CNU): + integer_symbols.append( + CNU(integer_symbols[-2].power - 1, None, None, None, None)) + + result = [] + unit_count = 0 + for s in integer_symbols: + if isinstance(s, CND): + result.append(s) + unit_count = 0 + elif isinstance(s, CNU): + current_unit = CNU(s.power, None, None, None, None) + unit_count += 1 + + if unit_count == 1: + result.append(current_unit) + elif unit_count > 1: + for i in range(len(result)): + if isinstance(result[-i - 1], CNU) and result[-i - 1].power < current_unit.power: + result[-i - 1] = CNU(result[-i - 1].power + + current_unit.power, None, None, None, None) + return result + + def compute_value(integer_symbols): + """ + Compute the value. + When current unit is larger than previous unit, current unit * all previous units will be used as all previous units. + e.g. '两千万' = 2000 * 10000 not 2000 + 10000 + """ + value = [0] + last_power = 0 + for s in integer_symbols: + if isinstance(s, CND): + value[-1] = s.value + elif isinstance(s, CNU): + value[-1] *= pow(10, s.power) + if s.power > last_power: + value[:-1] = list(map(lambda v: v * + pow(10, s.power), value[:-1])) + last_power = s.power + value.append(0) + return sum(value) + + system = create_system(numbering_type) + int_part, dec_part = string2symbols(chinese_string, system) + int_part = correct_symbols(int_part, system) + int_str = str(compute_value(int_part)) + dec_str = ''.join([str(d.value) for d in dec_part]) + if dec_part: + return '{0}.{1}'.format(int_str, dec_str) + else: + return int_str + + +def num2chn(number_string, numbering_type=NUMBERING_TYPES[1], big=False, + traditional=False, alt_zero=False, alt_one=False, alt_two=True, + use_zeros=True, use_units=True): + + def get_value(value_string, use_zeros=True): + + striped_string = value_string.lstrip('0') + + # record nothing if all zeros + if not striped_string: + return [] + + # record one digits + elif len(striped_string) == 1: + if use_zeros and len(value_string) != len(striped_string): + return [system.digits[0], system.digits[int(striped_string)]] + else: + return [system.digits[int(striped_string)]] + + # recursively record multiple digits + else: + result_unit = next(u for u in reversed( + system.units) if u.power < len(striped_string)) + result_string = value_string[:-result_unit.power] + return get_value(result_string) + [result_unit] + get_value(striped_string[-result_unit.power:]) + + system = create_system(numbering_type) + + int_dec = number_string.split('.') + if len(int_dec) == 1: + int_string = int_dec[0] + dec_string = "" + elif len(int_dec) == 2: + int_string = int_dec[0] + dec_string = int_dec[1] + else: + raise ValueError( + "invalid input num string with more than one dot: {}".format(number_string)) + + if use_units and len(int_string) > 1: + result_symbols = get_value(int_string) + else: + result_symbols = [system.digits[int(c)] for c in int_string] + dec_symbols = [system.digits[int(c)] for c in dec_string] + if dec_string: + result_symbols += [system.math.point] + dec_symbols + + if alt_two: + liang = CND(2, system.digits[2].alt_s, system.digits[2].alt_t, + system.digits[2].big_s, system.digits[2].big_t) + for i, v in enumerate(result_symbols): + if isinstance(v, CND) and v.value == 2: + next_symbol = result_symbols[i + + 1] if i < len(result_symbols) - 1 else None + previous_symbol = result_symbols[i - 1] if i > 0 else None + if isinstance(next_symbol, CNU) and isinstance(previous_symbol, (CNU, type(None))): + if next_symbol.power != 1 and ((previous_symbol is None) or (previous_symbol.power != 1)): + result_symbols[i] = liang + + # if big is True, '两' will not be used and `alt_two` has no impact on output + if big: + attr_name = 'big_' + if traditional: + attr_name += 't' + else: + attr_name += 's' + else: + if traditional: + attr_name = 'traditional' + else: + attr_name = 'simplified' + + result = ''.join([getattr(s, attr_name) for s in result_symbols]) + + # if not use_zeros: + # result = result.strip(getattr(system.digits[0], attr_name)) + + if alt_zero: + result = result.replace( + getattr(system.digits[0], attr_name), system.digits[0].alt_s) + + if alt_one: + result = result.replace( + getattr(system.digits[1], attr_name), system.digits[1].alt_s) + + for i, p in enumerate(POINT): + if result.startswith(p): + return CHINESE_DIGIS[0] + result + + # ^10, 11, .., 19 + if len(result) >= 2 and result[1] in [SMALLER_CHINESE_NUMERING_UNITS_SIMPLIFIED[0], + SMALLER_CHINESE_NUMERING_UNITS_TRADITIONAL[0]] and \ + result[0] in [CHINESE_DIGIS[1], BIG_CHINESE_DIGIS_SIMPLIFIED[1], BIG_CHINESE_DIGIS_TRADITIONAL[1]]: + result = result[1:] + + return result + + +# ================================================================================ # +# different types of rewriters +# ================================================================================ # +class Cardinal: + """ + CARDINAL类 + """ + + def __init__(self, cardinal=None, chntext=None): + self.cardinal = cardinal + self.chntext = chntext + + def chntext2cardinal(self): + return chn2num(self.chntext) + + def cardinal2chntext(self): + return num2chn(self.cardinal) + +class Digit: + """ + DIGIT类 + """ + + def __init__(self, digit=None, chntext=None): + self.digit = digit + self.chntext = chntext + + # def chntext2digit(self): + # return chn2num(self.chntext) + + def digit2chntext(self): + return num2chn(self.digit, alt_two=False, use_units=False) + + +class TelePhone: + """ + TELEPHONE类 + """ + + def __init__(self, telephone=None, raw_chntext=None, chntext=None): + self.telephone = telephone + self.raw_chntext = raw_chntext + self.chntext = chntext + + # def chntext2telephone(self): + # sil_parts = self.raw_chntext.split('') + # self.telephone = '-'.join([ + # str(chn2num(p)) for p in sil_parts + # ]) + # return self.telephone + + def telephone2chntext(self, fixed=False): + + if fixed: + sil_parts = self.telephone.split('-') + self.raw_chntext = ''.join([ + num2chn(part, alt_two=False, use_units=False) for part in sil_parts + ]) + self.chntext = self.raw_chntext.replace('', '') + else: + sp_parts = self.telephone.strip('+').split() + self.raw_chntext = ''.join([ + num2chn(part, alt_two=False, use_units=False) for part in sp_parts + ]) + self.chntext = self.raw_chntext.replace('', '') + return self.chntext + + +class Fraction: + """ + FRACTION类 + """ + + def __init__(self, fraction=None, chntext=None): + self.fraction = fraction + self.chntext = chntext + + def chntext2fraction(self): + denominator, numerator = self.chntext.split('分之') + return chn2num(numerator) + '/' + chn2num(denominator) + + def fraction2chntext(self): + numerator, denominator = self.fraction.split('/') + return num2chn(denominator) + '分之' + num2chn(numerator) + + +class Date: + """ + DATE类 + """ + + def __init__(self, date=None, chntext=None): + self.date = date + self.chntext = chntext + + # def chntext2date(self): + # chntext = self.chntext + # try: + # year, other = chntext.strip().split('年', maxsplit=1) + # year = Digit(chntext=year).digit2chntext() + '年' + # except ValueError: + # other = chntext + # year = '' + # if other: + # try: + # month, day = other.strip().split('月', maxsplit=1) + # month = Cardinal(chntext=month).chntext2cardinal() + '月' + # except ValueError: + # day = chntext + # month = '' + # if day: + # day = Cardinal(chntext=day[:-1]).chntext2cardinal() + day[-1] + # else: + # month = '' + # day = '' + # date = year + month + day + # self.date = date + # return self.date + + def date2chntext(self): + date = self.date + try: + year, other = date.strip().split('年', 1) + year = Digit(digit=year).digit2chntext() + '年' + except ValueError: + other = date + year = '' + if other: + try: + month, day = other.strip().split('月', 1) + month = Cardinal(cardinal=month).cardinal2chntext() + '月' + except ValueError: + day = date + month = '' + if day: + day = Cardinal(cardinal=day[:-1]).cardinal2chntext() + day[-1] + else: + month = '' + day = '' + chntext = year + month + day + self.chntext = chntext + return self.chntext + + +class Money: + """ + MONEY类 + """ + + def __init__(self, money=None, chntext=None): + self.money = money + self.chntext = chntext + + # def chntext2money(self): + # return self.money + + def money2chntext(self): + money = self.money + pattern = re.compile(r'(\d+(\.\d+)?)') + matchers = pattern.findall(money) + if matchers: + for matcher in matchers: + money = money.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext()) + self.chntext = money + return self.chntext + + +class Percentage: + """ + PERCENTAGE类 + """ + + def __init__(self, percentage=None, chntext=None): + self.percentage = percentage + self.chntext = chntext + + def chntext2percentage(self): + return chn2num(self.chntext.strip().strip('百分之')) + '%' + + def percentage2chntext(self): + return '百分之' + num2chn(self.percentage.strip().strip('%')) + + +# ================================================================================ # +# NSW Normalizer +# ================================================================================ # +class NSWNormalizer: + def __init__(self, raw_text): + self.raw_text = '^' + raw_text + '$' + self.norm_text = '' + + def _particular(self): + text = self.norm_text + pattern = re.compile(r"(([a-zA-Z]+)二([a-zA-Z]+))") + matchers = pattern.findall(text) + if matchers: + # print('particular') + for matcher in matchers: + text = text.replace(matcher[0], matcher[1]+'2'+matcher[2], 1) + self.norm_text = text + return self.norm_text + + def normalize(self): + text = self.raw_text + + # 规范化日期 + pattern = re.compile(r"\D+((([089]\d|(19|20)\d{2})年)?(\d{1,2}月(\d{1,2}[日号])?)?)") + matchers = pattern.findall(text) + if matchers: + #print('date') + for matcher in matchers: + text = text.replace(matcher[0], Date(date=matcher[0]).date2chntext(), 1) + + # 规范化金钱 + pattern = re.compile(r"\D+((\d+(\.\d+)?)[多余几]?" + CURRENCY_UNITS + r"(\d" + CURRENCY_UNITS + r"?)?)") + matchers = pattern.findall(text) + if matchers: + #print('money') + for matcher in matchers: + text = text.replace(matcher[0], Money(money=matcher[0]).money2chntext(), 1) + + # 规范化固话/手机号码 + # 手机 + # http://www.jihaoba.com/news/show/13680 + # 移动:139、138、137、136、135、134、159、158、157、150、151、152、188、187、182、183、184、178、198 + # 联通:130、131、132、156、155、186、185、176 + # 电信:133、153、189、180、181、177 + pattern = re.compile(r"\D((\+?86 ?)?1([38]\d|5[0-35-9]|7[678]|9[89])\d{8})\D") + matchers = pattern.findall(text) + if matchers: + #print('telephone') + for matcher in matchers: + text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(), 1) + # 固话 + pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D") + matchers = pattern.findall(text) + if matchers: + # print('fixed telephone') + for matcher in matchers: + text = text.replace(matcher[0], TelePhone(telephone=matcher[0]).telephone2chntext(fixed=True), 1) + + # 规范化分数 + pattern = re.compile(r"(\d+/\d+)") + matchers = pattern.findall(text) + if matchers: + #print('fraction') + for matcher in matchers: + text = text.replace(matcher, Fraction(fraction=matcher).fraction2chntext(), 1) + + # 规范化百分数 + text = text.replace('%', '%') + pattern = re.compile(r"(\d+(\.\d+)?%)") + matchers = pattern.findall(text) + if matchers: + #print('percentage') + for matcher in matchers: + text = text.replace(matcher[0], Percentage(percentage=matcher[0]).percentage2chntext(), 1) + + # 规范化纯数+量词 + pattern = re.compile(r"(\d+(\.\d+)?)[多余几]?" + COM_QUANTIFIERS) + matchers = pattern.findall(text) + if matchers: + #print('cardinal+quantifier') + for matcher in matchers: + text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1) + + # 规范化数字编号 + pattern = re.compile(r"(\d{4,32})") + matchers = pattern.findall(text) + if matchers: + #print('digit') + for matcher in matchers: + text = text.replace(matcher, Digit(digit=matcher).digit2chntext(), 1) + + # 规范化纯数 + pattern = re.compile(r"(\d+(\.\d+)?)") + matchers = pattern.findall(text) + if matchers: + #print('cardinal') + for matcher in matchers: + text = text.replace(matcher[0], Cardinal(cardinal=matcher[0]).cardinal2chntext(), 1) + + self.norm_text = text + self._particular() + + return self.norm_text.lstrip('^').rstrip('$') + + +def nsw_test_case(raw_text): + print('I:' + raw_text) + print('O:' + NSWNormalizer(raw_text).normalize()) + print('') + + +def nsw_test(): + nsw_test_case('固话:0595-23865596或23880880。') + nsw_test_case('固话:0595-23865596或23880880。') + nsw_test_case('手机:+86 19859213959或15659451527。') + nsw_test_case('分数:32477/76391。') + nsw_test_case('百分数:80.03%。') + nsw_test_case('编号:31520181154418。') + nsw_test_case('纯数:2983.07克或12345.60米。') + nsw_test_case('日期:1999年2月20日或09年3月15号。') + nsw_test_case('金钱:12块5,34.5元,20.1万') + nsw_test_case('特殊:O2O或B2C。') + nsw_test_case('3456万吨') + nsw_test_case('2938个') + nsw_test_case('938') + nsw_test_case('今天吃了115个小笼包231个馒头') + nsw_test_case('有62%的概率') + + +if __name__ == '__main__': + #nsw_test() + + p = argparse.ArgumentParser() + p.add_argument('ifile', help='input filename, assume utf-8 encoding') + p.add_argument('ofile', help='output filename') + p.add_argument('--to_upper', action='store_true', help='convert to upper case') + p.add_argument('--to_lower', action='store_true', help='convert to lower case') + p.add_argument('--has_key', action='store_true', help="input text has Kaldi's key as first field.") + p.add_argument('--log_interval', type=int, default=100000, help='log interval in number of processed lines') + args = p.parse_args() + + ifile = codecs.open(args.ifile, 'r', 'utf8') + ofile = codecs.open(args.ofile, 'w+', 'utf8') + + n = 0 + for l in ifile: + key = '' + text = '' + if args.has_key: + cols = l.split(maxsplit=1) + key = cols[0] + if len(cols) == 2: + text = cols[1].strip() + else: + text = '' + else: + text = l.strip() + + # cases + if args.to_upper and args.to_lower: + sys.stderr.write('cn_tn.py: to_upper OR to_lower?') + exit(1) + if args.to_upper: + text = text.upper() + if args.to_lower: + text = text.lower() + + # NSW(Non-Standard-Word) normalization + text = NSWNormalizer(text).normalize() + + # Punctuations removal + old_chars = CHINESE_PUNC_LIST + string.punctuation # includes all CN and EN punctuations + new_chars = ' ' * len(old_chars) + del_chars = '' + text = text.translate(str.maketrans(old_chars, new_chars, del_chars)) + + # + if args.has_key: + ofile.write(key + '\t' + text + '\n') + else: + if text.strip() != '': # skip empty line in pure text format(without Kaldi's utt key) + ofile.write(text + '\n') + + n += 1 + if n % args.log_interval == 0: + sys.stderr.write("cn_tn.py: {} lines done.\n".format(n)) + sys.stderr.flush() + + sys.stderr.write("cn_tn.py: {} lines done in total.\n".format(n)) + sys.stderr.flush() + + ifile.close() + ofile.close() diff --git a/third_party/chinese_text_normalization/python/example_kaldi.txt b/third_party/chinese_text_normalization/python/example_kaldi.txt new file mode 100644 index 000000000..07af5674b --- /dev/null +++ b/third_party/chinese_text_normalization/python/example_kaldi.txt @@ -0,0 +1,7 @@ +UTT000 这块黄金重达324.75克 +UTT001 她出生于86年8月18日,她弟弟出生于1995年3月1日 +UTT002 电影中梁朝伟扮演的陈永仁的编号27149 +UTT003 现场有7/12的观众投出了赞成票 +UTT004 随便来几个价格12块5,34.5元,20.1万 +UTT005 明天有62%的概率降雨 +UTT006 这是固话0421-33441122或这是手机+86 18544139121 diff --git a/third_party/chinese_text_normalization/python/example_plain.txt b/third_party/chinese_text_normalization/python/example_plain.txt new file mode 100644 index 000000000..14e5a09fe --- /dev/null +++ b/third_party/chinese_text_normalization/python/example_plain.txt @@ -0,0 +1,7 @@ +这块黄金重达324.75克 +她出生于86年8月18日,她弟弟出生于1995年3月1日 +电影中梁朝伟扮演的陈永仁的编号27149 +现场有7/12的观众投出了赞成票 +随便来几个价格12块5,34.5元,20.1万 +明天有62%的概率降雨 +这是固话0421-33441122或这是手机+86 18544139121 diff --git a/third_party/chinese_text_normalization/python/run.sh b/third_party/chinese_text_normalization/python/run.sh new file mode 100644 index 000000000..0866d72f0 --- /dev/null +++ b/third_party/chinese_text_normalization/python/run.sh @@ -0,0 +1,8 @@ +# for plain text +python3 cn_tn.py example_plain.txt output_plain.txt +diff example_plain.txt output_plain.txt + +# for Kaldi's trans format +python3 cn_tn.py --has_key example_kaldi.txt output_kaldi.txt +diff example_kaldi.txt output_kaldi.txt + diff --git a/third_party/chinese_text_normalization/thrax/INSTALL.txt b/third_party/chinese_text_normalization/thrax/INSTALL.txt new file mode 100644 index 000000000..dcbd58c50 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/INSTALL.txt @@ -0,0 +1,24 @@ +0. place install_thrax.sh into $KALDI/tools/extras/ + +1. recompile openfst with necessary option "--enable-grm" to support thrax: +* cd $KALDI_ROOT/tools +* make clean +* edit $KALDI_ROOT/tools/Makefile, append "--enable-grm" option to OPENFST_CONFIGURE: +OPENFST_CONFIGURE ?= --enable-static --enable-shared --enable-far --enable-ngram-fsts --enable-lookahead-fsts --with-pic --enable-grm +* make -j 10 + +2. install thrax +cd $KALDI_ROOT/tools +sh extras/install_thrax.sh + +3. add thrax binary path into $KALDI_ROOT/tools/env.sh: +export PATH=/path/to/your/kaldi_root/tools/thrax-1.2.9/src/bin:${PATH} + +usage: +before you run anything related to thrax, use: +. $KALDI_ROOT/tools/env.sh +to enable binary finding, like what we always do in kaldi. + +sample usage: +sh run_en.sh +sh run_cn.sh diff --git a/third_party/chinese_text_normalization/thrax/install_thrax.sh b/third_party/chinese_text_normalization/thrax/install_thrax.sh new file mode 100755 index 000000000..20d2757b9 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/install_thrax.sh @@ -0,0 +1,12 @@ +#!/bin/bash +## This script should be placed under $KALDI_ROOT/tools/extras/, and see INSTALL.txt for installation guide +if [ ! -f thrax-1.2.9.tar.gz ]; then + wget http://www.openfst.org/twiki/pub/GRM/ThraxDownload/thrax-1.2.9.tar.gz + tar -zxf thrax-1.2.9.tar.gz +fi +cd thrax-1.2.9 +OPENFSTPREFIX=`pwd`/../openfst +LDFLAGS="-L${OPENFSTPREFIX}/lib" CXXFLAGS="-I${OPENFSTPREFIX}/include" ./configure --prefix ${OPENFSTPREFIX} +make -j 10; make install +cd .. + diff --git a/third_party/chinese_text_normalization/thrax/papers/gorman-sproat-2016.pdf b/third_party/chinese_text_normalization/thrax/papers/gorman-sproat-2016.pdf new file mode 100644 index 000000000..14a438c7f Binary files /dev/null and b/third_party/chinese_text_normalization/thrax/papers/gorman-sproat-2016.pdf differ diff --git a/third_party/chinese_text_normalization/thrax/papers/wu-etal-2016.pdf b/third_party/chinese_text_normalization/thrax/papers/wu-etal-2016.pdf new file mode 100644 index 000000000..c7d1068fe Binary files /dev/null and b/third_party/chinese_text_normalization/thrax/papers/wu-etal-2016.pdf differ diff --git a/third_party/chinese_text_normalization/thrax/run_cn.sh b/third_party/chinese_text_normalization/thrax/run_cn.sh new file mode 100644 index 000000000..81bb2893e --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/run_cn.sh @@ -0,0 +1,6 @@ +cd src/cn +thraxmakedep itn.grm +make +#thraxrewrite-tester --far=itn.far --rules=ITN +cat ../../testcase_cn.txt | thraxrewrite-tester --far=itn.far --rules=ITN +cd - diff --git a/third_party/chinese_text_normalization/thrax/run_en.sh b/third_party/chinese_text_normalization/thrax/run_en.sh new file mode 100644 index 000000000..f8526487d --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/run_en.sh @@ -0,0 +1,6 @@ +cd src +thraxmakedep en/verbalizer/podspeech.grm +make +cat ../testcase_en.txt +cat ../testcase_en.txt | thraxrewrite-tester --far=en/verbalizer/podspeech.far --rules=POD_SPEECH_TN +cd - diff --git a/third_party/chinese_text_normalization/thrax/src/LICENSE b/third_party/chinese_text_normalization/thrax/src/LICENSE new file mode 100644 index 000000000..d64569567 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/LICENSE @@ -0,0 +1,202 @@ + + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/third_party/chinese_text_normalization/thrax/src/Makefile b/third_party/chinese_text_normalization/thrax/src/Makefile new file mode 100644 index 000000000..6937ab5f7 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/Makefile @@ -0,0 +1,65 @@ +en/verbalizer/podspeech.far: en/verbalizer/podspeech.grm util/util.far util/case.far en/verbalizer/extra_numbers.far en/verbalizer/float.far en/verbalizer/math.far en/verbalizer/miscellaneous.far en/verbalizer/money.far en/verbalizer/numbers.far en/verbalizer/numbers_plus.far en/verbalizer/spelled.far en/verbalizer/spoken_punct.far en/verbalizer/time.far en/verbalizer/urls.far + thraxcompiler --input_grammar=$< --output_far=$@ + +util/util.far: util/util.grm util/byte.far util/case.far + thraxcompiler --input_grammar=$< --output_far=$@ + +util/byte.far: util/byte.grm + thraxcompiler --input_grammar=$< --output_far=$@ + +util/case.far: util/case.grm util/byte.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/extra_numbers.far: en/verbalizer/extra_numbers.grm util/byte.far en/verbalizer/numbers.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/numbers.far: en/verbalizer/numbers.grm en/verbalizer/number_names.far util/byte.far universal/thousands_punct.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/number_names.far: en/verbalizer/number_names.grm util/arithmetic.far en/verbalizer/g.fst en/verbalizer/cardinals.tsv en/verbalizer/ordinals.tsv + thraxcompiler --input_grammar=$< --output_far=$@ + +util/arithmetic.far: util/arithmetic.grm util/byte.far util/germanic.tsv + thraxcompiler --input_grammar=$< --output_far=$@ + +universal/thousands_punct.far: universal/thousands_punct.grm util/byte.far util/util.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/float.far: en/verbalizer/float.grm en/verbalizer/factorization.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/factorization.far: en/verbalizer/factorization.grm util/byte.far util/util.far en/verbalizer/numbers.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/lexical_map.far: en/verbalizer/lexical_map.grm util/byte.far en/verbalizer/lexical_map.tsv + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/math.far: en/verbalizer/math.grm en/verbalizer/float.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/miscellaneous.far: en/verbalizer/miscellaneous.grm util/byte.far ru/classifier/cyrillic.far en/verbalizer/extra_numbers.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far en/verbalizer/spelled.far + thraxcompiler --input_grammar=$< --output_far=$@ + +ru/classifier/cyrillic.far: ru/classifier/cyrillic.grm + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/spelled.far: en/verbalizer/spelled.grm util/byte.far ru/classifier/cyrillic.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/money.far: en/verbalizer/money.grm util/byte.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far en/verbalizer/money.tsv + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/numbers_plus.far: en/verbalizer/numbers_plus.grm en/verbalizer/factorization.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/spoken_punct.far: en/verbalizer/spoken_punct.grm en/verbalizer/lexical_map.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/time.far: en/verbalizer/time.grm util/byte.far en/verbalizer/lexical_map.far en/verbalizer/numbers.far + thraxcompiler --input_grammar=$< --output_far=$@ + +en/verbalizer/urls.far: en/verbalizer/urls.grm util/byte.far en/verbalizer/lexical_map.far + thraxcompiler --input_grammar=$< --output_far=$@ + +clean: + rm -f util/util.far util/case.far en/verbalizer/extra_numbers.far en/verbalizer/float.far en/verbalizer/math.far en/verbalizer/miscellaneous.far en/verbalizer/money.far en/verbalizer/numbers.far en/verbalizer/numbers_plus.far en/verbalizer/spelled.far en/verbalizer/spoken_punct.far en/verbalizer/time.far en/verbalizer/urls.far util/byte.far en/verbalizer/number_names.far universal/thousands_punct.far util/arithmetic.far en/verbalizer/factorization.far en/verbalizer/lexical_map.far ru/classifier/cyrillic.far diff --git a/third_party/chinese_text_normalization/thrax/src/README.md b/third_party/chinese_text_normalization/thrax/src/README.md new file mode 100644 index 000000000..a7b2b0242 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/README.md @@ -0,0 +1,24 @@ +# Text normalization covering grammars + +This repository provides covering grammars for English and Russian text normalization as +documented in: + + Gorman, K., and Sproat, R. 2016. Minimally supervised number normalization. + _Transactions of the Association for Computational Linguistics_ 4: 507-519. + + Ng, A. H., Gorman, K., and Sproat, R. 2017. Minimally supervised + written-to-spoken text normalization. In _ASRU_, pages 665-670. + +If you use these grammars in a publication, we would appreciate if you cite these works. + +## Building + +The grammars are written in [Thrax](thrax.opengrm.org) and compile into [OpenFst](openfst.org) FAR (FstARchive) files. To compile, simply run `make` in the `src/` directory. + +## License + +See `LICENSE`. + +## Mandatory disclaimer + +This is not an official Google product. diff --git a/third_party/chinese_text_normalization/thrax/src/cn/Makefile b/third_party/chinese_text_normalization/thrax/src/cn/Makefile new file mode 100644 index 000000000..2ff2d74ae --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/cn/Makefile @@ -0,0 +1,23 @@ +itn.far: itn.grm byte.far number.far hotfix.far percentage.far date.far amount.far + thraxcompiler --input_grammar=$< --output_far=$@ + +byte.far: byte.grm + thraxcompiler --input_grammar=$< --output_far=$@ + +number.far: number.grm byte.far + thraxcompiler --input_grammar=$< --output_far=$@ + +hotfix.far: hotfix.grm byte.far hotfix.list + thraxcompiler --input_grammar=$< --output_far=$@ + +percentage.far: percentage.grm byte.far number.far + thraxcompiler --input_grammar=$< --output_far=$@ + +date.far: date.grm byte.far number.far + thraxcompiler --input_grammar=$< --output_far=$@ + +amount.far: amount.grm byte.far number.far + thraxcompiler --input_grammar=$< --output_far=$@ + +clean: + rm -f byte.far number.far hotfix.far percentage.far date.far amount.far diff --git a/third_party/chinese_text_normalization/thrax/src/cn/amount.grm b/third_party/chinese_text_normalization/thrax/src/cn/amount.grm new file mode 100644 index 000000000..a83b3bee2 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/cn/amount.grm @@ -0,0 +1,24 @@ +import 'byte.grm' as b; +import 'number.grm' as n; + +unit = ( + "匹"|"张"|"座"|"回"|"场"|"尾"|"条"|"个"|"首"|"阙"|"阵"|"网"|"炮"| + "顶"|"丘"|"棵"|"只"|"支"|"袭"|"辆"|"挑"|"担"|"颗"|"壳"|"窠"|"曲"| + "墙"|"群"|"腔"|"砣"|"座"|"客"|"贯"|"扎"|"捆"|"刀"|"令"|"打"|"手"| + "罗"|"坡"|"山"|"岭"|"江"|"溪"|"钟"|"队"|"单"|"双"|"对"|"出"|"口"| + "头"|"脚"|"板"|"跳"|"枝"|"件"|"贴"|"针"|"线"|"管"|"名"|"位"|"身"| + "堂"|"课"|"本"|"页"|"家"|"户"|"层"|"丝"|"毫"|"厘"|"分"|"钱"|"两"| + "斤"|"担"|"铢"|"石"|"钧"|"锱"|"忽"|"毫"|"厘"|"分"|"寸"|"尺"|"丈"| + "里"|"寻"|"常"|"铺"|"程"|"撮"|"勺"|"合"|"升"|"斗"|"石"|"盘"|"碗"| + "碟"|"叠"|"桶"|"笼"|"盆"|"盒"|"杯"|"钟"|"斛"|"锅"|"簋"|"篮"|"盘"| + "桶"|"罐"|"瓶"|"壶"|"卮"|"盏"|"箩"|"箱"|"煲"|"啖"|"袋"|"钵"|"年"| + "月"|"日"|"季"|"刻"|"时"|"周"|"天"|"秒"|"分"|"旬"|"纪"|"岁"|"世"| + "更"|"夜"|"春"|"夏"|"秋"|"冬"|"代"|"伏"|"辈"|"丸"|"泡"|"粒"|"颗"| + "幢"|"堆"|"条"|"根"|"支"|"道"|"面"|"片"|"张"|"颗"|"块"| + (("千克":"kg")|("毫克":"mg")|("微克":"µg"))| + (("千米":"km")|("厘米":"cm")|("毫米":"mm")|("微米":"µm")|("纳米":"nm")) +); + +amount = n.number unit; +export AMOUNT = CDRewrite[amount, "", "", b.kBytes*]; + diff --git a/third_party/chinese_text_normalization/thrax/src/cn/byte.grm b/third_party/chinese_text_normalization/thrax/src/cn/byte.grm new file mode 100644 index 000000000..f23337344 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/cn/byte.grm @@ -0,0 +1,76 @@ +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Copyright 2005-2011 Google, Inc. +# Author: ttai@google.com (Terry Tai) + +# Standard constants for ASCII (byte) based strings. This mirrors the +# functions provided by C/C++'s ctype.h library. + +# Note that [0] is missing. Matching the string-termination character is kinda weird. +export kBytes = Optimize[ + "[1]" | "[2]" | "[3]" | "[4]" | "[5]" | "[6]" | "[7]" | "[8]" | "[9]" | "[10]" | + "[11]" | "[12]" | "[13]" | "[14]" | "[15]" | "[16]" | "[17]" | "[18]" | "[19]" | "[20]" | + "[21]" | "[22]" | "[23]" | "[24]" | "[25]" | "[26]" | "[27]" | "[28]" | "[29]" | "[30]" | + "[31]" | "[32]" | "[33]" | "[34]" | "[35]" | "[36]" | "[37]" | "[38]" | "[39]" | "[40]" | + "[41]" | "[42]" | "[43]" | "[44]" | "[45]" | "[46]" | "[47]" | "[48]" | "[49]" | "[50]" | + "[51]" | "[52]" | "[53]" | "[54]" | "[55]" | "[56]" | "[57]" | "[58]" | "[59]" | "[60]" | + "[61]" | "[62]" | "[63]" | "[64]" | "[65]" | "[66]" | "[67]" | "[68]" | "[69]" | "[70]" | + "[71]" | "[72]" | "[73]" | "[74]" | "[75]" | "[76]" | "[77]" | "[78]" | "[79]" | "[80]" | + "[81]" | "[82]" | "[83]" | "[84]" | "[85]" | "[86]" | "[87]" | "[88]" | "[89]" | "[90]" | + "[91]" | "[92]" | "[93]" | "[94]" | "[95]" | "[96]" | "[97]" | "[98]" | "[99]" | "[100]" | +"[101]" | "[102]" | "[103]" | "[104]" | "[105]" | "[106]" | "[107]" | "[108]" | "[109]" | "[110]" | +"[111]" | "[112]" | "[113]" | "[114]" | "[115]" | "[116]" | "[117]" | "[118]" | "[119]" | "[120]" | +"[121]" | "[122]" | "[123]" | "[124]" | "[125]" | "[126]" | "[127]" | "[128]" | "[129]" | "[130]" | +"[131]" | "[132]" | "[133]" | "[134]" | "[135]" | "[136]" | "[137]" | "[138]" | "[139]" | "[140]" | +"[141]" | "[142]" | "[143]" | "[144]" | "[145]" | "[146]" | "[147]" | "[148]" | "[149]" | "[150]" | +"[151]" | "[152]" | "[153]" | "[154]" | "[155]" | "[156]" | "[157]" | "[158]" | "[159]" | "[160]" | +"[161]" | "[162]" | "[163]" | "[164]" | "[165]" | "[166]" | "[167]" | "[168]" | "[169]" | "[170]" | +"[171]" | "[172]" | "[173]" | "[174]" | "[175]" | "[176]" | "[177]" | "[178]" | "[179]" | "[180]" | +"[181]" | "[182]" | "[183]" | "[184]" | "[185]" | "[186]" | "[187]" | "[188]" | "[189]" | "[190]" | +"[191]" | "[192]" | "[193]" | "[194]" | "[195]" | "[196]" | "[197]" | "[198]" | "[199]" | "[200]" | +"[201]" | "[202]" | "[203]" | "[204]" | "[205]" | "[206]" | "[207]" | "[208]" | "[209]" | "[210]" | +"[211]" | "[212]" | "[213]" | "[214]" | "[215]" | "[216]" | "[217]" | "[218]" | "[219]" | "[220]" | +"[221]" | "[222]" | "[223]" | "[224]" | "[225]" | "[226]" | "[227]" | "[228]" | "[229]" | "[230]" | +"[231]" | "[232]" | "[233]" | "[234]" | "[235]" | "[236]" | "[237]" | "[238]" | "[239]" | "[240]" | +"[241]" | "[242]" | "[243]" | "[244]" | "[245]" | "[246]" | "[247]" | "[248]" | "[249]" | "[250]" | +"[251]" | "[252]" | "[253]" | "[254]" | "[255]" +]; + +export kDigit = Optimize[ + "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" +]; + +export kLower = Optimize[ + "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | + "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" +]; +export kUpper = Optimize[ + "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | + "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" +]; +export kAlpha = Optimize[kLower | kUpper]; + +export kAlnum = Optimize[kDigit | kAlpha]; + +export kSpace = Optimize[ + " " | "\t" | "\n" | "\r" +]; +export kNotSpace = Optimize[kBytes - kSpace]; + +export kPunct = Optimize[ + "!" | "\"" | "#" | "$" | "%" | "&" | "'" | "(" | ")" | "*" | "+" | "," | + "-" | "." | "/" | ":" | ";" | "<" | "=" | ">" | "?" | "@" | "\[" | "\\" | + "\]" | "^" | "_" | "`" | "{" | "|" | "}" | "~" +]; + +export kGraph = Optimize[kAlnum | kPunct]; diff --git a/third_party/chinese_text_normalization/thrax/src/cn/date.grm b/third_party/chinese_text_normalization/thrax/src/cn/date.grm new file mode 100644 index 000000000..546937383 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/cn/date.grm @@ -0,0 +1,10 @@ +import 'byte.grm' as b; +import 'number.grm' as n; + +date_day = n.number_1_to_99 ("日"|"号"); +date_month_day = n.number_1_to_99 "月" date_day; +date_year_month_day = ((n.number_0_to_9){2,4} | n.number) "年" date_month_day; + +date = date_year_month_day | date_month_day | date_day; + +export DATE = CDRewrite[date, "", "", b.kBytes*]; diff --git a/third_party/chinese_text_normalization/thrax/src/cn/hotfix.grm b/third_party/chinese_text_normalization/thrax/src/cn/hotfix.grm new file mode 100644 index 000000000..f1a43cdf2 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/cn/hotfix.grm @@ -0,0 +1,5 @@ +import 'byte.grm' as b; +hotfix = StringFile['hotfix.list']; + +export HOTFIX = CDRewrite[hotfix, "", "", b.kBytes*]; + diff --git a/third_party/chinese_text_normalization/thrax/src/cn/hotfix.list b/third_party/chinese_text_normalization/thrax/src/cn/hotfix.list new file mode 100644 index 000000000..7234996e9 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/cn/hotfix.list @@ -0,0 +1,18 @@ +0头 零头 +10字 十字 +东4环 东4环 -1.0 +东4 东四 -0.5 +4惠 四惠 +3元桥 三元桥 +4平市 四平市 +5台山 五台山 +西2旗 西二旗 +西3旗 西三旗 +4道口 四道口 -1.0 +5道口 五道口 -1.0 +6道口 六道口 -1.0 +6里桥 六里桥 +7里庄 七里庄 +8宝山 八宝山 +9颗松 九棵松 +10里堡 十里堡 diff --git a/third_party/chinese_text_normalization/thrax/src/cn/itn.grm b/third_party/chinese_text_normalization/thrax/src/cn/itn.grm new file mode 100644 index 000000000..709ce6c66 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/cn/itn.grm @@ -0,0 +1,9 @@ +import 'byte.grm' as b; +import 'number.grm' as number; +import 'hotfix.grm' as hotfix; +import 'percentage.grm' as percentage; +import 'date.grm' as date; +import 'amount.grm' as amount; # seems not useful for now + +export ITN = Optimize[percentage.PERCENTAGE @ (date.DATE <-1>) @ number.NUMBER @ hotfix.HOTFIX]; + diff --git a/third_party/chinese_text_normalization/thrax/src/cn/number.grm b/third_party/chinese_text_normalization/thrax/src/cn/number.grm new file mode 100644 index 000000000..1e9a86545 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/cn/number.grm @@ -0,0 +1,61 @@ +import 'byte.grm' as b; + +number_1_to_9 = ( + ("一":"1") | ("幺":"1") | + ("二":"2") | ("两":"2") | + ("三":"3") | + ("四":"4") | + ("五":"5") | + ("六":"6") | + ("七":"7") | + ("八":"8") | + ("九":"9") +); + +export number_0_to_9 = (("零":"0") | number_1_to_9); + +number_10_to_19 = ( + ("十":"10") | + ("十一":"11") | + ("十二":"12") | + ("十三":"13") | + ("十四":"14") | + ("十五":"15") | + ("十六":"16") | + ("十七":"17") | + ("十八":"18") | + ("十九":"19") +); + +number_10s = (number_1_to_9 ("十":"")); +number_100s = (number_1_to_9 ("百":"")); +number_1000s = (number_1_to_9 ("千":"")); +number_10000s = (number_1_to_9 ("万":"")); + +number_10_to_99 = ( + ((number_10s number_1_to_9)<-0.3>) | + ((number_10s ("":"0"))<-0.2>) | + (number_10_to_19 <-0.1>) +); + +export number_1_to_99 = (number_1_to_9 | number_10_to_99); + +number_100_to_999 = ( + ((number_100s ("零":"0") number_1_to_9)<0.0>)| + ((number_100s number_10_to_99)<0.0>) | + ((number_100s number_1_to_9 ("":"0"))<0.0>) | + ((number_100s ("":"00"))<0.1>) +); + +number_1000_to_9999 = ( + ((number_1000s number_100_to_999)<0.0>) | + ((number_1000s ("零":"0") number_10_to_99)<0.0>)| + ((number_1000s ("零":"00") number_1_to_9)<0.0>)| + ((number_1000s ("":"000"))<1>) | + ((number_1000s number_1_to_9 ("":"00"))<0.0>) +); + +export number = number_1_to_99 | (number_100_to_999 <-1>) | (number_1000_to_9999 <-2>); + +export NUMBER = CDRewrite[number, "", "", b.kBytes*]; + diff --git a/third_party/chinese_text_normalization/thrax/src/cn/percentage.grm b/third_party/chinese_text_normalization/thrax/src/cn/percentage.grm new file mode 100644 index 000000000..d9f92a36e --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/cn/percentage.grm @@ -0,0 +1,8 @@ +import 'byte.grm' as b; +import 'number.grm' as n; + +percentage = ( + ("百分之":"") n.number_1_to_99 ("":"%") +); + +export PERCENTAGE = CDRewrite[percentage, "", "", b.kBytes*]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/README.md b/third_party/chinese_text_normalization/thrax/src/en/README.md new file mode 100644 index 000000000..8157e807c --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/README.md @@ -0,0 +1,6 @@ +# English covering grammar definitions + +This directory defines a English text normalization covering grammar. The +primary entry-point is the FST `VERBALIZER`, defined in +`verbalizer/verbalizer.grm` and compiled in the FST archive +`verbalizer/verbalizer.far`. diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/Makefile b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/Makefile new file mode 100644 index 000000000..6318dc546 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/Makefile @@ -0,0 +1,3 @@ +verbalizer.far: verbalizer.grm util/util.far en/verbalizer/extra_numbers.far en/verbalizer/float.far en/verbalizer/math.far en/verbalizer/miscellaneous.far en/verbalizer/money.far en/verbalizer/numbers.far en/verbalizer/numbers_plus.far en/verbalizer/spelled.far en/verbalizer/spoken_punct.far en/verbalizer/time.far en/verbalizer/urls.far + thraxcompiler --input_grammar=$< --output_far=$@ + diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/cardinals.tsv b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/cardinals.tsv new file mode 100644 index 000000000..b4704ff3e --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/cardinals.tsv @@ -0,0 +1,32 @@ +0 zero +1 one +2 two +3 three +4 four +5 five +6 six +7 seven +8 eight +9 nine +10 ten +11 eleven +12 twelve +13 thirteen +14 fourteen +15 fifteen +16 sixteen +17 seventeen +18 eighteen +19 nineteen +20 twenty +30 thirty +40 forty +50 fifty +60 sixty +70 seventy +80 eighty +90 ninety +100 hundred +1000 thousand +1000000 million +1000000000 billion diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/extra_numbers.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/extra_numbers.grm new file mode 100644 index 000000000..a1fb370c4 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/extra_numbers.grm @@ -0,0 +1,35 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'en/verbalizer/numbers.grm' as n; + +digit = b.kDigit @ n.CARDINAL_NUMBERS | ("0" : "@@OTHER_ZERO_VERBALIZATIONS@@"); + +export DIGITS = digit (n.I[" "] digit)*; + +# Various common factorizations + +two_digits = b.kDigit{2} @ n.CARDINAL_NUMBERS; + +three_digits = b.kDigit{3} @ n.CARDINAL_NUMBERS; + +mixed = + (digit n.I[" "] two_digits) + | (two_digits n.I[" "] two_digits) + | (two_digits n.I[" "] three_digits) + | (two_digits n.I[" "] two_digits n.I[" "] two_digits) +; + +export MIXED_NUMBERS = Optimize[mixed]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/factorization.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/factorization.grm new file mode 100644 index 000000000..22ecfa9f4 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/factorization.grm @@ -0,0 +1,40 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'util/util.grm' as u; +import 'en/verbalizer/numbers.grm' as n; + +func ToNumberName[expr] { + number_name_seq = n.CARDINAL_NUMBERS (" " n.CARDINAL_NUMBERS)*; + return Optimize[expr @ number_name_seq]; +} + +d = b.kDigit; + +leading_zero = CDRewrite[n.I[" "], ("[BOS]" | " ") "0", "", b.kBytes*]; + +by_ones = d n.I[" "]; +by_twos = (d{2} @ leading_zero) n.I[" "]; +by_threes = (d{3} @ leading_zero) n.I[" "]; + +groupings = by_twos* (by_threes | by_twos | by_ones); + +export FRACTIONAL_PART_UNGROUPED = + Optimize[ToNumberName[by_ones+ @ u.CLEAN_SPACES]] +; +export FRACTIONAL_PART_GROUPED = + Optimize[ToNumberName[groupings @ u.CLEAN_SPACES]] +; +export FRACTIONAL_PART_UNPARSED = Optimize[ToNumberName[d*]]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/float.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/float.grm new file mode 100644 index 000000000..00b7ea376 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/float.grm @@ -0,0 +1,30 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'en/verbalizer/factorization.grm' as f; +import 'en/verbalizer/lexical_map.grm' as l; +import 'en/verbalizer/numbers.grm' as n; + +fractional_part_ungrouped = f.FRACTIONAL_PART_UNGROUPED; +fractional_part_grouped = f.FRACTIONAL_PART_GROUPED; +fractional_part_unparsed = f.FRACTIONAL_PART_UNPARSED; + +__fractional_part__ = fractional_part_ungrouped | fractional_part_unparsed; +__decimal_marker__ = "."; + +export FLOAT = Optimize[ + (n.CARDINAL_NUMBERS + (__decimal_marker__ : " @@DECIMAL_DOT_EXPRESSION@@ ") + __fractional_part__) @ l.LEXICAL_MAP] +; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/g.fst b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/g.fst new file mode 100644 index 000000000..135da015c Binary files /dev/null and b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/g.fst differ diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/lexical_map.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/lexical_map.grm new file mode 100644 index 000000000..a9b4ea490 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/lexical_map.grm @@ -0,0 +1,25 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; + +lexical_map = StringFile['en/verbalizer/lexical_map.tsv']; + +sigma_star = b.kBytes*; + +del_null = CDRewrite["__NULL__" : "", "", "", sigma_star]; + +export LEXICAL_MAP = Optimize[ + CDRewrite[lexical_map, "", "", sigma_star] @ del_null] +; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/lexical_map.tsv b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/lexical_map.tsv new file mode 100644 index 000000000..1e17034d8 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/lexical_map.tsv @@ -0,0 +1,74 @@ +@@CONNECTOR_RANGE@@ to +@@CONNECTOR_RATIO@@ to +@@CONNECTOR_BY@@ by +@@CONNECTOR_CONSECUTIVE_YEAR@@ to +@@JANUARY@@ january +@@FEBRUARY@@ february +@@MARCH@@ march +@@APRIL@@ april +@@MAY@@ may +@@JUNE@@ june +@@JULY@@ july +@@AUGUST@@ august +@@SEPTEMBER@@ september +@@OCTOBER@@ october +@@NOVEMBER@@ november +@@DECEMBER@@ december +@@MINUS@@ minus +@@DECIMAL_DOT_EXPRESSION@@ point +@@URL_DOT_EXPRESSION@@ dot +@@DECIMAL_EXPONENT@@ to the +@@DECIMAL_EXPONENT@@ to the power of +@@COLON@@ colon +@@SLASH@@ slash +@@SLASH@@ forward slash +@@DASH@@ dash +@@PASSWORD@@ password +@@AT@@ at +@@PORT@@ port +@@QUESTION_MARK@@ question mark +@@HASH@@ hash +@@HASH@@ hash tag +@@FRACTION_OVER@@ over +@@MONEY_AND@@ and +@@AND@@ and +@@PHONE_PLUS@@ plus +@@PHONE_EXTENSION@@ extension +@@TIME_AM@@ a m +@@TIME_PM@@ p m +@@HOUR@@ o'clock +@@MINUTE@@ minute +@@MINUTE@@ minutes +@@TIME_AFTER@@ after +@@TIME_AFTER@@ past +@@TIME_BEFORE@@ to +@@TIME_BEFORE@@ till +@@TIME_QUARTER@@ quarter +@@TIME_HALF@@ half +@@TIME_ZERO@@ oh +@@TIME_THREE_QUARTER@@ three quarters +@@ARITHMETIC_PLUS@@ plus +@@ARITHMETIC_TIMES@@ times +@@ARITHMETIC_TIMES@@ multiplied by +@@ARITHMETIC_MINUS@@ minus +@@ARITHMETIC_DIVISION@@ divided by +@@ARITHMETIC_DIVISION@@ over +@@ARITHMETIC_EQUALS@@ equals +@@PERCENT@@ percent +@@DEGREE@@ degree +@@DEGREE@@ degrees +@@SQUARE_ROOT@@ square root of +@@SQUARE_ROOT@@ the square root of +@@STAR@@ star +@@HYPHEN@@ hyphen +@@AT@@ at +@@PER@@ per +@@PERIOD@@ period +@@PERIOD@@ full stop +@@PERIOD@@ dot +@@EXCLAMATION_MARK@@ exclamation mark +@@EXCLAMATION_MARK@@ exclamation point +@@COMMA@@ comma +@@POSITIVE@@ positive +@@NEGATIVE@@ negative +@@OTHER_ZERO_VERBALIZATIONS@@ oh diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/math.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/math.grm new file mode 100644 index 000000000..764e6e02e --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/math.grm @@ -0,0 +1,34 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'en/verbalizer/float.grm' as f; +import 'en/verbalizer/lexical_map.grm' as l; +import 'en/verbalizer/numbers.grm' as n; + +float = f.FLOAT; +card = n.CARDINAL_NUMBERS; +number = card | float; + +plus = "+" : " @@ARITHMETIC_PLUS@@ "; +times = "*" : " @@ARITHMETIC_TIMES@@ "; +minus = "-" : " @@ARITHMETIC_MINUS@@ "; +division = "/" : " @@ARITHMETIC_DIVISION@@ "; + +operator = plus | times | minus | division; + +percent = "%" : " @@PERCENT@@"; + +export ARITHMETIC = + Optimize[((number operator number) | (number percent)) @ l.LEXICAL_MAP] +; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/miscellaneous.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/miscellaneous.grm new file mode 100644 index 000000000..3a087d95c --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/miscellaneous.grm @@ -0,0 +1,78 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'ru/classifier/cyrillic.grm' as c; +import 'en/verbalizer/extra_numbers.grm' as e; +import 'en/verbalizer/lexical_map.grm' as l; +import 'en/verbalizer/numbers.grm' as n; +import 'en/verbalizer/spelled.grm' as s; + +letter = b.kAlpha | c.kCyrillicAlpha; +dash = "-"; +word = letter+; +possibly_split_word = word (((dash | ".") : " ") word)* n.D["."]?; + +post_word_symbol = + ("+" : ("@@ARITHMETIC_PLUS@@" | "@@POSITIVE@@")) | + ("-" : ("@@ARITHMETIC_MINUS@@" | "@@NEGATIVE@@")) | + ("*" : "@@STAR@@") +; + +pre_word_symbol = + ("@" : "@@AT@@") | + ("/" : "@@SLASH@@") | + ("#" : "@@HASH@@") +; + +post_word = possibly_split_word n.I[" "] post_word_symbol; + +pre_word = pre_word_symbol n.I[" "] possibly_split_word; + +## Number/digit sequence combos, maybe with a dash + +spelled_word = word @ s.SPELLED_NO_LETTER; + +word_number = + (word | spelled_word) + (n.I[" "] | (dash : " ")) + (e.DIGITS | n.CARDINAL_NUMBERS | e.MIXED_NUMBERS) +; + +number_word = + (e.DIGITS | n.CARDINAL_NUMBERS | e.MIXED_NUMBERS) + (n.I[" "] | (dash : " ")) + (word | spelled_word) +; + +## Two-digit year. + +# Note that in this case to be fair we really have to allow ordinals too since +# in some languages that's what you would have. + +two_digit_year = n.D["'"] (b.kDigit{2} @ (n.CARDINAL_NUMBERS | e.DIGITS)); + +dot_com = ("." : "@@URL_DOT_EXPRESSION@@") n.I[" "] "com"; + +miscellaneous = Optimize[ + possibly_split_word + | post_word + | pre_word + | word_number + | number_word + | two_digit_year + | dot_com +]; + +export MISCELLANEOUS = Optimize[miscellaneous @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/money.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/money.grm new file mode 100644 index 000000000..e37a7f7b3 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/money.grm @@ -0,0 +1,44 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'en/verbalizer/lexical_map.grm' as l; +import 'en/verbalizer/numbers.grm' as n; + +card = n.CARDINAL_NUMBERS; + +__currency__ = StringFile['en/verbalizer/money.tsv']; + +d = b.kDigit; +D = d - "0"; + +cents = ((n.D["0"] | D) d) @ card; + +# Only dollar for the verbalizer tests for English. Will need to add other +# currencies. +usd_maj = Project["usd_maj" @ __currency__, 'output']; +usd_min = Project["usd_min" @ __currency__, 'output']; +and = " @@MONEY_AND@@ " | " "; + +dollar1 = + n.D["$"] card n.I[" " usd_maj] n.I[and] n.D["."] cents n.I[" " usd_min] +; + +dollar2 = n.D["$"] card n.I[" " usd_maj] n.D["."] n.D["00"]; + +dollar3 = n.D["$"] card n.I[" " usd_maj]; + +dollar = Optimize[dollar1 | dollar2 | dollar3]; + +export MONEY = Optimize[dollar @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/money.tsv b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/money.tsv new file mode 100644 index 000000000..f3965cf41 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/money.tsv @@ -0,0 +1,4 @@ +usd_maj dollar +usd_maj dollars +usd_min cent +usd_min cents diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/number_names.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/number_names.grm new file mode 100644 index 000000000..3e07532fe --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/number_names.grm @@ -0,0 +1,54 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# English minimally supervised number grammar. +# +# Supports both cardinals and ordinals without overt marking. +# +# The language-specific acceptor G was compiled with digit, teen, and decade +# preterminals. The lexicon transducer L is unambiguous so no LM is used. + +import 'util/arithmetic.grm' as a; + +# Intersects the universal factorization transducer (F) with the +# language-specific acceptor (G). + +d = a.DELTA_STAR; +f = a.IARITHMETIC_RESTRICTED; +g = LoadFst['en/verbalizer/g.fst']; +fg = Optimize[d @ Optimize[f @ Optimize[f @ Optimize[f @ g]]]]; +test1 = AssertEqual["230" @ fg, "(+ (* 2 100 *) 30 +)"]; + +# Compiles lexicon transducer (L). + +cardinal_name = StringFile['en/verbalizer/cardinals.tsv']; +cardinal_l = Optimize[(cardinal_name " ")* cardinal_name]; +test2 = AssertEqual["2 100 30" @ cardinal_l, "two hundred thirty"]; + +ordinal_name = StringFile['en/verbalizer/ordinals.tsv']; +# In English, ordinals have the same syntax as cardinals and all but the final +# element is verbalized using a cardinal number word; e.g., "two hundred +# thirtieth". +ordinal_l = Optimize[(cardinal_name " ")* ordinal_name]; +test3 = AssertEqual["2 100 30" @ ordinal_l, "two hundred thirtieth"]; + +# Composes L with the leaf transducer (P), then composes that with FG. + +p = a.LEAVES; + +export CARDINAL_NUMBER_NAME = Optimize[fg @ (p @ cardinal_l)]; +test4 = AssertEqual["230" @ CARDINAL_NUMBER_NAME, "two hundred thirty"]; + +export ORDINAL_NUMBER_NAME = Optimize[fg @ (p @ ordinal_l)]; +test5 = AssertEqual["230" @ ORDINAL_NUMBER_NAME, "two hundred thirtieth"]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/numbers.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/numbers.grm new file mode 100644 index 000000000..e158b7a02 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/numbers.grm @@ -0,0 +1,57 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'en/verbalizer/number_names.grm' as n; +import 'util/byte.grm' as bytelib; +import 'universal/thousands_punct.grm' as t; + +cardinal = n.CARDINAL_NUMBER_NAME; +ordinal = n.ORDINAL_NUMBER_NAME; + +# Putting these here since this grammar gets incorporated by all the others. + +func I[expr] { + return "" : expr; +} + +func D[expr] { + return expr : ""; +} + +separators = t.comma_thousands | t.no_delimiter; + +# Language specific endings for ordinals. +d = bytelib.kDigit; +endings = "st" | "nd" | "rd" | "th"; + +st = (d* "1") - (d* "11"); +nd = (d* "2") - (d* "12"); +rd = (d* "3") - (d* "13"); +th = Optimize[d* - st - nd - rd]; +first = st ("st" : ""); +second = nd ("nd" : ""); +third = rd ("rd" : ""); +other = th ("th" : ""); +marked_ordinal = Optimize[first | second | third | other]; + +# The separator is a no-op here but will be needed once we replace +# the above targets. + +export CARDINAL_NUMBERS = Optimize[separators @ cardinal]; + +export ORDINAL_NUMBERS = + Optimize[(separators endings) @ marked_ordinal @ ordinal] +; + +export ORDINAL_NUMBERS_UNMARKED = Optimize[separators @ ordinal]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/numbers_plus.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/numbers_plus.grm new file mode 100644 index 000000000..a152e8133 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/numbers_plus.grm @@ -0,0 +1,133 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# Grammar for things built mostly on numbers. + +import 'en/verbalizer/factorization.grm' as f; +import 'en/verbalizer/lexical_map.grm' as l; +import 'en/verbalizer/numbers.grm' as n; + +num = n.CARDINAL_NUMBERS; +ord = n.ORDINAL_NUMBERS_UNMARKED; +digits = f.FRACTIONAL_PART_UNGROUPED; + +# Various symbols. + +plus = "+" : "@@ARITHMETIC_PLUS@@"; +minus = "-" : "@@ARITHMETIC_MINUS@@"; +slash = "/" : "@@SLASH@@"; +dot = "." : "@@URL_DOT_EXPRESSION@@"; +dash = "-" : "@@DASH@@"; +equals = "=" : "@@ARITHMETIC_EQUALS@@"; + +degree = "°" : "@@DEGREE@@"; + +division = ("/" | "÷") : "@@ARITHMETIC_DIVISION@@"; + +times = ("x" | "*") : "@@ARITHMETIC_TIMES@@"; + +power = "^" : "@@DECIMAL_EXPONENT@@"; + +square_root = "√" : "@@SQUARE_ROOT@@"; + +percent = "%" : "@@PERCENT@@"; + +# Safe roman numbers. + +# NB: Do not change the formatting here. NO_EDIT must be on the same +# line as the path. +rfile = + 'universal/roman_numerals.tsv' # NO_EDIT +; + +roman = StringFile[rfile]; + +## Main categories. + +cat_dot_number = + num + n.I[" "] dot n.I[" "] num + (n.I[" "] dot n.I[" "] num)+ +; + +cat_slash_number = + num + n.I[" "] slash n.I[" "] num + (n.I[" "] slash n.I[" "] num)* +; + +cat_dash_number = + num + n.I[" "] dash n.I[" "] num + (n.I[" "] dash n.I[" "] num)* +; + +cat_signed_number = ((plus | minus) n.I[" "])? num; + +cat_degree = cat_signed_number n.I[" "] degree; + +cat_country_code = plus n.I[" "] (num | digits); + +cat_math_operations = + plus + | minus + | division + | times + | equals + | percent + | power + | square_root +; + +# Roman numbers are often either cardinals or ordinals in various languages. +cat_roman = roman @ (num | ord); + +# Allow +# +# number:number +# number-number +# +# to just be +# +# number number. + +cat_number_number = + num ((":" | "-") : " ") num +; + +# Some additional readings for these symbols. + +cat_additional_readings = + ("/" : "@@PER@@") | + ("+" : "@@AND@@") | + ("-" : ("@@HYPHEN@@" | "@@CONNECTOR_TO@@")) | + ("*" : "@@STAR@@") | + ("x" : ("x" | "@@CONNECTOR_BY@@")) | + ("@" : "@@AT@@") +; + +numbers_plus = Optimize[ + cat_dot_number + | cat_slash_number + | cat_dash_number + | cat_signed_number + | cat_degree + | cat_country_code + | cat_math_operations + | cat_roman + | cat_number_number + | cat_additional_readings +]; + +export NUMBERS_PLUS = Optimize[numbers_plus @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/ordinals.tsv b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/ordinals.tsv new file mode 100644 index 000000000..f4d3d37e0 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/ordinals.tsv @@ -0,0 +1,32 @@ +0 zeroth +1 first +2 second +3 third +4 fourth +5 fifth +6 sixth +7 seventh +8 eighth +9 ninth +10 tenth +11 eleventh +12 twelfth +13 thirteenth +14 fourteenth +15 fifteenth +16 sixteenth +17 seventeenth +18 eighteenth +19 nineteenth +20 twentieth +30 thirtieth +40 fortieth +50 fiftieth +60 sixtieth +70 seventieth +80 eightieth +90 ninetieth +100 hundredth +1000 thousandth +1000000 millionth +1000000000 billionth diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/params.tsv b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/params.tsv new file mode 100644 index 000000000..d31a8a4ae --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/params.tsv @@ -0,0 +1,7 @@ +float.grm __fractional_part__ = fractional_part_ungrouped | fractional_part_unparsed; +telephone.grm __grouping__ = f.UNGROUPED; +measure.grm __measure__ = StringFile['en/verbalizer/measures.tsv']; +money.grm __currency__ = StringFile['en/verbalizer/money.tsv']; +time.grm __sep__ = ":"; +time.grm __am__ = "a.m." | "am" | "AM"; +time.grm __pm__ = "p.m." | "pm" | "PM"; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/podspeech.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/podspeech.grm new file mode 100644 index 000000000..1c67c2e3f --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/podspeech.grm @@ -0,0 +1,46 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/util.grm' as util; +import 'util/case.grm' as case; +import 'en/verbalizer/extra_numbers.grm' as e; +import 'en/verbalizer/float.grm' as f; +import 'en/verbalizer/math.grm' as ma; +import 'en/verbalizer/miscellaneous.grm' as mi; +import 'en/verbalizer/money.grm' as mo; +import 'en/verbalizer/numbers.grm' as n; +import 'en/verbalizer/numbers_plus.grm' as np; +import 'en/verbalizer/spelled.grm' as s; +import 'en/verbalizer/spoken_punct.grm' as sp; +import 'en/verbalizer/time.grm' as t; +import 'en/verbalizer/urls.grm' as u; + +export POD_SPEECH_TN = Optimize[RmWeight[ + (u.URL + | e.MIXED_NUMBERS + | e.DIGITS + | f.FLOAT + | ma.ARITHMETIC + | mo.MONEY + | n.CARDINAL_NUMBERS + | n.ORDINAL_NUMBERS + | np.NUMBERS_PLUS + | s.SPELLED + | sp.SPOKEN_PUNCT + | t.TIME + | u.URL + | u.EMAILS) @ util.CLEAN_SPACES @ case.TOUPPER +]]; + +#export POD_SPEECH_TN = Optimize[RmWeight[(mi.MISCELLANEOUS) @ util.CLEAN_SPACES @ case.TOUPPER]]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/spelled.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/spelled.grm new file mode 100644 index 000000000..b04974d2a --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/spelled.grm @@ -0,0 +1,77 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# This verbalizer is used whenever there is an LM symbol that consists of +# letters immediately followed by "{spelled}".l This strips the "{spelled}" +# suffix. + +import 'util/byte.grm' as b; +import 'ru/classifier/cyrillic.grm' as c; +import 'en/verbalizer/lexical_map.grm' as l; +import 'en/verbalizer/numbers.grm' as n; + +digit = b.kDigit @ n.CARDINAL_NUMBERS; + +char_set = (("a" | "A") : "letter-a") + | (("b" | "B") : "letter-b") + | (("c" | "C") : "letter-c") + | (("d" | "D") : "letter-d") + | (("e" | "E") : "letter-e") + | (("f" | "F") : "letter-f") + | (("g" | "G") : "letter-g") + | (("h" | "H") : "letter-h") + | (("i" | "I") : "letter-i") + | (("j" | "J") : "letter-j") + | (("k" | "K") : "letter-k") + | (("l" | "L") : "letter-l") + | (("m" | "M") : "letter-m") + | (("n" | "N") : "letter-n") + | (("o" | "O") : "letter-o") + | (("p" | "P") : "letter-p") + | (("q" | "Q") : "letter-q") + | (("r" | "R") : "letter-r") + | (("s" | "S") : "letter-s") + | (("t" | "T") : "letter-t") + | (("u" | "U") : "letter-u") + | (("v" | "V") : "letter-v") + | (("w" | "W") : "letter-w") + | (("x" | "X") : "letter-x") + | (("y" | "Y") : "letter-y") + | (("z" | "Z") : "letter-z") + | (digit) + | ("&" : "@@AND@@") + | ("." : "") + | ("-" : "") + | ("_" : "") + | ("/" : "") + | (n.I["letter-"] c.kCyrillicAlpha) + ; + +ins_space = "" : " "; + +suffix = "{spelled}" : ""; + +spelled = Optimize[char_set (ins_space char_set)* suffix]; + +export SPELLED = Optimize[spelled @ l.LEXICAL_MAP]; + +sigma_star = b.kBytes*; + +# Gets rid of the letter- prefix since in some cases we don't want it. + +del_letter = CDRewrite[n.D["letter-"], "", "", sigma_star]; + +spelled_no_tag = Optimize[char_set (ins_space char_set)*]; + +export SPELLED_NO_LETTER = Optimize[spelled_no_tag @ del_letter]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/spoken_punct.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/spoken_punct.grm new file mode 100644 index 000000000..b0db6535b --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/spoken_punct.grm @@ -0,0 +1,24 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'en/verbalizer/lexical_map.grm' as l; + +punct = + ("." : "@@PERIOD@@") + | ("," : "@@COMMA@@") + | ("!" : "@@EXCLAMATION_MARK@@") + | ("?" : "@@QUESTION_MARK@@") +; + +export SPOKEN_PUNCT = Optimize[punct @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/time.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/time.grm new file mode 100644 index 000000000..0bf92d0ab --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/time.grm @@ -0,0 +1,108 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'en/verbalizer/lexical_map.grm' as l; +import 'en/verbalizer/numbers.grm' as n; + +# Only handles 24-hour time with quarter-to, half-past and quarter-past. + +increment_hour = + ("0" : "1") + | ("1" : "2") + | ("2" : "3") + | ("3" : "4") + | ("4" : "5") + | ("5" : "6") + | ("6" : "7") + | ("7" : "8") + | ("8" : "9") + | ("9" : "10") + | ("10" : "11") + | ("11" : "12") + | ("12" : "1") # If someone uses 12, we assume 12-hour by default. + | ("13" : "14") + | ("14" : "15") + | ("15" : "16") + | ("16" : "17") + | ("17" : "18") + | ("18" : "19") + | ("19" : "20") + | ("20" : "21") + | ("21" : "22") + | ("22" : "23") + | ("23" : "12") +; + +hours = Project[increment_hour, 'input']; + +d = b.kDigit; +D = d - "0"; + +minutes09 = "0" D; + +minutes = ("1" | "2" | "3" | "4" | "5") d; + +__sep__ = ":"; +sep_space = __sep__ : " "; + +verbalize_hours = hours @ n.CARDINAL_NUMBERS; + +verbalize_minutes = + ("00" : "@@HOUR@@") + | (minutes09 @ (("0" : "@@TIME_ZERO@@") n.I[" "] n.CARDINAL_NUMBERS)) + | (minutes @ n.CARDINAL_NUMBERS) +; + +time_basic = Optimize[verbalize_hours sep_space verbalize_minutes]; + +# Special cases we handle right now. +# TODO: Need to allow for cases like +# +# half twelve (in the UK English sense) +# half twaalf (in the Dutch sense) + +time_quarter_past = + n.I["@@TIME_QUARTER@@ @@TIME_AFTER@@ "] + verbalize_hours + n.D[__sep__ "15"]; + +time_half_past = + n.I["@@TIME_HALF@@ @@TIME_AFTER@@ "] + verbalize_hours + n.D[__sep__ "30"]; + +time_quarter_to = + n.I["@@TIME_QUARTER@@ @@TIME_BEFORE@@ "] + (increment_hour @ verbalize_hours) + n.D[__sep__ "45"]; + +time_extra = Optimize[ + time_quarter_past | time_half_past | time_quarter_to] +; + +# Basic time periods which most languages can be expected to have. +__am__ = "a.m." | "am" | "AM"; +__pm__ = "p.m." | "pm" | "PM"; + +period = (__am__ : "@@TIME_AM@@") | (__pm__ : "@@TIME_PM@@"); + +time_variants = time_basic | time_extra; + +time = Optimize[ + (period (" " | n.I[" "]))? time_variants + | time_variants ((" " | n.I[" "]) period)?] +; + +export TIME = Optimize[time @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/urls.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/urls.grm new file mode 100644 index 000000000..a2232f9bc --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/urls.grm @@ -0,0 +1,68 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Rules for URLs and email addresses. + +import 'util/byte.grm' as bytelib; +import 'en/verbalizer/lexical_map.grm' as l; + +ins_space = "" : " "; +dot = "." : "@@URL_DOT_EXPRESSION@@"; +at = "@" : "@@AT@@"; + +url_suffix = + (".com" : dot ins_space "com") | + (".gov" : dot ins_space "gov") | + (".edu" : dot ins_space "e d u") | + (".org" : dot ins_space "org") | + (".net" : dot ins_space "net") +; + +letter_string = (bytelib.kAlnum)* bytelib.kAlnum; + +letter_string_dot = + ((letter_string ins_space dot ins_space)* letter_string) +; + +# Rules for URLs. +export URL = Optimize[ + ((letter_string_dot) (ins_space) + (url_suffix)) @ l.LEXICAL_MAP +]; + +# Rules for email addresses. +letter_by_letter = ((bytelib.kAlnum ins_space)* bytelib.kAlnum); + +letter_by_letter_dot = + ((letter_by_letter ins_space dot ins_space)* + letter_by_letter) +; + +export EMAIL1 = Optimize[ + ((letter_by_letter) (ins_space) + (at) (ins_space) + (letter_by_letter_dot) (ins_space) + (url_suffix)) @ l.LEXICAL_MAP +]; + +export EMAIL2 = Optimize[ + ((letter_by_letter) (ins_space) + (at) (ins_space) + (letter_string_dot) (ins_space) + (url_suffix)) @ l.LEXICAL_MAP +]; + +export EMAILS = Optimize[ + EMAIL1 | EMAIL2 +]; diff --git a/third_party/chinese_text_normalization/thrax/src/en/verbalizer/verbalizer.grm b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/verbalizer.grm new file mode 100644 index 000000000..fe6f4e42c --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/en/verbalizer/verbalizer.grm @@ -0,0 +1,42 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/util.grm' as util; +import 'en/verbalizer/extra_numbers.grm' as e; +import 'en/verbalizer/float.grm' as f; +import 'en/verbalizer/math.grm' as ma; +import 'en/verbalizer/miscellaneous.grm' as mi; +import 'en/verbalizer/money.grm' as mo; +import 'en/verbalizer/numbers.grm' as n; +import 'en/verbalizer/numbers_plus.grm' as np; +import 'en/verbalizer/spelled.grm' as s; +import 'en/verbalizer/spoken_punct.grm' as sp; +import 'en/verbalizer/time.grm' as t; +import 'en/verbalizer/urls.grm' as u; + +export VERBALIZER = Optimize[RmWeight[ + ( e.MIXED_NUMBERS + | e.DIGITS + | f.FLOAT + | ma.ARITHMETIC + | mi.MISCELLANEOUS + | mo.MONEY + | n.CARDINAL_NUMBERS + | n.ORDINAL_NUMBERS + | np.NUMBERS_PLUS + | s.SPELLED + | sp.SPOKEN_PUNCT + | t.TIME + | u.URL) @ util.CLEAN_SPACES +]]; diff --git a/third_party/chinese_text_normalization/thrax/src/number_data/README.md b/third_party/chinese_text_normalization/thrax/src/number_data/README.md new file mode 100644 index 000000000..dd76ad16c --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/number_data/README.md @@ -0,0 +1,17 @@ +This directory contains data used in: + + Gorman, K., and Sproat, R. 2016. Minimally supervised number normalization. + Transactions of the Association for Computational Linguistics 4: 507-519. + +* `minimal.txt`: A list of 30 curated numbers used as the "minimal" training + set. +* `random-trn.txt`: A list of 9000 randomly-generated numbers used as the + "medium" training set. +* `random-tst.txt`: A list of 1000 randomly-generated numbers used as the test + set. + +Note that `random-trn.txt` and `random-tst.txt` are totally disjoint, but that +a small number of examples occur both in `minimal.txt` and `random-tst.txt`. + +For information about the sampling procedure used to generate the random data +sets, see appendix A of the aforementioned paper. diff --git a/third_party/chinese_text_normalization/thrax/src/number_data/minimal.txt b/third_party/chinese_text_normalization/thrax/src/number_data/minimal.txt new file mode 100644 index 000000000..dd0704fd9 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/number_data/minimal.txt @@ -0,0 +1,300 @@ +0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 +53 +54 +55 +56 +57 +58 +59 +60 +61 +62 +63 +64 +65 +66 +67 +68 +69 +70 +71 +72 +73 +74 +75 +76 +77 +78 +79 +80 +81 +82 +83 +84 +85 +86 +87 +88 +89 +90 +91 +92 +93 +94 +95 +96 +97 +98 +99 +100 +101 +102 +103 +104 +105 +106 +107 +108 +109 +110 +111 +112 +113 +114 +115 +116 +117 +118 +119 +120 +121 +122 +123 +124 +125 +126 +127 +128 +129 +130 +131 +132 +133 +134 +135 +136 +137 +138 +139 +140 +141 +142 +143 +144 +145 +146 +147 +148 +149 +150 +151 +152 +153 +154 +155 +156 +157 +158 +159 +160 +161 +162 +163 +164 +165 +166 +167 +168 +169 +170 +171 +172 +173 +174 +175 +176 +177 +178 +179 +180 +181 +182 +183 +184 +185 +186 +187 +188 +189 +190 +191 +192 +193 +194 +195 +196 +197 +198 +199 +200 +201 +202 +203 +204 +205 +206 +207 +208 +209 +210 +211 +212 +220 +221 +230 +300 +400 +500 +600 +700 +800 +900 +1000 +1001 +1002 +1003 +1004 +1005 +1006 +1007 +1008 +1009 +1010 +1011 +1012 +1020 +1021 +1030 +1200 +2000 +2001 +2002 +2003 +2004 +2005 +2006 +2007 +2008 +2009 +2010 +2011 +2012 +2020 +2021 +2030 +2100 +2200 +5001 +10000 +12000 +20000 +21000 +50001 +100000 +120000 +200000 +210000 +500001 +1000000 +1001000 +1200000 +2000000 +2100000 +5000001 +10000000 +10001000 +12000000 +20000000 +50000001 +100000000 +100001000 +120000000 +200000000 +500000001 +1000000000 +1000001000 +1200000000 +2000000000 +5000000001 +10000000000 +10000001000 +12000000000 +20000000000 +50000000001 +100000000000 +100000001000 +120000000000 +200000000000 +500000000001 diff --git a/third_party/chinese_text_normalization/thrax/src/number_data/random-trn.txt b/third_party/chinese_text_normalization/thrax/src/number_data/random-trn.txt new file mode 100644 index 000000000..103a7063d --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/number_data/random-trn.txt @@ -0,0 +1,9000 @@ +0 +1 +2 +3 +4 +5 +6 +7 +8 +9 +10 +11 +12 +13 +14 +15 +16 +17 +18 +19 +20 +21 +22 +23 +24 +25 +26 +27 +28 +29 +30 +31 +32 +33 +34 +35 +36 +37 +38 +39 +40 +41 +42 +43 +44 +45 +46 +47 +48 +49 +50 +51 +52 +53 +54 +55 +56 +57 +58 +59 +60 +61 +62 +63 +64 +65 +66 +67 +68 +69 +70 +71 +72 +73 +74 +75 +76 +77 +78 +79 +80 +81 +82 +83 +84 +85 +86 +87 +88 +89 +90 +91 +92 +93 +94 +95 +96 +97 +98 +99 +100 +101 +102 +103 +104 +105 +106 +107 +108 +109 +110 +111 +112 +113 +114 +115 +116 +117 +118 +119 +120 +121 +122 +123 +124 +125 +126 +127 +128 +129 +130 +131 +132 +133 +134 +135 +136 +137 +138 +139 +140 +141 +142 +143 +144 +145 +146 +147 +148 +149 +150 +151 +152 +153 +154 +155 +156 +157 +158 +159 +160 +161 +162 +163 +164 +165 +166 +167 +168 +169 +170 +171 +172 +173 +174 +175 +176 +177 +178 +179 +180 +181 +182 +183 +184 +185 +186 +187 +188 +189 +190 +191 +192 +193 +194 +195 +196 +197 +198 +199 +200 +201 +202 +203 +206 +208 +210 +212 +213 +214 +215 +217 +218 +219 +221 +223 +224 +226 +228 +229 +231 +232 +234 +235 +238 +240 +242 +247 +249 +251 +253 +256 +259 +260 +261 +267 +268 +272 +278 +279 +280 +283 +284 +286 +287 +288 +289 +292 +296 +297 +299 +300 +305 +306 +307 +309 +311 +312 +314 +315 +316 +318 +319 +320 +321 +325 +326 +327 +328 +329 +330 +331 +333 +334 +335 +336 +339 +340 +341 +343 +345 +350 +351 +352 +353 +354 +355 +356 +360 +361 +362 +369 +372 +373 +374 +375 +378 +380 +382 +383 +384 +385 +387 +388 +389 +391 +393 +394 +400 +403 +404 +406 +407 +409 +411 +412 +413 +415 +416 +419 +420 +421 +423 +426 +427 +430 +432 +433 +435 +437 +438 +439 +440 +441 +442 +444 +445 +447 +448 +449 +453 +454 +456 +457 +460 +461 +463 +464 +465 +467 +468 +470 +474 +475 +476 +477 +478 +479 +480 +481 +483 +485 +486 +487 +488 +490 +491 +492 +493 +496 +497 +498 +499 +500 +501 +502 +504 +506 +508 +509 +511 +512 +514 +515 +517 +518 +521 +522 +529 +532 +533 +534 +536 +539 +542 +543 +544 +545 +546 +549 +550 +551 +552 +553 +557 +559 +560 +561 +562 +564 +565 +567 +568 +570 +571 +572 +575 +576 +577 +579 +581 +582 +583 +584 +586 +588 +590 +591 +592 +597 +598 +600 +602 +603 +606 +608 +609 +610 +611 +613 +616 +617 +618 +623 +626 +628 +630 +631 +632 +634 +635 +637 +638 +639 +641 +643 +644 +645 +647 +649 +651 +659 +660 +661 +663 +669 +670 +672 +673 +674 +676 +683 +684 +686 +687 +690 +691 +695 +696 +697 +698 +699 +700 +702 +703 +705 +706 +707 +709 +712 +714 +717 +718 +719 +720 +721 +722 +724 +727 +728 +731 +733 +734 +735 +736 +739 +743 +744 +745 +746 +748 +749 +752 +753 +754 +755 +756 +758 +759 +762 +764 +765 +767 +769 +771 +772 +773 +774 +777 +778 +779 +780 +782 +783 +784 +785 +786 +787 +789 +791 +794 +795 +796 +798 +799 +800 +801 +802 +805 +806 +807 +808 +809 +811 +812 +813 +816 +817 +819 +820 +822 +823 +824 +825 +826 +827 +828 +830 +832 +833 +834 +835 +841 +842 +844 +846 +848 +850 +852 +853 +855 +856 +857 +858 +859 +861 +862 +863 +864 +866 +867 +869 +870 +871 +874 +876 +877 +880 +882 +885 +886 +887 +888 +891 +894 +895 +896 +897 +901 +905 +906 +907 +908 +913 +914 +915 +916 +919 +920 +923 +925 +926 +929 +931 +933 +935 +936 +937 +938 +939 +941 +942 +946 +951 +952 +953 +954 +955 +957 +958 +961 +962 +967 +971 +976 +977 +978 +979 +980 +981 +982 +983 +988 +990 +991 +992 +993 +996 +997 +998 +1001 +1004 +1005 +1006 +1007 +1008 +1009 +1010 +1011 +1012 +1015 +1019 +1020 +1021 +1024 +1026 +1028 +1030 +1031 +1033 +1034 +1036 +1037 +1039 +1040 +1041 +1042 +1043 +1045 +1046 +1048 +1049 +1051 +1053 +1054 +1056 +1058 +1060 +1061 +1064 +1065 +1067 +1069 +1072 +1073 +1074 +1075 +1078 +1080 +1082 +1084 +1086 +1088 +1089 +1091 +1092 +1093 +1095 +1096 +1099 +1100 +1102 +1103 +1105 +1106 +1109 +1110 +1111 +1112 +1114 +1116 +1117 +1118 +1119 +1120 +1122 +1125 +1126 +1127 +1128 +1130 +1135 +1136 +1138 +1142 +1143 +1145 +1146 +1147 +1148 +1149 +1150 +1153 +1154 +1157 +1158 +1162 +1163 +1164 +1165 +1166 +1169 +1170 +1171 +1174 +1175 +1178 +1180 +1181 +1183 +1185 +1187 +1188 +1190 +1191 +1193 +1195 +1198 +1200 +1201 +1204 +1206 +1207 +1208 +1209 +1212 +1215 +1216 +1217 +1219 +1220 +1221 +1223 +1224 +1225 +1227 +1228 +1230 +1232 +1233 +1236 +1238 +1239 +1243 +1244 +1245 +1246 +1249 +1250 +1251 +1254 +1256 +1261 +1263 +1264 +1268 +1269 +1270 +1271 +1272 +1274 +1276 +1277 +1283 +1284 +1287 +1288 +1289 +1291 +1292 +1294 +1296 +1298 +1300 +1303 +1304 +1305 +1306 +1309 +1313 +1316 +1317 +1319 +1320 +1321 +1322 +1323 +1324 +1328 +1330 +1331 +1333 +1335 +1336 +1337 +1338 +1341 +1346 +1348 +1349 +1350 +1351 +1353 +1354 +1355 +1356 +1357 +1359 +1360 +1361 +1362 +1363 +1365 +1366 +1367 +1370 +1371 +1373 +1374 +1376 +1377 +1380 +1381 +1382 +1384 +1386 +1387 +1388 +1390 +1392 +1394 +1395 +1396 +1400 +1403 +1405 +1406 +1407 +1408 +1409 +1410 +1412 +1413 +1414 +1415 +1416 +1419 +1421 +1422 +1423 +1424 +1425 +1427 +1429 +1433 +1434 +1435 +1436 +1437 +1438 +1440 +1443 +1445 +1446 +1448 +1454 +1457 +1460 +1461 +1465 +1468 +1469 +1470 +1474 +1475 +1477 +1483 +1485 +1487 +1488 +1489 +1490 +1491 +1493 +1494 +1496 +1497 +1498 +1501 +1502 +1503 +1505 +1506 +1507 +1508 +1510 +1511 +1512 +1513 +1514 +1515 +1518 +1519 +1520 +1522 +1523 +1525 +1526 +1529 +1530 +1531 +1532 +1534 +1536 +1537 +1539 +1540 +1541 +1542 +1543 +1544 +1545 +1546 +1554 +1555 +1556 +1560 +1561 +1562 +1563 +1564 +1565 +1567 +1568 +1569 +1570 +1572 +1573 +1575 +1576 +1577 +1578 +1579 +1580 +1584 +1586 +1588 +1590 +1591 +1595 +1596 +1598 +1600 +1601 +1602 +1603 +1609 +1610 +1611 +1614 +1616 +1617 +1620 +1621 +1622 +1625 +1627 +1631 +1632 +1633 +1636 +1637 +1641 +1644 +1645 +1650 +1652 +1658 +1659 +1663 +1664 +1665 +1666 +1667 +1669 +1670 +1671 +1673 +1677 +1678 +1679 +1680 +1681 +1682 +1683 +1691 +1694 +1697 +1699 +1700 +1701 +1702 +1703 +1704 +1706 +1708 +1712 +1713 +1715 +1718 +1721 +1723 +1724 +1725 +1726 +1727 +1728 +1730 +1731 +1736 +1740 +1741 +1742 +1744 +1746 +1749 +1750 +1751 +1752 +1753 +1754 +1755 +1756 +1763 +1766 +1767 +1768 +1769 +1772 +1775 +1777 +1778 +1780 +1783 +1784 +1785 +1791 +1793 +1795 +1796 +1798 +1800 +1801 +1803 +1804 +1805 +1810 +1812 +1814 +1816 +1818 +1820 +1822 +1823 +1825 +1826 +1828 +1829 +1830 +1833 +1836 +1837 +1839 +1842 +1844 +1845 +1846 +1848 +1852 +1853 +1854 +1855 +1857 +1858 +1859 +1860 +1861 +1862 +1863 +1864 +1865 +1866 +1868 +1869 +1874 +1876 +1879 +1882 +1884 +1885 +1886 +1887 +1888 +1890 +1892 +1893 +1894 +1895 +1896 +1898 +1899 +1900 +1901 +1902 +1903 +1906 +1907 +1908 +1909 +1910 +1911 +1912 +1913 +1914 +1916 +1917 +1918 +1919 +1920 +1921 +1925 +1927 +1929 +1935 +1936 +1938 +1939 +1940 +1941 +1943 +1946 +1947 +1948 +1953 +1954 +1956 +1957 +1958 +1959 +1960 +1961 +1962 +1963 +1968 +1971 +1973 +1978 +1979 +1980 +1981 +1985 +1988 +1989 +1991 +1993 +1994 +1997 +1998 +1999 +2000 +2003 +2004 +2005 +2007 +2008 +2009 +2011 +2014 +2015 +2016 +2017 +2021 +2022 +2023 +2027 +2028 +2030 +2031 +2032 +2033 +2034 +2036 +2037 +2039 +2042 +2043 +2044 +2045 +2046 +2050 +2051 +2053 +2054 +2057 +2060 +2061 +2063 +2065 +2066 +2068 +2069 +2071 +2074 +2076 +2077 +2078 +2079 +2080 +2081 +2084 +2085 +2086 +2088 +2089 +2090 +2091 +2092 +2093 +2094 +2095 +2096 +2097 +2098 +2100 +2101 +2102 +2105 +2106 +2109 +2111 +2112 +2113 +2114 +2115 +2117 +2118 +2121 +2122 +2125 +2127 +2128 +2131 +2134 +2136 +2137 +2138 +2141 +2144 +2145 +2147 +2148 +2149 +2150 +2153 +2154 +2156 +2157 +2158 +2159 +2160 +2161 +2162 +2163 +2164 +2165 +2169 +2170 +2172 +2173 +2175 +2177 +2178 +2183 +2185 +2189 +2190 +2192 +2195 +2196 +2198 +2199 +2202 +2204 +2207 +2208 +2210 +2211 +2212 +2213 +2214 +2216 +2217 +2218 +2219 +2221 +2222 +2224 +2225 +2227 +2229 +2230 +2231 +2232 +2233 +2235 +2236 +2239 +2240 +2241 +2242 +2243 +2244 +2245 +2247 +2249 +2252 +2253 +2254 +2255 +2257 +2260 +2261 +2263 +2264 +2266 +2267 +2269 +2271 +2272 +2273 +2274 +2278 +2281 +2282 +2284 +2288 +2289 +2291 +2293 +2296 +2297 +2298 +2299 +2301 +2302 +2304 +2305 +2308 +2310 +2312 +2314 +2316 +2317 +2319 +2320 +2322 +2324 +2325 +2328 +2330 +2331 +2332 +2333 +2335 +2336 +2337 +2338 +2339 +2340 +2343 +2344 +2345 +2346 +2348 +2350 +2351 +2352 +2353 +2354 +2356 +2358 +2359 +2360 +2365 +2366 +2368 +2369 +2370 +2371 +2373 +2374 +2375 +2376 +2377 +2381 +2382 +2383 +2385 +2386 +2387 +2388 +2393 +2394 +2395 +2397 +2398 +2401 +2403 +2404 +2406 +2407 +2408 +2409 +2411 +2412 +2415 +2421 +2422 +2423 +2424 +2425 +2431 +2432 +2434 +2440 +2442 +2443 +2446 +2447 +2453 +2454 +2455 +2457 +2458 +2459 +2460 +2461 +2462 +2465 +2469 +2470 +2471 +2473 +2477 +2478 +2479 +2480 +2481 +2485 +2486 +2487 +2489 +2495 +2501 +2503 +2504 +2506 +2509 +2510 +2512 +2515 +2516 +2517 +2518 +2522 +2525 +2526 +2527 +2528 +2529 +2530 +2533 +2534 +2536 +2538 +2539 +2540 +2543 +2546 +2548 +2551 +2552 +2553 +2555 +2556 +2559 +2561 +2564 +2565 +2566 +2567 +2570 +2572 +2573 +2574 +2575 +2577 +2580 +2581 +2583 +2584 +2585 +2589 +2591 +2592 +2594 +2595 +2596 +2597 +2600 +2603 +2604 +2607 +2608 +2610 +2613 +2614 +2615 +2618 +2621 +2622 +2624 +2625 +2626 +2627 +2630 +2631 +2632 +2634 +2638 +2640 +2641 +2642 +2643 +2644 +2645 +2646 +2648 +2649 +2650 +2653 +2655 +2656 +2657 +2658 +2661 +2662 +2665 +2666 +2667 +2668 +2671 +2672 +2674 +2675 +2676 +2677 +2679 +2681 +2683 +2684 +2686 +2688 +2691 +2694 +2695 +2696 +2701 +2703 +2704 +2705 +2706 +2707 +2708 +2709 +2710 +2711 +2712 +2713 +2714 +2719 +2720 +2721 +2725 +2726 +2728 +2731 +2732 +2735 +2736 +2737 +2738 +2740 +2741 +2743 +2744 +2745 +2748 +2749 +2750 +2751 +2754 +2756 +2758 +2759 +2764 +2767 +2769 +2770 +2771 +2772 +2773 +2775 +2776 +2780 +2781 +2783 +2784 +2786 +2787 +2788 +2789 +2790 +2794 +2798 +2799 +2800 +2802 +2803 +2804 +2806 +2808 +2810 +2811 +2813 +2814 +2815 +2820 +2822 +2824 +2826 +2827 +2829 +2830 +2831 +2832 +2833 +2834 +2835 +2837 +2840 +2841 +2845 +2846 +2847 +2848 +2851 +2852 +2854 +2855 +2856 +2858 +2860 +2861 +2862 +2863 +2867 +2868 +2869 +2870 +2871 +2872 +2874 +2876 +2877 +2878 +2881 +2882 +2883 +2884 +2886 +2887 +2889 +2890 +2891 +2892 +2893 +2894 +2895 +2897 +2898 +2900 +2903 +2906 +2907 +2911 +2912 +2914 +2915 +2917 +2920 +2924 +2925 +2930 +2931 +2932 +2933 +2934 +2936 +2938 +2939 +2940 +2941 +2942 +2943 +2944 +2945 +2947 +2948 +2949 +2951 +2952 +2953 +2955 +2956 +2957 +2958 +2959 +2962 +2964 +2965 +2966 +2967 +2968 +2972 +2973 +2979 +2981 +2982 +2983 +2985 +2986 +2988 +2990 +2992 +2993 +2996 +2999 +3000 +3001 +3002 +3005 +3006 +3007 +3009 +3012 +3013 +3015 +3016 +3017 +3018 +3019 +3020 +3022 +3023 +3024 +3027 +3028 +3029 +3031 +3033 +3035 +3036 +3037 +3038 +3041 +3044 +3045 +3046 +3048 +3050 +3051 +3052 +3053 +3055 +3056 +3058 +3059 +3062 +3064 +3065 +3067 +3068 +3069 +3071 +3072 +3073 +3074 +3075 +3079 +3083 +3085 +3087 +3088 +3089 +3090 +3092 +3094 +3099 +3100 +3101 +3102 +3105 +3106 +3107 +3109 +3110 +3116 +3117 +3122 +3124 +3125 +3126 +3127 +3129 +3131 +3132 +3133 +3134 +3135 +3138 +3139 +3140 +3143 +3144 +3146 +3148 +3151 +3153 +3154 +3158 +3159 +3160 +3161 +3162 +3163 +3165 +3166 +3167 +3168 +3169 +3170 +3172 +3181 +3184 +3186 +3187 +3188 +3189 +3191 +3192 +3193 +3194 +3195 +3198 +3199 +3201 +3202 +3203 +3204 +3206 +3209 +3211 +3216 +3221 +3224 +3225 +3226 +3228 +3229 +3231 +3232 +3236 +3238 +3239 +3241 +3242 +3246 +3249 +3251 +3252 +3253 +3254 +3256 +3261 +3263 +3264 +3266 +3268 +3269 +3272 +3274 +3275 +3277 +3278 +3281 +3282 +3284 +3285 +3287 +3288 +3289 +3290 +3291 +3292 +3293 +3294 +3295 +3297 +3299 +3300 +3301 +3302 +3306 +3307 +3308 +3309 +3310 +3311 +3313 +3319 +3320 +3321 +3322 +3323 +3325 +3326 +3327 +3328 +3329 +3331 +3332 +3334 +3335 +3336 +3337 +3338 +3340 +3342 +3343 +3345 +3346 +3348 +3351 +3352 +3354 +3358 +3359 +3360 +3362 +3363 +3366 +3369 +3371 +3374 +3375 +3376 +3380 +3381 +3382 +3386 +3389 +3390 +3391 +3395 +3397 +3399 +3402 +3404 +3406 +3407 +3411 +3413 +3414 +3415 +3417 +3418 +3419 +3420 +3421 +3422 +3426 +3428 +3429 +3430 +3431 +3434 +3443 +3444 +3445 +3446 +3447 +3448 +3450 +3451 +3452 +3455 +3456 +3457 +3458 +3460 +3461 +3462 +3464 +3465 +3466 +3467 +3468 +3469 +3470 +3471 +3472 +3473 +3476 +3478 +3480 +3481 +3483 +3484 +3485 +3487 +3488 +3489 +3491 +3492 +3494 +3496 +3498 +3499 +3503 +3504 +3506 +3507 +3509 +3511 +3512 +3515 +3520 +3521 +3523 +3524 +3527 +3529 +3530 +3532 +3533 +3534 +3536 +3538 +3539 +3540 +3541 +3542 +3544 +3546 +3547 +3554 +3555 +3556 +3560 +3561 +3562 +3563 +3565 +3566 +3567 +3571 +3574 +3575 +3578 +3579 +3580 +3581 +3582 +3583 +3584 +3585 +3587 +3589 +3590 +3591 +3592 +3593 +3594 +3595 +3596 +3599 +3600 +3601 +3604 +3605 +3609 +3610 +3611 +3613 +3614 +3615 +3616 +3619 +3621 +3622 +3623 +3624 +3627 +3628 +3631 +3632 +3633 +3635 +3637 +3640 +3641 +3643 +3644 +3646 +3648 +3650 +3654 +3655 +3656 +3657 +3658 +3660 +3661 +3662 +3663 +3669 +3670 +3671 +3672 +3673 +3677 +3678 +3679 +3682 +3686 +3687 +3688 +3689 +3690 +3693 +3694 +3698 +3699 +3700 +3703 +3704 +3705 +3707 +3709 +3710 +3711 +3714 +3715 +3718 +3723 +3724 +3725 +3726 +3727 +3729 +3730 +3731 +3733 +3734 +3735 +3736 +3737 +3738 +3743 +3744 +3745 +3748 +3749 +3751 +3752 +3753 +3757 +3760 +3762 +3764 +3766 +3767 +3768 +3769 +3770 +3772 +3774 +3775 +3778 +3779 +3780 +3782 +3784 +3786 +3789 +3791 +3792 +3793 +3794 +3795 +3796 +3799 +3800 +3801 +3802 +3805 +3806 +3810 +3812 +3814 +3815 +3817 +3818 +3819 +3820 +3823 +3824 +3830 +3833 +3835 +3837 +3838 +3839 +3842 +3843 +3849 +3851 +3853 +3855 +3856 +3857 +3858 +3859 +3861 +3862 +3863 +3865 +3866 +3867 +3868 +3869 +3872 +3874 +3876 +3877 +3878 +3880 +3884 +3886 +3888 +3891 +3892 +3895 +3898 +3903 +3904 +3905 +3906 +3908 +3909 +3914 +3915 +3916 +3917 +3918 +3919 +3920 +3922 +3923 +3924 +3925 +3930 +3932 +3934 +3936 +3938 +3939 +3940 +3946 +3947 +3948 +3949 +3952 +3954 +3955 +3958 +3959 +3960 +3962 +3965 +3967 +3970 +3974 +3975 +3976 +3977 +3978 +3980 +3981 +3985 +3987 +3988 +3991 +3997 +3998 +3999 +4001 +4002 +4003 +4004 +4005 +4006 +4008 +4009 +4015 +4017 +4021 +4022 +4023 +4026 +4027 +4028 +4029 +4030 +4033 +4035 +4036 +4037 +4039 +4041 +4044 +4050 +4051 +4053 +4056 +4057 +4060 +4067 +4068 +4073 +4075 +4076 +4077 +4078 +4080 +4081 +4082 +4085 +4090 +4091 +4092 +4094 +4096 +4098 +4101 +4102 +4104 +4107 +4111 +4112 +4113 +4115 +4117 +4118 +4120 +4122 +4124 +4130 +4134 +4136 +4137 +4141 +4142 +4144 +4147 +4150 +4152 +4153 +4154 +4157 +4158 +4159 +4161 +4163 +4164 +4165 +4168 +4169 +4170 +4171 +4174 +4175 +4176 +4181 +4182 +4184 +4188 +4189 +4190 +4191 +4193 +4194 +4195 +4196 +4197 +4198 +4200 +4203 +4204 +4207 +4208 +4209 +4210 +4211 +4212 +4214 +4215 +4216 +4218 +4221 +4222 +4223 +4224 +4225 +4226 +4227 +4228 +4230 +4231 +4233 +4235 +4236 +4237 +4239 +4242 +4244 +4245 +4247 +4250 +4251 +4252 +4253 +4255 +4257 +4258 +4262 +4263 +4264 +4265 +4266 +4268 +4271 +4272 +4273 +4276 +4277 +4278 +4279 +4282 +4283 +4285 +4286 +4289 +4292 +4294 +4295 +4297 +4299 +4301 +4304 +4306 +4308 +4310 +4313 +4319 +4321 +4323 +4324 +4328 +4329 +4330 +4332 +4335 +4336 +4340 +4342 +4355 +4356 +4357 +4360 +4363 +4364 +4365 +4366 +4368 +4370 +4371 +4372 +4373 +4374 +4376 +4377 +4378 +4379 +4380 +4382 +4383 +4384 +4385 +4387 +4390 +4392 +4393 +4395 +4396 +4398 +4399 +4400 +4401 +4402 +4403 +4405 +4406 +4407 +4408 +4411 +4412 +4414 +4415 +4420 +4422 +4426 +4429 +4430 +4431 +4436 +4439 +4440 +4441 +4442 +4443 +4444 +4448 +4449 +4452 +4453 +4455 +4456 +4457 +4458 +4462 +4463 +4465 +4466 +4467 +4468 +4473 +4475 +4476 +4477 +4479 +4480 +4481 +4482 +4483 +4484 +4488 +4491 +4492 +4493 +4495 +4496 +4499 +4500 +4501 +4504 +4505 +4509 +4510 +4512 +4513 +4514 +4516 +4517 +4518 +4519 +4528 +4529 +4530 +4532 +4533 +4534 +4535 +4537 +4541 +4542 +4547 +4549 +4550 +4552 +4555 +4556 +4559 +4561 +4562 +4563 +4564 +4565 +4566 +4568 +4571 +4573 +4574 +4575 +4576 +4578 +4579 +4581 +4582 +4584 +4587 +4589 +4592 +4593 +4594 +4595 +4596 +4597 +4599 +4600 +4604 +4606 +4610 +4612 +4615 +4617 +4618 +4620 +4622 +4623 +4625 +4626 +4627 +4628 +4631 +4635 +4637 +4640 +4641 +4643 +4644 +4646 +4649 +4651 +4653 +4657 +4659 +4660 +4661 +4662 +4663 +4664 +4667 +4670 +4675 +4678 +4681 +4684 +4688 +4691 +4692 +4696 +4697 +4698 +4700 +4703 +4704 +4705 +4707 +4709 +4710 +4711 +4715 +4719 +4720 +4722 +4725 +4728 +4729 +4732 +4733 +4735 +4739 +4742 +4746 +4749 +4750 +4751 +4752 +4753 +4754 +4755 +4756 +4758 +4760 +4761 +4762 +4763 +4765 +4766 +4767 +4768 +4769 +4770 +4775 +4776 +4777 +4778 +4779 +4780 +4781 +4782 +4784 +4787 +4788 +4789 +4790 +4794 +4799 +4800 +4801 +4804 +4805 +4806 +4808 +4810 +4811 +4812 +4813 +4815 +4816 +4817 +4818 +4819 +4822 +4823 +4825 +4827 +4829 +4831 +4833 +4834 +4837 +4838 +4839 +4840 +4844 +4847 +4848 +4851 +4853 +4855 +4856 +4859 +4860 +4861 +4866 +4867 +4868 +4869 +4871 +4872 +4873 +4875 +4876 +4877 +4880 +4881 +4883 +4885 +4886 +4887 +4888 +4889 +4890 +4891 +4893 +4898 +4904 +4905 +4909 +4910 +4913 +4914 +4915 +4916 +4917 +4920 +4922 +4923 +4924 +4925 +4930 +4931 +4932 +4933 +4935 +4936 +4937 +4938 +4939 +4942 +4944 +4947 +4952 +4955 +4956 +4957 +4958 +4959 +4961 +4963 +4967 +4969 +4970 +4971 +4972 +4973 +4974 +4977 +4981 +4985 +4986 +4989 +4990 +4992 +4993 +4996 +4999 +5000 +5001 +5003 +5012 +5015 +5018 +5019 +5021 +5022 +5023 +5025 +5028 +5029 +5035 +5036 +5041 +5042 +5043 +5046 +5047 +5049 +5050 +5058 +5059 +5063 +5065 +5068 +5069 +5070 +5071 +5072 +5073 +5074 +5077 +5078 +5082 +5083 +5084 +5086 +5087 +5090 +5092 +5093 +5094 +5096 +5097 +5100 +5103 +5104 +5105 +5106 +5110 +5113 +5117 +5118 +5119 +5120 +5122 +5123 +5124 +5125 +5126 +5130 +5131 +5132 +5135 +5138 +5139 +5146 +5147 +5149 +5151 +5152 +5153 +5154 +5155 +5157 +5160 +5161 +5165 +5167 +5170 +5172 +5173 +5174 +5177 +5178 +5179 +5182 +5183 +5185 +5187 +5188 +5189 +5191 +5194 +5195 +5196 +5197 +5199 +5201 +5202 +5203 +5208 +5209 +5211 +5214 +5218 +5219 +5220 +5221 +5222 +5224 +5226 +5227 +5229 +5230 +5232 +5234 +5235 +5238 +5243 +5245 +5246 +5248 +5249 +5250 +5251 +5252 +5253 +5256 +5260 +5261 +5267 +5268 +5269 +5271 +5278 +5279 +5285 +5289 +5291 +5292 +5294 +5296 +5297 +5299 +5300 +5303 +5304 +5309 +5311 +5312 +5315 +5320 +5321 +5322 +5324 +5326 +5328 +5329 +5332 +5333 +5334 +5337 +5339 +5340 +5341 +5343 +5344 +5347 +5348 +5349 +5350 +5352 +5353 +5355 +5359 +5360 +5361 +5363 +5366 +5367 +5368 +5369 +5371 +5372 +5374 +5375 +5376 +5377 +5385 +5386 +5388 +5392 +5393 +5394 +5395 +5398 +5399 +5400 +5401 +5402 +5404 +5406 +5408 +5409 +5410 +5412 +5413 +5414 +5416 +5417 +5419 +5420 +5421 +5425 +5426 +5427 +5431 +5432 +5433 +5435 +5437 +5440 +5441 +5443 +5448 +5449 +5452 +5453 +5456 +5457 +5459 +5460 +5462 +5463 +5466 +5468 +5470 +5471 +5474 +5475 +5478 +5480 +5485 +5487 +5488 +5490 +5491 +5494 +5496 +5500 +5509 +5510 +5512 +5514 +5517 +5518 +5519 +5520 +5521 +5527 +5530 +5533 +5537 +5538 +5540 +5541 +5544 +5546 +5548 +5549 +5551 +5554 +5557 +5559 +5560 +5562 +5563 +5564 +5573 +5574 +5575 +5576 +5580 +5581 +5583 +5584 +5586 +5589 +5591 +5592 +5598 +5600 +5601 +5605 +5608 +5610 +5615 +5618 +5620 +5621 +5624 +5625 +5626 +5628 +5630 +5632 +5633 +5638 +5640 +5641 +5643 +5645 +5647 +5652 +5655 +5658 +5660 +5661 +5662 +5663 +5664 +5667 +5668 +5669 +5670 +5672 +5673 +5674 +5679 +5681 +5682 +5690 +5693 +5697 +5698 +5702 +5703 +5705 +5712 +5715 +5718 +5719 +5721 +5722 +5726 +5728 +5737 +5739 +5743 +5744 +5745 +5747 +5748 +5750 +5752 +5755 +5756 +5757 +5759 +5760 +5764 +5767 +5768 +5770 +5772 +5773 +5775 +5776 +5781 +5782 +5783 +5785 +5787 +5788 +5790 +5792 +5793 +5795 +5796 +5797 +5799 +5800 +5801 +5802 +5803 +5805 +5806 +5807 +5808 +5810 +5815 +5818 +5821 +5822 +5823 +5827 +5829 +5830 +5835 +5836 +5840 +5842 +5844 +5846 +5849 +5853 +5854 +5857 +5859 +5860 +5866 +5869 +5870 +5872 +5873 +5875 +5876 +5878 +5881 +5882 +5883 +5884 +5886 +5888 +5893 +5894 +5900 +5901 +5902 +5903 +5904 +5906 +5908 +5910 +5911 +5914 +5918 +5920 +5922 +5925 +5926 +5927 +5928 +5932 +5933 +5934 +5935 +5938 +5940 +5942 +5944 +5945 +5947 +5950 +5952 +5954 +5956 +5960 +5961 +5963 +5966 +5970 +5974 +5975 +5983 +5985 +5986 +5990 +5995 +5996 +5997 +5999 +6000 +6003 +6006 +6010 +6011 +6012 +6013 +6015 +6016 +6020 +6025 +6026 +6028 +6030 +6031 +6033 +6035 +6037 +6038 +6041 +6042 +6044 +6045 +6046 +6048 +6056 +6057 +6058 +6061 +6062 +6064 +6065 +6071 +6074 +6078 +6088 +6095 +6098 +6099 +6100 +6102 +6103 +6105 +6106 +6110 +6112 +6116 +6119 +6120 +6122 +6123 +6125 +6126 +6133 +6136 +6137 +6140 +6142 +6147 +6148 +6149 +6151 +6152 +6154 +6155 +6163 +6166 +6168 +6171 +6172 +6173 +6174 +6175 +6177 +6178 +6179 +6180 +6182 +6183 +6185 +6186 +6188 +6189 +6190 +6193 +6194 +6195 +6197 +6199 +6200 +6204 +6205 +6207 +6208 +6209 +6212 +6216 +6218 +6219 +6222 +6223 +6225 +6226 +6230 +6232 +6235 +6237 +6240 +6241 +6242 +6247 +6251 +6252 +6256 +6260 +6262 +6263 +6264 +6267 +6269 +6270 +6271 +6272 +6275 +6276 +6278 +6280 +6282 +6286 +6290 +6293 +6298 +6299 +6302 +6304 +6306 +6310 +6315 +6316 +6317 +6319 +6321 +6323 +6326 +6327 +6329 +6333 +6334 +6344 +6345 +6346 +6348 +6349 +6350 +6353 +6354 +6355 +6356 +6363 +6366 +6369 +6370 +6374 +6376 +6378 +6383 +6385 +6390 +6392 +6396 +6399 +6400 +6402 +6403 +6404 +6405 +6406 +6408 +6410 +6411 +6412 +6415 +6416 +6418 +6419 +6420 +6421 +6423 +6424 +6425 +6426 +6427 +6428 +6429 +6431 +6433 +6438 +6439 +6440 +6441 +6442 +6443 +6444 +6448 +6450 +6454 +6457 +6460 +6461 +6462 +6467 +6468 +6476 +6479 +6480 +6483 +6484 +6485 +6488 +6495 +6503 +6507 +6515 +6516 +6517 +6518 +6520 +6521 +6530 +6531 +6532 +6537 +6538 +6539 +6546 +6550 +6554 +6557 +6561 +6562 +6563 +6565 +6566 +6570 +6572 +6574 +6578 +6583 +6585 +6586 +6593 +6595 +6596 +6597 +6598 +6600 +6601 +6607 +6608 +6609 +6611 +6613 +6620 +6627 +6630 +6633 +6635 +6636 +6637 +6639 +6640 +6641 +6642 +6644 +6645 +6650 +6651 +6653 +6654 +6657 +6662 +6663 +6664 +6665 +6667 +6671 +6673 +6674 +6678 +6679 +6681 +6684 +6686 +6689 +6690 +6692 +6693 +6694 +6696 +6698 +6701 +6703 +6705 +6707 +6712 +6713 +6714 +6716 +6717 +6718 +6720 +6726 +6728 +6730 +6731 +6732 +6733 +6735 +6742 +6743 +6745 +6746 +6747 +6752 +6755 +6759 +6760 +6761 +6762 +6764 +6768 +6772 +6773 +6774 +6775 +6781 +6784 +6787 +6791 +6792 +6795 +6798 +6800 +6803 +6806 +6807 +6810 +6811 +6814 +6816 +6817 +6821 +6824 +6828 +6829 +6830 +6832 +6838 +6842 +6843 +6847 +6850 +6853 +6854 +6857 +6858 +6859 +6860 +6862 +6863 +6864 +6866 +6867 +6870 +6871 +6874 +6875 +6876 +6878 +6880 +6883 +6884 +6885 +6888 +6891 +6896 +6900 +6905 +6906 +6907 +6908 +6909 +6910 +6912 +6913 +6914 +6917 +6919 +6923 +6930 +6932 +6934 +6935 +6939 +6940 +6941 +6942 +6944 +6945 +6946 +6948 +6950 +6951 +6953 +6957 +6961 +6962 +6963 +6972 +6974 +6976 +6977 +6978 +6979 +6980 +6981 +6983 +6986 +6990 +6993 +6995 +6997 +7000 +7006 +7011 +7013 +7015 +7018 +7019 +7024 +7025 +7026 +7028 +7031 +7032 +7035 +7038 +7042 +7043 +7044 +7049 +7051 +7054 +7055 +7057 +7058 +7059 +7060 +7061 +7062 +7064 +7068 +7070 +7072 +7073 +7078 +7081 +7084 +7085 +7087 +7090 +7092 +7095 +7096 +7100 +7107 +7108 +7110 +7111 +7114 +7118 +7120 +7122 +7124 +7125 +7132 +7134 +7138 +7139 +7144 +7147 +7148 +7150 +7151 +7153 +7160 +7161 +7162 +7168 +7169 +7170 +7177 +7179 +7182 +7183 +7184 +7186 +7187 +7188 +7190 +7200 +7202 +7203 +7208 +7211 +7212 +7215 +7216 +7217 +7218 +7221 +7224 +7225 +7227 +7229 +7232 +7233 +7235 +7237 +7238 +7247 +7248 +7249 +7250 +7252 +7254 +7258 +7261 +7265 +7268 +7269 +7271 +7272 +7274 +7276 +7278 +7280 +7281 +7284 +7286 +7288 +7293 +7295 +7300 +7301 +7302 +7304 +7308 +7310 +7311 +7319 +7320 +7321 +7323 +7328 +7331 +7332 +7335 +7336 +7339 +7340 +7342 +7347 +7348 +7349 +7350 +7353 +7355 +7356 +7362 +7365 +7369 +7373 +7374 +7377 +7380 +7382 +7385 +7387 +7390 +7395 +7398 +7399 +7402 +7405 +7415 +7417 +7418 +7419 +7420 +7422 +7424 +7432 +7434 +7437 +7438 +7440 +7441 +7445 +7450 +7452 +7454 +7455 +7457 +7460 +7461 +7466 +7467 +7469 +7470 +7471 +7474 +7475 +7478 +7480 +7485 +7488 +7490 +7491 +7492 +7493 +7494 +7495 +7496 +7497 +7498 +7500 +7502 +7508 +7510 +7516 +7518 +7520 +7521 +7522 +7523 +7525 +7533 +7536 +7539 +7540 +7543 +7547 +7548 +7549 +7550 +7553 +7554 +7556 +7559 +7560 +7562 +7564 +7567 +7570 +7574 +7575 +7579 +7580 +7585 +7586 +7587 +7589 +7591 +7594 +7596 +7598 +7602 +7603 +7607 +7608 +7610 +7611 +7615 +7620 +7621 +7627 +7630 +7632 +7634 +7637 +7639 +7642 +7644 +7646 +7650 +7651 +7660 +7661 +7666 +7668 +7679 +7680 +7683 +7685 +7686 +7690 +7691 +7692 +7696 +7698 +7700 +7702 +7703 +7704 +7705 +7710 +7711 +7712 +7715 +7716 +7718 +7720 +7726 +7727 +7731 +7733 +7735 +7739 +7740 +7741 +7742 +7747 +7750 +7755 +7757 +7758 +7759 +7760 +7763 +7764 +7765 +7766 +7768 +7769 +7771 +7779 +7784 +7786 +7789 +7790 +7793 +7797 +7800 +7801 +7802 +7805 +7808 +7810 +7814 +7816 +7817 +7820 +7821 +7823 +7824 +7827 +7828 +7829 +7834 +7835 +7839 +7840 +7842 +7844 +7846 +7849 +7850 +7856 +7859 +7864 +7865 +7866 +7868 +7870 +7877 +7879 +7880 +7881 +7891 +7895 +7900 +7905 +7907 +7910 +7913 +7915 +7916 +7918 +7919 +7920 +7922 +7924 +7925 +7932 +7934 +7936 +7940 +7942 +7943 +7946 +7950 +7951 +7952 +7959 +7964 +7968 +7971 +7978 +7980 +7983 +7984 +7986 +7988 +7989 +7990 +7993 +8000 +8002 +8004 +8005 +8010 +8015 +8017 +8028 +8030 +8032 +8038 +8040 +8043 +8046 +8047 +8050 +8052 +8053 +8054 +8057 +8058 +8059 +8060 +8061 +8062 +8064 +8070 +8071 +8072 +8076 +8077 +8081 +8082 +8086 +8087 +8090 +8093 +8107 +8108 +8109 +8110 +8111 +8113 +8117 +8118 +8120 +8121 +8124 +8126 +8128 +8130 +8131 +8132 +8134 +8136 +8139 +8140 +8141 +8142 +8143 +8144 +8146 +8149 +8150 +8152 +8155 +8156 +8157 +8166 +8170 +8173 +8174 +8178 +8180 +8182 +8188 +8189 +8190 +8191 +8199 +8201 +8204 +8210 +8213 +8217 +8220 +8224 +8227 +8230 +8234 +8235 +8237 +8242 +8243 +8245 +8246 +8250 +8251 +8253 +8254 +8255 +8260 +8263 +8265 +8266 +8267 +8275 +8279 +8281 +8282 +8283 +8284 +8285 +8286 +8292 +8293 +8294 +8296 +8300 +8302 +8305 +8307 +8310 +8317 +8318 +8324 +8325 +8326 +8332 +8336 +8341 +8344 +8345 +8346 +8349 +8352 +8355 +8360 +8361 +8362 +8364 +8365 +8370 +8372 +8374 +8376 +8377 +8380 +8383 +8388 +8390 +8395 +8399 +8401 +8404 +8406 +8407 +8410 +8411 +8420 +8421 +8423 +8427 +8430 +8432 +8435 +8437 +8441 +8442 +8450 +8451 +8459 +8460 +8461 +8463 +8468 +8471 +8480 +8481 +8483 +8489 +8495 +8496 +8499 +8500 +8501 +8512 +8516 +8520 +8521 +8522 +8528 +8529 +8530 +8535 +8536 +8542 +8548 +8549 +8553 +8555 +8557 +8563 +8566 +8570 +8573 +8576 +8580 +8584 +8586 +8587 +8588 +8590 +8592 +8596 +8597 +8598 +8600 +8601 +8603 +8608 +8610 +8613 +8620 +8621 +8622 +8624 +8625 +8630 +8633 +8635 +8640 +8643 +8644 +8646 +8653 +8659 +8661 +8662 +8665 +8666 +8670 +8673 +8678 +8681 +8683 +8685 +8694 +8705 +8706 +8709 +8724 +8725 +8730 +8732 +8734 +8741 +8745 +8748 +8749 +8750 +8753 +8754 +8756 +8757 +8758 +8761 +8762 +8764 +8777 +8780 +8781 +8782 +8787 +8788 +8791 +8795 +8796 +8798 +8800 +8810 +8812 +8815 +8818 +8820 +8821 +8822 +8825 +8827 +8828 +8830 +8832 +8835 +8838 +8840 +8841 +8845 +8846 +8849 +8850 +8851 +8855 +8859 +8860 +8866 +8869 +8870 +8871 +8874 +8877 +8878 +8879 +8883 +8887 +8888 +8889 +8890 +8891 +8892 +8893 +8898 +8904 +8906 +8907 +8910 +8911 +8916 +8919 +8920 +8921 +8922 +8924 +8926 +8930 +8933 +8936 +8940 +8947 +8954 +8960 +8963 +8964 +8968 +8969 +8970 +8974 +8977 +8978 +8979 +8980 +8985 +8989 +8995 +8996 +9000 +9001 +9010 +9014 +9015 +9027 +9028 +9030 +9031 +9044 +9049 +9051 +9054 +9060 +9064 +9066 +9069 +9070 +9071 +9072 +9073 +9075 +9079 +9080 +9085 +9090 +9091 +9092 +9097 +9098 +9105 +9107 +9110 +9112 +9117 +9125 +9126 +9129 +9130 +9137 +9140 +9142 +9150 +9155 +9158 +9160 +9161 +9163 +9167 +9169 +9170 +9176 +9178 +9180 +9190 +9193 +9198 +9200 +9206 +9207 +9209 +9212 +9215 +9217 +9226 +9227 +9228 +9229 +9230 +9233 +9234 +9236 +9242 +9248 +9249 +9252 +9260 +9269 +9273 +9275 +9279 +9280 +9281 +9282 +9284 +9285 +9290 +9299 +9300 +9305 +9314 +9323 +9324 +9325 +9329 +9335 +9336 +9340 +9342 +9343 +9347 +9348 +9350 +9352 +9353 +9356 +9360 +9362 +9363 +9367 +9370 +9380 +9383 +9384 +9391 +9395 +9396 +9398 +9400 +9403 +9405 +9406 +9409 +9410 +9414 +9419 +9420 +9428 +9437 +9438 +9440 +9452 +9453 +9456 +9457 +9459 +9461 +9470 +9477 +9480 +9481 +9485 +9487 +9491 +9496 +9499 +9500 +9501 +9504 +9508 +9510 +9519 +9520 +9527 +9536 +9539 +9543 +9546 +9547 +9550 +9555 +9556 +9565 +9566 +9570 +9572 +9575 +9577 +9578 +9579 +9581 +9586 +9587 +9591 +9600 +9605 +9607 +9610 +9611 +9614 +9619 +9620 +9625 +9628 +9630 +9635 +9644 +9646 +9647 +9650 +9652 +9656 +9666 +9670 +9673 +9676 +9680 +9683 +9684 +9686 +9688 +9692 +9694 +9696 +9697 +9700 +9701 +9702 +9704 +9706 +9709 +9710 +9717 +9718 +9719 +9720 +9730 +9734 +9744 +9748 +9750 +9767 +9769 +9777 +9778 +9790 +9810 +9820 +9822 +9824 +9827 +9830 +9832 +9834 +9840 +9844 +9850 +9851 +9852 +9860 +9864 +9865 +9870 +9873 +9882 +9886 +9890 +9896 +9900 +9903 +9904 +9914 +9917 +9918 +9919 +9930 +9932 +9934 +9938 +9939 +9940 +9943 +9950 +9952 +9956 +9960 +9972 +9977 +9980 +9993 +9995 +10000 +10004 +10005 +10007 +10020 +10021 +10023 +10030 +10031 +10032 +10033 +10034 +10038 +10040 +10049 +10051 +10058 +10075 +10080 +10081 +10085 +10103 +10106 +10111 +10112 +10113 +10115 +10116 +10120 +10122 +10128 +10140 +10145 +10148 +10150 +10151 +10155 +10156 +10166 +10167 +10169 +10170 +10174 +10188 +10193 +10200 +10203 +10208 +10210 +10211 +10217 +10220 +10223 +10230 +10233 +10240 +10243 +10245 +10246 +10250 +10255 +10260 +10269 +10270 +10272 +10279 +10280 +10282 +10284 +10285 +10289 +10291 +10299 +10300 +10302 +10303 +10323 +10324 +10325 +10326 +10327 +10333 +10338 +10340 +10344 +10349 +10350 +10352 +10360 +10370 +10373 +10374 +10376 +10388 +10393 +10396 +10398 +10400 +10407 +10411 +10420 +10428 +10435 +10447 +10450 +10461 +10465 +10468 +10473 +10474 +10478 +10480 +10485 +10490 +10492 +10498 +10502 +10503 +10504 +10507 +10510 +10514 +10527 +10528 +10540 +10550 +10558 +10560 +10562 +10564 +10570 +10572 +10580 +10585 +10600 +10604 +10608 +10610 +10616 +10619 +10644 +10660 +10670 +10672 +10674 +10690 +10692 +10695 +10699 +10700 +10701 +10707 +10710 +10712 +10718 +10719 +10722 +10730 +10732 +10738 +10740 +10741 +10743 +10744 +10746 +10752 +10770 +10771 +10777 +10780 +10788 +10790 +10794 +10810 +10811 +10814 +10823 +10829 +10833 +10835 +10848 +10852 +10860 +10865 +10867 +10870 +10875 +10877 +10878 +10882 +10888 +10906 +10908 +10909 +10910 +10915 +10922 +10930 +10933 +10950 +10958 +10960 +10964 +10974 +10980 +10984 +10989 +10990 +10992 +10993 +10996 +11005 +11017 +11018 +11023 +11025 +11027 +11029 +11030 +11039 +11040 +11048 +11049 +11050 +11063 +11073 +11076 +11079 +11080 +11091 +11094 +11102 +11106 +11108 +11109 +11110 +11115 +11121 +11124 +11128 +11146 +11147 +11160 +11170 +11177 +11180 +11181 +11185 +11187 +11190 +11195 +11200 +11205 +11219 +11220 +11229 +11250 +11253 +11256 +11258 +11260 +11270 +11277 +11279 +11280 +11283 +11286 +11289 +11296 +11297 +11299 +11301 +11309 +11312 +11328 +11330 +11339 +11340 +11342 +11346 +11350 +11356 +11358 +11359 +11360 +11366 +11370 +11372 +11378 +11382 +11383 +11390 +11400 +11403 +11404 +11410 +11419 +11428 +11430 +11439 +11441 +11446 +11447 +11450 +11458 +11463 +11464 +11470 +11475 +11487 +11489 +11490 +11499 +11500 +11515 +11518 +11520 +11524 +11537 +11540 +11546 +11550 +11553 +11555 +11560 +11570 +11571 +11572 +11577 +11580 +11583 +11589 +11590 +11595 +11600 +11601 +11610 +11611 +11620 +11622 +11630 +11631 +11640 +11643 +11644 +11645 +11650 +11660 +11664 +11669 +11670 +11681 +11683 +11686 +11687 +11690 +11694 +11696 +11700 +11710 +11719 +11720 +11723 +11730 +11742 +11744 +11750 +11760 +11770 +11772 +11773 +11780 +11781 +11784 +11791 +11800 +11801 +11802 +11813 +11815 +11816 +11820 +11839 +11840 +11849 +11855 +11856 +11857 +11858 +11860 +11862 +11869 +11880 +11890 +11899 +11900 +11905 +11910 +11920 +11930 +11932 +11939 +11940 +11946 +11959 +11960 +11980 +11987 +12000 +12006 +12009 +12027 +12030 +12035 +12040 +12047 +12050 +12070 +12080 +12087 +12088 +12091 +12093 +12115 +12120 +12125 +12130 +12139 +12140 +12146 +12148 +12160 +12162 +12166 +12168 +12170 +12172 +12177 +12180 +12186 +12188 +12189 +12192 +12194 +12200 +12201 +12204 +12218 +12219 +12220 +12221 +12225 +12228 +12230 +12237 +12239 +12240 +12242 +12244 +12250 +12256 +12260 +12261 +12266 +12270 +12282 +12294 +12295 +12296 +12300 +12301 +12306 +12317 +12320 +12321 +12322 +12323 +12326 +12327 +12330 +12340 +12343 +12371 +12380 +12385 +12390 +12391 +12392 +12400 +12408 +12412 +12419 +12422 +12423 +12427 +12430 +12440 +12441 +12446 +12450 +12452 +12455 +12468 +12477 +12480 +12486 +12497 +12499 +12500 +12503 +12510 +12526 +12528 +12530 +12540 +12550 +12559 +12560 +12570 +12578 +12592 +12601 +12608 +12610 +12616 +12619 +12620 +12640 +12644 +12650 +12660 +12663 +12670 +12677 +12680 +12681 +12684 +12691 +12695 +12701 +12720 +12721 +12730 +12740 +12744 +12750 +12760 +12762 +12767 +12769 +12770 +12773 +12779 +12780 +12784 +12797 +12801 +12803 +12819 +12820 +12830 +12840 +12849 +12850 +12860 +12879 +12882 +12885 +12896 +12903 +12916 +12918 +12925 +12930 +12934 +12940 +12942 +12943 +12950 +12953 +12960 +12966 +12971 +12990 +12997 +13000 +13020 +13025 +13026 +13030 +13040 +13041 +13043 +13049 +13053 +13059 +13060 +13061 +13067 +13070 +13072 +13100 +13111 +13120 +13124 +13126 +13139 +13140 +13160 +13171 +13178 +13179 +13180 +13191 +13198 +13200 +13215 +13226 +13230 +13231 +13244 +13246 +13260 +13269 +13270 +13272 +13274 +13276 +13280 +13288 +13300 +13305 +13309 +13312 +13324 +13330 +13333 +13340 +13350 +13351 +13360 +13361 +13367 +13368 +13370 +13372 +13380 +13382 +13388 +13410 +13422 +13427 +13430 +13440 +13450 +13466 +13470 +13475 +13480 +13490 +13491 +13495 +13504 +13508 +13509 +13517 +13520 +13539 +13540 +13546 +13548 +13559 +13560 +13562 +13570 +13574 +13580 +13587 +13590 +13591 +13592 +13594 +13596 +13598 +13599 +13610 +13620 +13622 +13626 +13632 +13635 +13639 +13641 +13650 +13660 +13674 +13680 +13683 +13688 +13690 +13700 +13704 +13709 +13710 +13720 +13722 +13730 +13733 +13740 +13744 +13750 +13758 +13760 +13782 +13786 +13792 +13800 +13813 +13820 +13821 +13832 +13833 +13841 +13849 +13860 +13866 +13870 +13873 +13877 +13880 +13882 +13890 +13900 +13905 +13907 +13929 +13936 +13954 +13970 +13980 +14010 +14038 +14040 +14041 +14043 +14050 +14056 +14060 +14068 +14080 +14087 +14090 +14091 +14092 +14100 +14106 +14110 +14114 +14120 +14130 +14132 +14146 +14154 +14157 +14168 +14172 +14173 +14177 +14190 +14193 +14205 +14215 +14223 +14230 +14234 +14238 +14250 +14253 +14255 +14260 +14278 +14285 +14300 +14311 +14318 +14320 +14321 +14324 +14331 +14342 +14343 +14358 +14364 +14369 +14370 +14382 +14390 +14396 +14400 +14402 +14410 +14420 +14430 +14440 +14442 +14444 +14460 +14480 +14491 +14499 +14504 +14510 +14513 +14520 +14530 +14540 +14545 +14550 +14560 +14570 +14578 +14580 +14591 +14597 +14600 +14605 +14607 +14610 +14618 +14622 +14625 +14629 +14630 +14650 +14654 +14675 +14679 +14689 +14690 +14698 +14700 +14705 +14710 +14717 +14720 +14729 +14746 +14750 +14757 +14759 +14760 +14773 +14780 +14789 +14790 +14795 +14800 +14803 +14811 +14819 +14820 +14821 +14823 +14828 +14850 +14860 +14894 +14895 +14898 +14900 +14901 +14905 +14910 +14916 +14938 +14940 +14960 +14970 +14997 +14998 +15007 +15025 +15030 +15038 +15040 +15052 +15053 +15060 +15070 +15074 +15077 +15080 +15090 +15110 +15133 +15136 +15137 +15144 +15147 +15150 +15157 +15160 +15170 +15175 +15178 +15190 +15200 +15201 +15210 +15213 +15220 +15249 +15250 +15260 +15261 +15267 +15278 +15283 +15289 +15292 +15296 +15300 +15311 +15321 +15331 +15340 +15355 +15365 +15375 +15380 +15381 +15400 +15408 +15409 +15418 +15424 +15425 +15430 +15434 +15440 +15445 +15460 +15470 +15479 +15480 +15484 +15490 +15492 +15502 +15505 +15507 +15520 +15527 +15530 +15549 +15554 +15563 +15567 +15570 +15579 +15582 +15584 +15600 +15610 +15630 +15645 +15649 +15650 +15652 +15655 +15665 +15670 +15684 +15685 +15688 +15690 +15695 +15730 +15734 +15747 +15756 +15765 +15779 +15782 +15792 +15817 +15823 +15828 +15830 +15836 +15840 +15845 +15849 +15860 +15870 +15871 +15876 +15877 +15880 +15892 +15901 +15907 +15930 +15954 +16000 +16003 +16004 +16010 +16040 +16050 +16054 +16072 +16080 +16090 +16100 +16109 +16110 +16119 +16125 +16150 +16162 +16184 +16190 +16198 +16210 +16214 +16230 +16234 +16240 +16245 +16246 +16258 +16273 +16280 +16282 +16290 +16300 +16310 +16311 +16321 +16329 +16330 +16332 +16343 +16362 +16369 +16377 +16379 +16382 +16383 +16386 +16387 +16390 +16399 +16400 +16432 +16433 +16440 +16450 +16453 +16463 +16470 +16490 +16493 +16495 +16500 +16506 +16510 +16511 +16520 +16540 +16543 +16557 +16560 +16570 +16573 +16579 +16599 +16610 +16617 +16628 +16645 +16647 +16648 +16660 +16661 +16664 +16668 +16670 +16678 +16680 +16683 +16687 +16718 +16720 +16723 +16737 +16762 +16772 +16777 +16780 +16783 +16784 +16792 +16797 +16800 +16805 +16814 +16820 +16831 +16837 +16840 +16853 +16857 +16858 +16860 +16865 +16867 +16869 +16892 +16894 +16900 +16916 +16920 +16923 +16925 +16927 +16930 +16948 +16950 +16980 +16990 +17000 +17010 +17020 +17035 +17045 +17071 +17080 +17085 +17087 +17090 +17091 +17100 +17129 +17130 +17139 +17142 +17148 +17160 +17166 +17196 +17200 +17203 +17218 +17220 +17230 +17232 +17240 +17248 +17260 +17267 +17271 +17303 +17310 +17317 +17319 +17330 +17346 +17350 +17360 +17370 +17380 +17390 +17402 +17406 +17410 +17411 +17412 +17440 +17449 +17464 +17465 +17466 +17470 +17480 +17490 +17500 +17520 +17525 +17536 +17540 +17542 +17550 +17570 +17579 +17580 +17585 +17600 +17610 +17615 +17617 +17620 +17626 +17629 +17640 +17644 +17650 +17666 +17673 +17680 +17690 +17700 +17722 +17740 +17750 +17760 +17770 +17780 +17790 +17798 +17800 +17825 +17867 +17880 +17900 +17928 +17930 +17932 +17940 +17950 +17955 +17957 +17960 +17977 +17987 +17990 +18000 +18012 +18022 +18030 +18050 +18054 +18060 +18080 +18081 +18086 +18090 +18099 +18110 +18130 +18134 +18144 +18150 +18151 +18157 +18162 +18170 +18171 +18204 +18220 +18223 +18242 +18252 +18260 +18283 +18286 +18288 +18289 +18300 +18302 +18320 +18330 +18336 +18337 +18339 +18356 +18360 +18380 +18390 +18400 +18420 +18427 +18428 +18430 +18451 +18456 +18468 +18469 +18487 +18490 +18510 +18513 +18515 +18517 +18522 +18530 +18546 +18582 +18591 +18599 +18600 +18616 +18620 +18627 +18643 +18650 +18670 +18672 +18720 +18731 +18732 +18740 +18770 +18774 +18778 +18780 +18782 +18790 +18797 +18810 +18828 +18839 +18850 +18860 +18891 +18900 +18908 +18910 +18919 +18920 +18921 +18924 +18930 +18939 +18960 +18975 +18980 +18990 +19000 +19004 +19016 +19028 +19029 +19030 +19032 +19040 +19050 +19062 +19090 +19095 +19100 +19110 +19139 +19149 +19167 +19170 +19176 +19180 +19182 +19189 +19191 +19200 +19208 +19220 +19230 +19260 +19261 +19270 +19290 +19295 +19300 +19306 +19308 +19310 +19312 +19317 +19318 +19340 +19341 +19346 +19360 +19370 +19371 +19400 +19405 +19410 +19413 +19420 +19442 +19446 +19450 +19465 +19470 +19480 +19490 +19494 +19500 +19510 +19520 +19525 +19542 +19554 +19570 +19595 +19609 +19610 +19612 +19620 +19622 +19644 +19658 +19667 +19680 +19696 +19709 +19720 +19740 +19743 +19770 +19782 +19795 +19798 +19807 +19830 +19836 +19840 +19841 +19850 +19883 +19885 +19896 +19900 +19910 +19913 +19920 +19971 +19989 +19990 +20000 +20008 +20042 +20050 +20065 +20090 +20105 +20110 +20114 +20117 +20119 +20130 +20167 +20180 +20184 +20190 +20195 +20198 +20200 +20210 +20233 +20244 +20250 +20251 +20285 +20299 +20305 +20320 +20328 +20330 +20360 +20381 +20422 +20428 +20442 +20448 +20450 +20453 +20460 +20470 +20480 +20490 +20504 +20520 +20561 +20571 +20580 +20610 +20620 +20627 +20633 +20634 +20640 +20670 +20683 +20687 +20690 +20730 +20731 +20750 +20760 +20763 +20778 +20800 +20820 +20830 +20840 +20841 +20850 +20851 +20853 +20870 +20883 +20906 +20927 +20931 +20935 +20940 +20958 +20970 +20980 +20987 +20993 +21005 +21017 +21039 +21040 +21043 +21062 +21064 +21066 +21070 +21100 +21108 +21118 +21119 +21120 +21130 +21133 +21142 +21155 +21168 +21170 +21180 +21188 +21200 +21221 +21245 +21248 +21270 +21280 +21287 +21297 +21305 +21309 +21312 +21320 +21330 +21369 +21372 +21376 +21380 +21390 +21415 +21440 +21455 +21468 +21470 +21480 +21513 +21530 +21550 +21560 +21599 +21600 +21621 +21638 +21672 +21687 +21688 +21690 +21702 +21710 +21726 +21767 +21800 +21803 +21815 +21820 +21824 +21827 +21830 +21837 +21846 +21870 +21883 +21904 +21909 +21936 +21940 +21972 +21991 +21995 +22000 +22001 +22010 +22038 +22050 +22054 +22080 +22098 +22100 +22105 +22108 +22130 +22136 +22160 +22180 +22192 +22212 +22215 +22240 +22250 +22260 +22262 +22267 +22269 +22274 +22277 +22280 +22290 +22293 +22317 +22320 +22337 +22341 +22420 +22421 +22500 +22510 +22516 +22530 +22532 +22571 +22580 +22600 +22608 +22648 +22654 +22657 +22659 +22660 +22690 +22700 +22710 +22716 +22720 +22730 +22758 +22760 +22764 +22765 +22766 +22770 +22780 +22789 +22800 +22801 +22820 +22830 +22844 +22848 +22875 +22881 +22888 +22890 +22900 +22903 +22920 +22927 +22960 +22970 +22989 +22990 +23000 +23045 +23050 +23100 +23118 +23119 +23120 +23121 +23123 +23124 +23133 +23180 +23200 +23220 +23221 +23238 +23240 +23246 +23260 +23280 +23320 +23330 +23334 +23350 +23395 +23398 +23410 +23427 +23450 +23480 +23500 +23501 +23505 +23510 +23517 +23540 +23545 +23556 +23561 +23579 +23600 +23602 +23606 +23620 +23630 +23694 +23696 +23699 +23710 +23740 +23742 +23750 +23760 +23767 +23779 +23804 +23822 +23837 +23841 +23862 +23870 +23878 +23880 +23900 +23906 +23920 +23930 +24000 +24016 +24020 +24047 +24079 +24093 +24120 +24139 +24140 +24141 +24145 +24160 +24161 +24175 +24180 +24200 +24210 +24220 +24227 +24250 +24278 +24280 +24298 +24300 +24306 +24317 +24353 +24356 +24386 +24400 +24410 +24420 +24457 +24461 +24465 +24500 +24501 +24510 +24528 +24583 +24610 +24637 +24657 +24690 +24740 +24761 +24770 +24796 +24798 +24800 +24809 +24826 +24830 +24840 +24841 +24843 +24855 +24869 +24880 +24890 +24893 +24900 +24906 +24911 +24915 +24958 +24960 +24990 +25038 +25040 +25057 +25060 +25090 +25092 +25160 +25168 +25180 +25187 +25200 +25233 +25250 +25260 +25286 +25300 +25306 +25310 +25320 +25330 +25343 +25380 +25430 +25432 +25440 +25460 +25470 +25486 +25500 +25517 +25519 +25533 +25540 +25544 +25560 +25575 +25580 +25584 +25590 +25596 +25630 +25640 +25650 +25690 +25700 +25710 +25720 +25740 +25746 +25770 +25780 +25800 +25806 +25810 +25834 +25840 +25858 +25860 +25863 +25868 +25877 +25878 +25880 +25910 +25930 +25931 +25950 +25962 +25964 +25980 +26034 +26078 +26110 +26120 +26127 +26134 +26140 +26166 +26167 +26188 +26190 +26210 +26251 +26302 +26330 +26338 +26350 +26421 +26453 +26469 +26470 +26490 +26500 +26528 +26546 +26550 +26554 +26600 +26637 +26638 +26650 +26660 +26664 +26670 +26680 +26690 +26700 +26770 +26772 +26782 +26790 +26800 +26808 +26820 +26822 +26840 +26860 +26861 +26870 +26940 +26980 +26989 +27008 +27030 +27071 +27080 +27081 +27091 +27100 +27106 +27110 +27143 +27152 +27159 +27176 +27180 +27188 +27200 +27212 +27226 +27251 +27269 +27280 +27290 +27307 +27317 +27330 +27350 +27360 +27381 +27386 +27389 +27390 +27391 +27410 +27432 +27440 +27442 +27450 +27463 +27500 +27520 +27527 +27548 +27557 +27590 +27624 +27627 +27647 +27658 +27667 +27670 +27680 +27700 +27730 +27780 +27796 +27800 +27804 +27809 +27840 +27850 +27852 +27861 +27894 +27920 +27922 +27923 +27948 +27980 +27983 +27988 +27990 +28000 +28019 +28020 +28060 +28087 +28090 +28100 +28110 +28140 +28148 +28160 +28163 +28164 +28229 +28251 +28261 +28273 +28290 +28300 +28305 +28351 +28353 +28389 +28400 +28425 +28430 +28500 +28510 +28520 +28530 +28531 +28550 +28580 +28658 +28700 +28780 +28783 +28791 +28823 +28832 +28856 +28860 +28872 +28900 +28928 +28929 +28947 +28953 +28980 +29000 +29030 +29067 +29080 +29090 +29100 +29130 +29144 +29150 +29160 +29175 +29180 +29192 +29200 +29203 +29240 +29280 +29290 +29300 +29320 +29360 +29385 +29455 +29456 +29459 +29470 +29480 +29492 +29500 +29519 +29540 +29580 +29585 +29590 +29619 +29648 +29652 +29660 +29666 +29699 +29710 +29736 +29740 +29761 +29776 +29831 +29848 +29852 +29860 +29870 +29900 +29930 +29961 +30012 +30040 +30070 +30100 +30109 +30111 +30150 +30200 +30252 +30261 +30271 +30280 +30290 +30300 +30312 +30320 +30340 +30354 +30400 +30417 +30420 +30433 +30452 +30472 +30490 +30495 +30510 +30523 +30563 +30590 +30592 +30600 +30660 +30680 +30690 +30710 +30720 +30722 +30736 +30738 +30750 +30770 +30821 +30894 +30900 +30912 +30920 +30930 +30970 +30998 +31000 +31013 +31028 +31048 +31060 +31080 +31087 +31090 +31092 +31100 +31114 +31118 +31160 +31169 +31197 +31200 +31222 +31230 +31240 +31248 +31282 +31300 +31316 +31320 +31330 +31352 +31355 +31370 +31394 +31400 +31420 +31430 +31440 +31475 +31476 +31500 +31510 +31542 +31549 +31550 +31580 +31600 +31612 +31614 +31629 +31655 +31682 +31694 +31700 +31740 +31780 +31830 +31835 +31848 +31850 +31860 +31870 +31900 +31903 +31952 +32010 +32050 +32099 +32100 +32117 +32130 +32173 +32200 +32210 +32223 +32250 +32256 +32298 +32300 +32310 +32330 +32335 +32360 +32370 +32380 +32390 +32400 +32440 +32460 +32470 +32480 +32537 +32630 +32656 +32680 +32695 +32764 +32800 +32808 +32812 +32820 +32840 +32867 +32870 +32900 +32940 +33050 +33066 +33075 +33100 +33130 +33144 +33150 +33160 +33170 +33220 +33270 +33280 +33300 +33303 +33380 +33390 +33400 +33410 +33480 +33485 +33496 +33500 +33530 +33590 +33600 +33650 +33700 +33736 +33770 +33797 +33810 +33900 +33951 +33960 +34020 +34027 +34040 +34058 +34059 +34074 +34088 +34100 +34132 +34140 +34177 +34190 +34194 +34200 +34207 +34208 +34236 +34240 +34270 +34280 +34284 +34300 +34310 +34381 +34400 +34414 +34453 +34481 +34485 +34500 +34520 +34544 +34546 +34631 +34700 +34748 +34773 +34780 +34820 +34857 +34910 +34956 +34962 +35000 +35010 +35067 +35070 +35080 +35087 +35100 +35117 +35132 +35140 +35204 +35277 +35281 +35283 +35289 +35300 +35310 +35320 +35353 +35370 +35410 +35440 +35470 +35512 +35520 +35548 +35569 +35579 +35600 +35610 +35616 +35650 +35680 +35700 +35760 +35771 +35785 +35820 +35833 +35849 +35870 +35880 +35900 +35910 +36000 +36061 +36080 +36103 +36122 +36180 +36200 +36212 +36277 +36290 +36300 +36393 +36400 +36500 +36559 +36560 +36570 +36585 +36630 +36670 +36690 +36710 +36717 +36750 +36870 +36890 +36900 +36940 +36946 +36956 +36980 +37000 +37030 +37062 +37113 +37120 +37122 +37130 +37170 +37294 +37297 +37300 +37340 +37385 +37400 +37410 +37440 +37466 +37480 +37530 +37549 +37550 +37600 +37613 +37670 +37750 +37754 +37760 +37770 +37830 +37865 +37883 +37885 +37890 +37905 +37940 +37950 +38090 +38120 +38124 +38160 +38172 +38192 +38194 +38200 +38210 +38216 +38236 +38249 +38250 +38260 +38272 +38280 +38290 +38300 +38320 +38360 +38362 +38419 +38451 +38492 +38523 +38548 +38575 +38580 +38595 +38620 +38640 +38661 +38680 +38717 +38744 +38841 +39055 +39070 +39125 +39140 +39220 +39258 +39300 +39322 +39331 +39365 +39405 +39590 +39593 +39615 +39640 +39662 +39673 +39722 +39740 +39743 +39750 +39760 +39770 +39794 +39799 +39800 +39863 +39900 +39937 +39940 +39943 +39960 +39989 +40000 +40010 +40040 +40096 +40100 +40180 +40200 +40211 +40217 +40220 +40280 +40330 +40352 +40360 +40400 +40500 +40550 +40597 +40606 +40617 +40660 +40675 +40730 +40746 +40800 +40820 +40830 +40998 +41000 +41010 +41020 +41040 +41067 +41077 +41105 +41131 +41138 +41218 +41239 +41240 +41389 +41390 +41420 +41432 +41491 +41500 +41550 +41600 +41620 +41628 +41630 +41664 +41690 +41710 +41740 +41773 +41800 +41816 +41838 +41980 +41987 +41990 +42047 +42121 +42134 +42200 +42253 +42350 +42383 +42410 +42434 +42440 +42510 +42520 +42550 +42570 +42611 +42636 +42640 +42682 +42700 +42800 +42861 +42900 +42940 +42980 +43060 +43080 +43090 +43100 +43140 +43170 +43181 +43231 +43240 +43265 +43351 +43376 +43400 +43411 +43423 +43520 +43540 +43549 +43600 +43627 +43670 +43790 +43803 +43810 +43844 +43917 +43920 +44000 +44050 +44100 +44110 +44160 +44170 +44240 +44290 +44300 +44330 +44360 +44370 +44409 +44420 +44426 +44470 +44600 +44620 +44627 +44630 +44640 +44691 +44810 +44860 +44870 +44900 +44940 +45020 +45110 +45191 +45210 +45310 +45400 +45450 +45454 +45488 +45490 +45618 +45633 +45680 +45700 +45710 +45715 +45743 +45832 +45854 +45855 +45966 +46080 +46111 +46142 +46200 +46222 +46236 +46303 +46317 +46340 +46390 +46400 +46484 +46500 +46510 +46600 +46640 +46667 +46750 +46800 +46890 +46900 +47000 +47087 +47090 +47100 +47150 +47154 +47206 +47400 +47597 +47600 +47620 +47640 +47647 +47665 +47701 +47760 +47800 +47840 +47880 +48000 +48100 +48150 +48180 +48190 +48200 +48360 +48406 +48440 +48460 +48520 +48540 +48600 +48700 +48779 +48877 +48970 +48980 +49000 +49040 +49060 +49090 +49097 +49160 +49450 +49521 +49522 +49527 +49600 +49660 +49663 +49700 +49800 +49810 +49869 +49875 +49910 +50000 +50032 +50047 +50050 +50065 +50123 +50198 +50200 +50240 +50400 +50407 +50420 +50498 +50540 +50626 +50640 +50687 +50700 +50709 +50785 +50900 +50980 +50990 +51020 +51100 +51166 +51170 +51173 +51247 +51257 +51300 +51320 +51410 +51480 +51557 +51563 +51600 +51680 +51710 +51722 +51740 +51753 +51770 +51780 +51900 +51970 +52000 +52065 +52200 +52254 +52320 +52400 +52500 +52530 +52600 +52601 +52640 +52720 +52750 +52800 +52864 +52874 +52899 +52900 +53000 +53070 +53128 +53180 +53213 +53240 +53550 +53670 +53695 +53745 +53800 +53806 +53810 +53812 +53838 +53946 +54000 +54010 +54100 +54140 +54200 +54299 +54310 +54550 +54570 +54750 +54782 +54822 +54867 +54888 +54910 +54980 +54990 +55100 +55200 +55241 +55300 +55310 +55340 +55405 +55460 +55600 +55690 +55750 +55790 +55800 +55900 +56000 +56037 +56052 +56130 +56200 +56255 +56300 +56363 +56500 +56600 +56605 +56629 +56630 +56800 +56854 +56890 +57000 +57170 +57183 +57200 +57310 +57351 +57352 +57470 +57492 +57540 +57550 +57600 +57603 +57668 +57693 +57711 +57940 +57950 +58014 +58047 +58080 +58100 +58200 +58250 +58300 +58364 +58460 +58560 +58600 +58630 +58646 +58700 +58760 +58970 +59000 +59081 +59100 +59170 +59300 +59408 +59443 +59460 +59472 +59546 +59580 +59600 +59680 +59760 +59800 +59810 +59820 +59914 +59917 +59934 +59992 +60000 +60114 +60149 +60250 +60280 +60330 +60335 +60400 +60420 +60450 +60455 +60520 +60522 +60610 +60720 +60756 +60776 +60890 +60961 +61000 +61015 +61020 +61040 +61119 +61220 +61250 +61260 +61269 +61270 +61292 +61400 +61476 +61540 +61568 +61640 +61648 +61690 +61700 +61780 +61787 +61851 +61852 +61890 +62090 +62100 +62160 +62240 +62320 +62440 +62500 +62610 +62750 +62800 +62900 +63000 +63140 +63200 +63300 +63424 +63430 +63600 +63710 +63760 +63880 +63890 +63940 +64020 +64025 +64090 +64161 +64210 +64300 +64400 +64500 +64581 +64600 +64604 +64628 +64648 +64680 +64708 +64736 +64740 +64779 +64782 +64890 +64910 +64994 +65000 +65013 +65221 +65240 +65250 +65265 +65300 +65330 +65440 +65496 +65600 +65700 +65790 +65800 +65893 +65900 +65979 +66070 +66100 +66140 +66190 +66200 +66231 +66265 +66327 +66430 +66490 +66500 +66510 +66647 +66689 +66780 +66800 +66847 +66850 +66900 +67000 +67170 +67175 +67180 +67200 +67251 +67300 +67370 +67400 +67500 +67649 +67700 +67723 +67780 +67800 +67936 +68160 +68189 +68200 +68302 +68470 +68517 +68570 +68640 +68700 +68710 +68763 +68776 +68840 +68894 +68900 +68950 +69000 +69131 +69180 +69306 +69406 +69430 +69468 +69520 +69540 +69553 +69600 +69608 +69610 +69880 +69890 +70100 +70143 +70190 +70200 +70220 +70300 +70390 +70400 +70500 +70659 +70760 +70791 +71019 +71300 +71450 +71474 +71600 +71700 +71749 +71800 +71870 +71900 +71930 +72040 +72188 +72213 +72270 +72416 +72440 +72453 +72475 +72500 +72580 +72710 +72800 +73045 +73101 +73134 +73219 +73326 +73354 +73430 +73451 +73710 +73718 +73790 +73803 +73890 +73900 +73950 +74200 +74349 +74388 +74565 +74580 +74597 +74900 +75000 +75025 +75100 +75152 +75210 +75230 +75295 +75322 +75477 +75500 +75700 +75779 +75780 +75810 +75991 +76020 +76040 +76150 +76400 +76430 +76700 +76769 +77000 +77100 +77195 +77200 +77252 +77330 +77342 +77542 +77600 +77700 +78000 +78003 +78100 +78160 +78207 +78300 +78340 +78400 +78596 +78782 +78870 +79038 +79091 +79100 +79180 +79210 +79300 +79500 +79620 +79622 +79862 +79900 +79960 +80000 +80500 +80570 +80580 +80600 +80622 +80720 +80781 +81000 +81200 +81230 +81280 +81454 +81600 +81700 +81740 +81832 +81855 +81900 +81930 +82000 +82090 +82400 +82500 +82560 +82600 +82650 +82700 +82720 +83013 +83130 +83214 +83370 +83500 +83570 +83580 +83590 +83620 +83720 +83765 +83845 +83919 +84000 +84008 +84052 +84124 +84343 +84490 +84570 +84979 +85000 +85320 +85540 +85560 +85682 +86040 +86175 +86200 +86250 +86570 +86599 +86907 +86920 +87000 +87050 +87198 +87305 +87400 +87420 +87500 +87840 +87900 +88000 +88100 +88150 +88170 +88200 +89089 +89400 +89427 +89500 +89700 +89751 +89830 +89860 +90000 +90105 +90300 +90419 +90500 +90580 +90600 +91022 +91114 +91168 +91200 +91248 +91280 +91400 +91850 +91865 +91900 +92000 +92200 +92400 +92624 +92643 +92800 +92813 +92870 +92980 +93000 +93080 +93200 +93389 +93499 +93600 +93769 +93790 +94000 +94070 +94130 +94300 +94560 +94569 +94600 +94700 +94930 +95052 +95100 +95160 +95200 +95253 +95300 +95390 +95450 +95580 +95780 +95800 +96000 +96100 +96138 +96200 +96300 +96430 +96495 +96660 +96969 +97150 +97216 +97300 +97400 +97440 +97630 +97640 +97735 +97800 +97832 +97930 +98000 +98100 +98184 +98200 +98660 +98700 +98759 +99000 +99177 +99266 +99297 +99453 +99486 +99851 +99940 +100000 +100100 +100400 +100500 +100880 +100900 +100933 +101200 +101300 +101400 +101480 +101652 +101666 +101685 +101989 +102000 +102200 +102300 +102652 +102787 +103000 +103200 +103240 +103990 +104642 +104871 +105000 +105075 +105200 +105399 +105535 +105740 +105950 +106000 +106160 +106167 +106500 +106772 +107370 +107400 +107401 +107414 +107603 +107657 +107700 +107701 +107986 +108000 +108190 +108200 +108216 +108278 +108490 +108761 +109000 +109020 +109120 +109130 +109361 +110500 +110691 +110940 +110950 +110970 +111460 +111500 +111766 +111800 +111919 +112000 +112450 +112855 +112890 +113170 +113233 +113260 +113689 +113709 +114000 +114030 +114050 +114200 +114446 +114500 +114580 +114726 +114770 +114800 +114902 +115400 +115470 +116000 +117000 +117500 +117605 +117690 +117692 +118000 +118228 +119000 +119782 +119830 +119960 +120151 +120325 +120400 +120600 +120814 +120980 +121246 +121300 +121482 +121813 +121911 +122100 +122120 +122490 +122848 +123360 +123924 +125000 +125294 +125853 +126000 +126100 +126400 +126510 +126581 +126600 +126699 +127400 +128000 +128700 +128760 +128910 +129000 +129100 +129300 +129500 +129600 +129640 +129700 +129930 +130000 +130180 +130200 +130494 +130700 +131000 +131461 +131478 +131491 +131500 +131600 +132000 +132100 +132500 +132900 +134000 +134400 +134900 +135000 +135050 +136000 +136520 +136600 +136790 +137000 +137100 +137700 +138000 +138700 +139000 +139437 +139704 +140000 +140440 +140458 +140540 +140751 +141100 +141119 +141458 +141810 +142000 +142300 +142551 +142842 +143600 +143884 +143900 +144040 +144995 +145345 +146517 +147404 +147705 +148230 +148303 +148810 +149000 +149073 +149900 +150000 +150551 +150564 +150700 +151000 +151691 +152000 +152023 +153000 +153200 +154000 +154200 +155310 +155909 +156000 +156360 +156490 +156530 +156780 +156900 +157540 +157560 +158000 +158425 +159085 +159352 +160410 +161610 +161700 +161800 +162000 +162100 +162200 +162893 +163700 +163830 +164000 +164068 +164200 +164684 +165000 +166030 +166165 +166400 +167000 +167178 +168146 +168700 +169636 +169910 +170000 +170010 +170701 +172000 +172075 +172189 +172235 +173479 +173975 +174200 +174390 +174530 +175120 +175591 +175640 +175900 +176431 +177000 +177500 +178082 +178400 +179000 +179390 +179900 +180000 +180107 +180590 +180610 +180900 +181000 +181300 +181490 +182000 +182990 +183000 +183050 +183100 +183940 +185554 +186150 +186800 +189000 +189100 +189670 +190000 +191538 +191600 +191781 +192500 +192740 +193200 +193820 +194040 +194309 +194790 +195504 +195824 +195937 +196000 +196110 +196687 +196775 +197000 +197200 +198000 +199950 +200000 +200222 +201270 +201900 +202000 +202562 +203432 +204020 +204390 +204800 +205000 +206000 +206110 +206900 +207000 +207148 +207990 +208150 +209000 +209400 +209678 +210000 +210400 +210684 +211900 +212000 +212200 +212700 +214290 +214330 +214700 +215227 +215500 +216000 +216200 +217000 +217400 +217410 +217700 +218000 +219070 +219552 +220000 +220970 +221000 +221690 +222913 +224000 +224900 +224918 +225000 +225490 +226712 +226833 +227000 +228780 +230000 +230400 +234150 +234410 +236205 +236804 +237300 +237700 +238800 +238860 +239003 +240431 +241048 +242700 +243794 +244794 +245700 +246000 +246190 +247000 +248000 +248240 +249500 +250000 +250300 +250400 +250929 +251579 +253200 +253750 +257100 +257600 +258000 +258123 +258681 +259000 +259570 +260160 +261200 +261750 +261800 +263813 +264600 +264800 +265240 +265500 +265900 +266757 +267120 +267346 +268500 +268742 +269240 +270000 +270132 +271000 +274000 +275000 +275100 +276500 +276620 +279101 +279140 +280000 +281490 +283570 +284702 +287300 +288440 +290000 +292304 +294000 +294862 +296480 +296600 +299800 +300000 +300500 +301120 +301232 +301868 +302000 +302114 +302270 +306000 +307130 +310000 +310240 +310396 +313591 +316000 +317500 +317741 +318000 +318731 +318999 +319531 +319800 +320000 +320300 +321000 +322520 +323214 +327300 +330000 +335480 +338740 +339000 +339160 +340000 +341930 +342000 +342300 +345410 +346030 +349000 +349700 +350000 +350530 +352000 +353170 +353220 +356000 +357300 +358000 +358170 +362920 +363000 +365000 +370000 +373400 +380000 +382310 +383500 +384220 +385685 +394610 +400000 +402000 +403900 +403942 +404987 +407139 +412130 +412200 +413443 +420400 +423057 +429600 +430480 +431780 +433000 +433150 +433530 +440000 +440700 +442480 +445190 +449046 +450000 +452000 +455509 +463000 +464000 +464730 +468200 +471840 +472850 +472900 +474310 +476780 +477600 +478552 +480000 +485000 +491000 +491500 +493700 +499598 +499700 +500000 +503100 +506005 +506270 +507000 +511700 +514592 +521640 +527620 +530000 +531898 +532400 +539900 +540000 +540126 +545400 +545586 +547638 +550000 +552171 +553500 +555717 +561127 +566690 +571000 +572210 +580000 +583380 +587000 +587860 +589400 +590400 +597000 +600000 +614000 +616000 +617290 +620000 +625000 +630000 +634000 +637422 +643030 +647200 +662200 +662600 +666200 +667016 +670000 +671279 +673600 +674000 +679700 +680000 +682897 +686220 +686730 +690100 +697000 +699030 +700000 +705000 +710000 +718440 +719000 +719570 +731452 +739000 +739640 +747200 +749600 +750000 +755470 +756719 +766900 +777520 +800000 +803800 +810000 +825212 +833300 +838700 +867200 +873134 +879350 +885550 +903000 +910000 +938079 +952000 +956700 +959670 +964000 +993300 +994000 +996900 +1000000 +1010000 +1022000 +1030000 +1050860 +1055492 +1060890 +1097432 +1099300 +1100000 +1102700 +1117000 +1141236 +1170000 +1185900 +1200000 +1207000 +1226450 +1242000 +1253840 +1300000 +1300780 +1332804 +1343800 +1360000 +1400000 +1435225 +1446280 +1459558 +1462100 +1465890 +1530000 +1556805 +1574400 +1580000 +1605000 +1613800 +1651931 +1668600 +1700000 +1800000 +1816096 +1920362 +1950000 +2000000 +2040100 +2050000 +2190000 +2308000 +2370210 +2400000 +2418000 +2491276 +2500000 +2521000 +2553560 +2610000 +2640969 +2700000 +2893870 +2962000 +3000000 +3053100 +3090000 +3129707 +3179920 +3321107 +3336106 +3406660 +3739270 +3817000 +3835000 +3900000 +4000000 +4032220 +4340000 +4530000 +4554000 +4714500 +5240500 +5338800 +5585700 +5600000 +5655380 +5799000 +6000000 +6249000 +6400000 +6598100 +6700000 +6940000 +7000000 +7400000 +7520000 +7890000 +8000000 +9000000 +9089710 +10000000 +11000000 +17400000 +20200000 +44353022 +75889000 diff --git a/third_party/chinese_text_normalization/thrax/src/number_data/random-tst.txt b/third_party/chinese_text_normalization/thrax/src/number_data/random-tst.txt new file mode 100644 index 000000000..efce19a97 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/number_data/random-tst.txt @@ -0,0 +1,1000 @@ +209 +220 +250 +254 +263 +266 +276 +303 +310 +317 +322 +364 +386 +405 +414 +424 +429 +489 +505 +520 +523 +525 +554 +624 +627 +640 +665 +680 +704 +715 +723 +741 +742 +775 +776 +845 +847 +851 +868 +898 +921 +927 +972 +973 +984 +986 +994 +1038 +1055 +1077 +1079 +1083 +1090 +1123 +1137 +1161 +1184 +1186 +1235 +1257 +1258 +1285 +1302 +1307 +1311 +1358 +1369 +1372 +1383 +1391 +1418 +1441 +1442 +1447 +1476 +1478 +1509 +1535 +1548 +1550 +1571 +1581 +1593 +1615 +1623 +1639 +1660 +1686 +1688 +1717 +1735 +1782 +1813 +1815 +1824 +1831 +1875 +1881 +1924 +1931 +1949 +1951 +1966 +1970 +1984 +1990 +1992 +2012 +2013 +2024 +2040 +2058 +2062 +2064 +2067 +2075 +2116 +2130 +2135 +2171 +2197 +2200 +2215 +2220 +2226 +2246 +2259 +2277 +2294 +2303 +2318 +2342 +2347 +2349 +2355 +2364 +2413 +2419 +2420 +2433 +2441 +2445 +2451 +2468 +2488 +2498 +2499 +2500 +2502 +2514 +2523 +2524 +2557 +2568 +2598 +2609 +2612 +2629 +2685 +2697 +2718 +2724 +2734 +2739 +2760 +2763 +2779 +2796 +2797 +2809 +2818 +2828 +2839 +2842 +2850 +2857 +2864 +2916 +2923 +2984 +2987 +2991 +2994 +3021 +3025 +3026 +3054 +3070 +3080 +3086 +3098 +3114 +3121 +3130 +3136 +3137 +3157 +3175 +3182 +3200 +3233 +3245 +3250 +3270 +3298 +3303 +3330 +3341 +3347 +3368 +3392 +3394 +3398 +3400 +3427 +3435 +3441 +3449 +3474 +3477 +3497 +3501 +3525 +3526 +3551 +3570 +3576 +3597 +3612 +3630 +3636 +3639 +3649 +3651 +3675 +3692 +3719 +3742 +3773 +3785 +3790 +3850 +3870 +3873 +3875 +3885 +3910 +3926 +3927 +3928 +3941 +3943 +3945 +3950 +3961 +3971 +3990 +3992 +3996 +4010 +4013 +4018 +4024 +4032 +4047 +4065 +4069 +4079 +4089 +4097 +4114 +4125 +4127 +4148 +4155 +4173 +4180 +4206 +4249 +4256 +4284 +4298 +4303 +4305 +4345 +4354 +4409 +4417 +4433 +4437 +4470 +4474 +4486 +4494 +4527 +4538 +4544 +4572 +4629 +4630 +4634 +4647 +4652 +4654 +4658 +4680 +4699 +4747 +4748 +4773 +4791 +4852 +4863 +4884 +4907 +4927 +4943 +4953 +5027 +5032 +5037 +5080 +5095 +5108 +5134 +5163 +5168 +5186 +5210 +5236 +5237 +5265 +5273 +5283 +5330 +5351 +5362 +5396 +5438 +5446 +5465 +5495 +5511 +5526 +5534 +5556 +5567 +5611 +5639 +5642 +5725 +5738 +5751 +5774 +5777 +5786 +5813 +5837 +5864 +5879 +5885 +5889 +5898 +5921 +5924 +5946 +5955 +5959 +5968 +5976 +5981 +6021 +6047 +6049 +6080 +6158 +6162 +6170 +6176 +6206 +6214 +6220 +6243 +6253 +6261 +6284 +6307 +6322 +6330 +6338 +6367 +6413 +6430 +6434 +6437 +6470 +6492 +6499 +6504 +6512 +6660 +6670 +6680 +6699 +6710 +6737 +6741 +6751 +6776 +6779 +6802 +6819 +6890 +6892 +6969 +6970 +7040 +7045 +7052 +7063 +7065 +7088 +7128 +7129 +7133 +7155 +7164 +7166 +7181 +7210 +7219 +7234 +7236 +7256 +7266 +7270 +7303 +7364 +7370 +7378 +7499 +7593 +7629 +7633 +7640 +7675 +7709 +7753 +7791 +7792 +7812 +7838 +7860 +7890 +7972 +8014 +8025 +8096 +8106 +8123 +8154 +8159 +8200 +8228 +8343 +8381 +8429 +8490 +8515 +8526 +8560 +8568 +8579 +8658 +8668 +8672 +8688 +8710 +8731 +8739 +8752 +8771 +8790 +8833 +8900 +8917 +8929 +9002 +9035 +9043 +9067 +9078 +9122 +9138 +9144 +9183 +9199 +9211 +9235 +9240 +9257 +9330 +9385 +9390 +9450 +9512 +9523 +9530 +9535 +9564 +9596 +9601 +9602 +9603 +9626 +9655 +9691 +9695 +9772 +9780 +9808 +9849 +9881 +9911 +9923 +9946 +9970 +9986 +10009 +10019 +10168 +10178 +10180 +10190 +10290 +10348 +10470 +10520 +10525 +10535 +10545 +10627 +10675 +10715 +10757 +10772 +10786 +10896 +10940 +10970 +11000 +11101 +11120 +11132 +11192 +11201 +11209 +11265 +11337 +11392 +11549 +11557 +11567 +11736 +11767 +11807 +11814 +11866 +11881 +11913 +12073 +12098 +12111 +12137 +12291 +12370 +12376 +12397 +12435 +12439 +12443 +12511 +12520 +12567 +12575 +12615 +12700 +12710 +12726 +12729 +12814 +12822 +12883 +12890 +12910 +12915 +12980 +13069 +13075 +13127 +13193 +13209 +13386 +13390 +13393 +13511 +13586 +13607 +13625 +13630 +13647 +13656 +13763 +13810 +13910 +13979 +13991 +14073 +14096 +14111 +14170 +14210 +14259 +14306 +14350 +14351 +14360 +14479 +14587 +14613 +14736 +14745 +14797 +14810 +14822 +14824 +14830 +15020 +15068 +15118 +15197 +15230 +15270 +15310 +15404 +15510 +15603 +15680 +15700 +15721 +15820 +15928 +15990 +16012 +16018 +16030 +16073 +16123 +16243 +16275 +16501 +16690 +16710 +16765 +16870 +16958 +17014 +17030 +17138 +17190 +17272 +17409 +17424 +17430 +17477 +17678 +17684 +17687 +17820 +17840 +17898 +18097 +18219 +18284 +18349 +18525 +18634 +18680 +19042 +19070 +19084 +19120 +19151 +19250 +19389 +19679 +19932 +20080 +20100 +20133 +20321 +20440 +20801 +20819 +20969 +21190 +21300 +21340 +21350 +21360 +21490 +21531 +21640 +21728 +21796 +21831 +21860 +22040 +22208 +22282 +22410 +22566 +22850 +23060 +23196 +23380 +24190 +24350 +24360 +24380 +24475 +24480 +24491 +24521 +24644 +24695 +24747 +24760 +24945 +25000 +25510 +25754 +25870 +26200 +26300 +26410 +26447 +26472 +26510 +27000 +27017 +27400 +27430 +27531 +27600 +27740 +27870 +28200 +28544 +28570 +28618 +28629 +28716 +28753 +28850 +29027 +29040 +29045 +29129 +29190 +29404 +29600 +29970 +30030 +30050 +30190 +30375 +30500 +30700 +30778 +30790 +30838 +31310 +31379 +31480 +31547 +31698 +31986 +32600 +32991 +33417 +33603 +34751 +34900 +34980 +35059 +35101 +35190 +35496 +35500 +35707 +35761 +36320 +36496 +36893 +37200 +37520 +37780 +38370 +38500 +38600 +39200 +39575 +39580 +40324 +40560 +41222 +41300 +41485 +41973 +43110 +43229 +44097 +44550 +44666 +45078 +45085 +45090 +45600 +46170 +46772 +47060 +48280 +48500 +48518 +49400 +49430 +50100 +50167 +50359 +50800 +51386 +51390 +51531 +51800 +52092 +52100 +52590 +52663 +52670 +52738 +52990 +53025 +53450 +53600 +53620 +54070 +54505 +56160 +56165 +57100 +57730 +58825 +58900 +60151 +60500 +61306 +61710 +62250 +62270 +62400 +63310 +63960 +64235 +64760 +65200 +65654 +66240 +66400 +66600 +68670 +68920 +71000 +71400 +72630 +72700 +72860 +73700 +75841 +76108 +77122 +79220 +79400 +79670 +81110 +83574 +84100 +84500 +86090 +87078 +87300 +87860 +88340 +88880 +89154 +89950 +92600 +96220 +96870 +97503 +99600 +101000 +104000 +105100 +105570 +106900 +108290 +108400 +110840 +110975 +113773 +115000 +116500 +119200 +124720 +127000 +127780 +128200 +128966 +138900 +140900 +141000 +141228 +144000 +145000 +145061 +147245 +147562 +148450 +152218 +154990 +158775 +159940 +161000 +161300 +163500 +165500 +170559 +176000 +178000 +184000 +188800 +196100 +204400 +204880 +210900 +216616 +220930 +238000 +239740 +257226 +265000 +271590 +273200 +285810 +309620 +315612 +320959 +321500 +341400 +348697 +350260 +359030 +360000 +360600 +376500 +378265 +383070 +394740 +410000 +446000 +471750 +497384 +510600 +560000 +590000 +608400 +696900 +704000 +1448374 +2256800 +3275000 +3980000 +4500000 +5066940 +5166299 +7113500 +9842447 +13020696 +70477170 diff --git a/third_party/chinese_text_normalization/thrax/src/ru/README.md b/third_party/chinese_text_normalization/thrax/src/ru/README.md new file mode 100644 index 000000000..c02d2935d --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/README.md @@ -0,0 +1,6 @@ +# Russian covering grammar definitions + +This directory defines a Russian text normalization covering grammar. The +primary entry-point is the FST `VERBALIZER`, defined in +`verbalizer/verbalizer.grm` and compiled in the FST archive +`verbalizer/verbalizer.far`. diff --git a/third_party/chinese_text_normalization/thrax/src/ru/classifier/cyrillic.grm b/third_party/chinese_text_normalization/thrax/src/ru/classifier/cyrillic.grm new file mode 100644 index 000000000..0672e45a1 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/classifier/cyrillic.grm @@ -0,0 +1,58 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +export kRussianLowerAlpha = Optimize[ + "а" | "б" | "в" | "г" | "д" | "е" | "ё" | "ж" | "з" | "и" | "й" | + "к" | "л" | "м" | "н" | "о" | "п" | "р" | "с" | "т" | "у" | "ф" | + "х" | "ц" | "ч" | "ш" | "щ" | "ъ" | "ы" | "ь" | "э" | "ю" | "я" ]; + +export kRussianUpperAlpha = Optimize[ + "А" | "Б" | "В" | "Г" | "Д" | "Е" | "Ё" | "Ж" | "З" | "И" | "Й" | + "К" | "Л" | "М" | "Н" | "О" | "П" | "Р" | "С" | "Т" | "У" | "Ф" | + "Х" | "Ц" | "Ч" | "Ш" | "Щ" | "Ъ" | "Ы" | "Ь" | "Э" | "Ю" | "Я" ]; + +export kRussianLowerAlphaStressed = Optimize[ + "а́" | "е́" | "ё́" | "и́" | "о́" | "у́" | "ы́" | "э́" | "ю́" | "я́" ]; + +export kRussianUpperAlphaStressed = Optimize[ + "А́" | "Е́" | "Ё́" | "И́" | "О́" | "У́" | "Ы́" | "Э́" | "Ю́" | "Я́" ]; + +export kRussianRewriteStress = Optimize[ + ("А́" : "А'") | ("Е́" : "Е'") | ("Ё́" : "Ё'") | ("И́" : "И'") | + ("О́" : "О'") | ("У́" : "У'") | ("Ы́" : "Ы'") | ("Э́" : "Э'") | + ("Ю́" : "Ю'") | ("Я́" : "Я'") | + ("а́" : "а'") | ("е́" : "е'") | ("ё́" : "ё'") | ("и́" : "и'") | + ("о́" : "о'") | ("у́" : "у'") | ("ы́" : "ы'") | ("э́" : "э'") | + ("ю́" : "ю'") | ("я́" : "я'") +]; + +export kRussianRemoveStress = Optimize[ + ("А́" : "А") | ("Е́" : "Е") | ("Ё́" : "Ё") | ("И́" : "И") | ("О́" : "О") | + ("У́" : "У") | ("Ы́" : "Ы") | ("Э́" : "Э") | ("Ю́" : "Ю") | ("Я́" : "Я") | + ("а́" : "а") | ("е́" : "е") | ("ё́" : "ё") | ("и́" : "и") | ("о́" : "о") | + ("у́" : "у") | ("ы́" : "ы") | ("э́" : "э") | ("ю́" : "ю") | ("я́" : "я") +]; + +# Pre-reform characters, just in case. +export kRussianPreReform = Optimize[ + "ѣ" | "Ѣ" # http://en.wikipedia.org/wiki/Yat +]; + +export kCyrillicAlphaStressed = Optimize[ + kRussianLowerAlphaStressed | kRussianUpperAlphaStressed +]; + +export kCyrillicAlpha = Optimize[ + kRussianLowerAlpha | kRussianUpperAlpha | kRussianPreReform +]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/cardinals-lex.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/cardinals-lex.grm new file mode 100644 index 000000000..c07a7ae1c --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/cardinals-lex.grm @@ -0,0 +1,338 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# AUTOMATICALLY GENERATED: DO NOT EDIT. +import 'util/byte.grm' as b; + +# Utilities for insertion and deletion. + +func I[expr] { + return "" : expr; +} + +func D[expr] { + return expr : ""; +} + +# Powers of base 10. +export POWERS = + "[E15]" + | "[E14]" + | "[E13]" + | "[E12]" + | "[E11]" + | "[E10]" + | "[E9]" + | "[E8]" + | "[E7]" + | "[E6]" + | "[E5]" + | "[E4]" + | "[E3]" + | "[E2]" + | "[E1]" +; + +export SIGMA = b.kBytes | POWERS; + +export SIGMA_STAR = SIGMA*; + +export SIGMA_PLUS = SIGMA+; + +################################################################################ +# BEGIN LANGUAGE SPECIFIC DATA +revaluations = + ("[E4]" : "[E1]") + | ("[E5]" : "[E2]") + | ("[E7]" : "[E1]") + | ("[E8]" : "[E2]") +; + +Ms = "[E3]" | "[E6]" | "[E9]"; + + +func Zero[expr] { + return expr : (""); +} + +space = " "; + +lexset3 = Optimize[ + ("1[E1]+1" : "одиннадцати") + | ("1[E1]+1" : "одиннадцать") + | ("1[E1]+1" : "одиннадцатью") + | ("1[E1]+2" : "двенадцати") + | ("1[E1]+2" : "двенадцать") + | ("1[E1]+2" : "двенадцатью") + | ("1[E1]+3" : "тринадцати") + | ("1[E1]+3" : "тринадцать") + | ("1[E1]+3" : "тринадцатью") + | ("1[E1]+4" : "четырнадцати") + | ("1[E1]+4" : "четырнадцать") + | ("1[E1]+4" : "четырнадцатью") + | ("1[E1]+5" : "пятнадцати") + | ("1[E1]+5" : "пятнадцать") + | ("1[E1]+5" : "пятнадцатью") + | ("1[E1]+6" : "шестнадцати") + | ("1[E1]+6" : "шестнадцать") + | ("1[E1]+6" : "шестнадцатью") + | ("1[E1]+7" : "семнадцати") + | ("1[E1]+7" : "семнадцать") + | ("1[E1]+7" : "семнадцатью") + | ("1[E1]+8" : "восемнадцати") + | ("1[E1]+8" : "восемнадцать") + | ("1[E1]+8" : "восемнадцатью") + | ("1[E1]+9" : "девятнадцати") + | ("1[E1]+9" : "девятнадцать") + | ("1[E1]+9" : "девятнадцатью")] +; + +lex3 = CDRewrite[lexset3 I[space], "", "", SIGMA_STAR]; + +lexset2 = Optimize[ + ("1[E1]" : "десяти") + | ("1[E1]" : "десять") + | ("1[E1]" : "десятью") + | ("1[E2]" : "ста") + | ("1[E2]" : "сто") + | ("2[E1]" : "двадцати") + | ("2[E1]" : "двадцать") + | ("2[E1]" : "двадцатью") + | ("2[E2]" : "двести") + | ("2[E2]" : "двумстам") + | ("2[E2]" : "двумястами") + | ("2[E2]" : "двухсот") + | ("2[E2]" : "двухстах") + | ("3[E1]" : "тридцати") + | ("3[E1]" : "тридцать") + | ("3[E1]" : "тридцатью") + | ("3[E2]" : "тремстам") + | ("3[E2]" : "тремястами") + | ("3[E2]" : "трехсот") + | ("3[E2]" : "трехстах") + | ("3[E2]" : "триста") + | ("4[E1]" : "сорок") + | ("4[E1]" : "сорока") + | ("4[E2]" : "четыремстам") + | ("4[E2]" : "четыреста") + | ("4[E2]" : "четырехсот") + | ("4[E2]" : "четырехстах") + | ("4[E2]" : "четырьмястами") + | ("5[E1]" : "пятидесяти") + | ("5[E1]" : "пятьдесят") + | ("5[E1]" : "пятьюдесятью") + | ("5[E2]" : "пятисот") + | ("5[E2]" : "пятистам") + | ("5[E2]" : "пятистах") + | ("5[E2]" : "пятьсот") + | ("5[E2]" : "пятьюстами") + | ("6[E1]" : "шестидесяти") + | ("6[E1]" : "шестьдесят") + | ("6[E1]" : "шестьюдесятью") + | ("6[E2]" : "шестисот") + | ("6[E2]" : "шестистам") + | ("6[E2]" : "шестистах") + | ("6[E2]" : "шестьсот") + | ("6[E2]" : "шестьюстами") + | ("7[E1]" : "семидесяти") + | ("7[E1]" : "семьдесят") + | ("7[E1]" : "семьюдесятью") + | ("7[E2]" : "семисот") + | ("7[E2]" : "семистам") + | ("7[E2]" : "семистах") + | ("7[E2]" : "семьсот") + | ("7[E2]" : "семьюстами") + | ("8[E1]" : "восемьдесят") + | ("8[E1]" : "восьмидесяти") + | ("8[E1]" : "восьмьюдесятью") + | ("8[E2]" : "восемьсот") + | ("8[E2]" : "восемьюстами") + | ("8[E2]" : "восьмисот") + | ("8[E2]" : "восьмистам") + | ("8[E2]" : "восьмистах") + | ("8[E2]" : "восьмьюстами") + | ("9[E1]" : "девяноста") + | ("9[E1]" : "девяносто") + | ("9[E2]" : "девятисот") + | ("9[E2]" : "девятистам") + | ("9[E2]" : "девятистах") + | ("9[E2]" : "девятьсот") + | ("9[E2]" : "девятьюстами")] +; + +lex2 = CDRewrite[lexset2 I[space], "", "", SIGMA_STAR]; + +lexset1 = Optimize[ + ("+" : "") + | ("1" : "один") + | ("1" : "одна") + | ("1" : "одни") + | ("1" : "одним") + | ("1" : "одними") + | ("1" : "одних") + | ("1" : "одно") + | ("1" : "одного") + | ("1" : "одной") + | ("1" : "одном") + | ("1" : "одному") + | ("1" : "одною") + | ("1" : "одну") + | ("2" : "два") + | ("2" : "две") + | ("2" : "двум") + | ("2" : "двумя") + | ("2" : "двух") + | ("3" : "трем") + | ("3" : "тремя") + | ("3" : "трех") + | ("3" : "три") + | ("4" : "четыре") + | ("4" : "четырем") + | ("4" : "четырех") + | ("4" : "четырьмя") + | ("5" : "пяти") + | ("5" : "пять") + | ("5" : "пятью") + | ("6" : "шести") + | ("6" : "шесть") + | ("6" : "шестью") + | ("7" : "семи") + | ("7" : "семь") + | ("7" : "семью") + | ("8" : "восемь") + | ("8" : "восьми") + | ("8" : "восьмью") + | ("9" : "девяти") + | ("9" : "девять") + | ("9" : "девятью") + | ("[E3]" : "тысяч") + | ("[E3]" : "тысяча") + | ("[E3]" : "тысячам") + | ("[E3]" : "тысячами") + | ("[E3]" : "тысячах") + | ("[E3]" : "тысяче") + | ("[E3]" : "тысячей") + | ("[E3]" : "тысячи") + | ("[E3]" : "тысячу") + | ("[E3]" : "тысячью") + | ("[E6]" : "миллион") + | ("[E6]" : "миллиона") + | ("[E6]" : "миллионам") + | ("[E6]" : "миллионами") + | ("[E6]" : "миллионах") + | ("[E6]" : "миллионе") + | ("[E6]" : "миллионов") + | ("[E6]" : "миллионом") + | ("[E6]" : "миллиону") + | ("[E6]" : "миллионы") + | ("[E9]" : "миллиард") + | ("[E9]" : "миллиарда") + | ("[E9]" : "миллиардам") + | ("[E9]" : "миллиардами") + | ("[E9]" : "миллиардах") + | ("[E9]" : "миллиарде") + | ("[E9]" : "миллиардов") + | ("[E9]" : "миллиардом") + | ("[E9]" : "миллиарду") + | ("[E9]" : "миллиарды") + | ("|0|" : "ноле") + | ("|0|" : "нолем") + | ("|0|" : "ноль") + | ("|0|" : "нолю") + | ("|0|" : "ноля") + | ("|0|" : "нуле") + | ("|0|" : "нулем") + | ("|0|" : "нуль") + | ("|0|" : "нулю") + | ("|0|" : "нуля")] +; + +lex1 = CDRewrite[lexset1 I[space], "", "", SIGMA_STAR]; + +export LEX = Optimize[lex3 @ lex2 @ lex1]; + +export INDEPENDENT_EXPONENTS = "[E3]" | "[E6]" | "[E9]"; + +# END LANGUAGE SPECIFIC DATA +################################################################################ +# Inserts a marker after the Ms. +export INSERT_BOUNDARY = CDRewrite["" : "%", Ms, "", SIGMA_STAR]; + +# Deletes all powers and "+". +export DELETE_POWERS = CDRewrite[D[POWERS | "+"], "", "", SIGMA_STAR]; + +# Deletes trailing zeros at the beginning of a number, so that "0003" does not +# get treated as an ordinary number. +export DELETE_INITIAL_ZEROS = + CDRewrite[("0" POWERS "+") : "", "[BOS]", "", SIGMA_STAR] +; + +NonMs = Optimize[POWERS - Ms]; + +# Deletes (usually) zeros before a non-M. E.g., +0[E1] should be deleted. +export DELETE_INTERMEDIATE_ZEROS1 = + CDRewrite[Zero["+0" NonMs], "", "", SIGMA_STAR] +; + +# Deletes (usually) zeros before an M, if there is no non-zero element between +# that and the previous boundary. Thus, if after the result of the rule above we +# end up with "%+0[E3]", then that gets deleted. Also (really) deletes a final +# zero. +export DELETE_INTERMEDIATE_ZEROS2 = Optimize[ + CDRewrite[Zero["%+0" Ms], "", "", SIGMA_STAR] + @ CDRewrite[D["+0"], "", "[EOS]", SIGMA_STAR]] +; + +# Final clean up of stray zeros. +export DELETE_REMAINING_ZEROS = Optimize[ + CDRewrite[Zero["+0"], "", "", SIGMA_STAR] + @ CDRewrite[Zero["0"], "", "", SIGMA_STAR]] +; + +# Applies the revaluation map. For example in English, changes [E4] to [E1] as a +# modifier of [E3]. +export REVALUE = CDRewrite[revaluations, "", "", SIGMA_STAR]; + +# Deletes the various marks and powers in the input and output. +export DELETE_MARKS = CDRewrite[D["%" | "+" | POWERS], "", "", SIGMA_STAR]; + +export CLEAN_SPACES = Optimize[ + CDRewrite[" "+ : " ", b.kNotSpace, b.kNotSpace, SIGMA_STAR] + @ CDRewrite[" "* : "", "[BOS]", "", SIGMA_STAR] + @ CDRewrite[" "* : "", "", "[EOS]", SIGMA_STAR]] +; + +d = b.kDigit; + +# Germanic inversion rule. +germanic = + (I["1+"] d "[E1]" D["+1"]) + | (I["2+"] d "[E1]" D["+2"]) + | (I["3+"] d "[E1]" D["+3"]) + | (I["4+"] d "[E1]" D["+4"]) + | (I["5+"] d "[E1]" D["+5"]) + | (I["6+"] d "[E1]" D["+6"]) + | (I["7+"] d "[E1]" D["+7"]) + | (I["8+"] d "[E1]" D["+8"]) + | (I["9+"] d "[E1]" D["+9"]) +; + +germanic_inversion = + CDRewrite[germanic, "", "", SIGMA_STAR, 'ltr', 'opt'] +; + +export GERMANIC_INVERSION = SIGMA_STAR; +export ORDINAL_RESTRICTION = SIGMA_STAR; +nondigits = b.kBytes - b.kDigit; +export ORDINAL_SUFFIX = D[nondigits*]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/cardinals.tsv b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/cardinals.tsv new file mode 100644 index 000000000..484a5c8a7 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/cardinals.tsv @@ -0,0 +1,177 @@ +0 ноле +0 ноль +0 нолю +0 ноля +0 нолём +0 нуле +0 нуль +0 нулю +0 нуля +0 нулём +1 один +1 одна +1 одни +1 одним +1 одними +1 одних +1 одно +1 одного +1 одной +1 одном +1 одному +1 одною +1 раз +1 одну +2 два +2 две +2 двум +2 двумя +2 двух +3 тремя +3 три +3 трём +3 трёх +4 четыре +4 четырьмя +4 четырём +4 четырёх +5 пяти +5 пять +5 пятью +6 шести +6 шесть +6 шестью +7 семи +7 семь +7 семью +8 восемь +8 восьми +8 восьмью +9 девяти +9 девять +9 девятью +10 десяти +10 десять +10 десятью +11 одиннадцати +11 одиннадцать +11 одиннадцатью +12 двенадцати +12 двенадцать +12 двенадцатью +13 тринадцати +13 тринадцать +13 тринадцатью +14 четырнадцати +14 четырнадцать +14 четырнадцатью +15 пятнадцати +15 пятнадцать +15 пятнадцатью +16 шестнадцати +16 шестнадцать +16 шестнадцатью +17 семнадцати +17 семнадцать +17 семнадцатью +18 восемнадцати +18 восемнадцать +18 восемнадцатью +19 девятнадцати +19 девятнадцать +19 девятнадцатью +20 двадцати +20 двадцать +20 двадцатью +30 тридцати +30 тридцать +30 тридцатью +40 сорок +40 сорока +50 пятидесяти +50 пятьдесят +50 пятьюдесятью +60 шестидесяти +60 шестьдесят +60 шестьюдесятью +70 семидесяти +70 семьдесят +70 семьюдесятью +80 восемьдесят +80 восьмидесяти +80 восьмьюдесятью +90 девяноста +90 девяносто +100 ста +100 сто +200 двести +200 двумстам +200 двумястами +200 двухсот +200 двухстах +300 тремястами +300 трехсот +300 триста +300 трёмстам +300 трёхстах +400 четыреста +400 четырьмястами +400 четырёмстам +400 четырёхсот +400 четырёхстах +500 пятисот +500 пятистам +500 пятистах +500 пятьсот +500 пятьюстами +600 шестисот +600 шестистам +600 шестистах +600 шестьсот +600 шестьюстами +700 семисот +700 семистам +700 семистах +700 семьсот +700 семьюстами +800 восемьсот +800 восемьюстами +800 восьмисот +800 восьмистам +800 восьмистах +800 восьмьюстами +900 девятисот +900 девятистам +900 девятистах +900 девятьсот +900 девятьюстами +1000 тысяч +1000 тысяча +1000 тысячам +1000 тысячами +1000 тысячах +1000 тысяче +1000 тысячей +1000 тысячи +1000 тысячу +1000 тысячью +1000000 миллион +1000000 миллиона +1000000 миллионам +1000000 миллионами +1000000 миллионах +1000000 миллионе +1000000 миллионов +1000000 миллионом +1000000 миллиону +1000000 миллионы +1000000000 миллиард +1000000000 миллиарда +1000000000 миллиардам +1000000000 миллиардами +1000000000 миллиардах +1000000000 миллиарде +1000000000 миллиардов +1000000000 миллиардом +1000000000 миллиарду +1000000000 миллиарды diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/extra_numbers.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/extra_numbers.grm new file mode 100644 index 000000000..644f30dff --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/extra_numbers.grm @@ -0,0 +1,35 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'ru/verbalizer/numbers.grm' as n; + +digit = b.kDigit @ n.CARDINAL_NUMBERS | ("0" : "@@OTHER_ZERO_VERBALIZATIONS@@"); + +export DIGITS = digit (n.I[" "] digit)*; + +# Various common factorizations + +two_digits = b.kDigit{2} @ n.CARDINAL_NUMBERS; + +three_digits = b.kDigit{3} @ n.CARDINAL_NUMBERS; + +mixed = + (digit n.I[" "] two_digits) + | (two_digits n.I[" "] two_digits) + | (two_digits n.I[" "] three_digits) + | (two_digits n.I[" "] two_digits n.I[" "] two_digits) +; + +export MIXED_NUMBERS = Optimize[mixed]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/factorization.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/factorization.grm new file mode 100644 index 000000000..860161463 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/factorization.grm @@ -0,0 +1,40 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'util/util.grm' as u; +import 'ru/verbalizer/numbers.grm' as n; + +func ToNumberName[expr] { + number_name_seq = n.CARDINAL_NUMBERS (" " n.CARDINAL_NUMBERS)*; + return Optimize[expr @ number_name_seq]; +} + +d = b.kDigit; + +leading_zero = CDRewrite[n.I[" "], ("[BOS]" | " ") "0", "", b.kBytes*]; + +by_ones = d n.I[" "]; +by_twos = (d{2} @ leading_zero) n.I[" "]; +by_threes = (d{3} @ leading_zero) n.I[" "]; + +groupings = by_twos* (by_threes | by_twos | by_ones); + +export FRACTIONAL_PART_UNGROUPED = + Optimize[ToNumberName[by_ones+ @ u.CLEAN_SPACES]] +; +export FRACTIONAL_PART_GROUPED = + Optimize[ToNumberName[groupings @ u.CLEAN_SPACES]] +; +export FRACTIONAL_PART_UNPARSED = Optimize[ToNumberName[d*]]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/float.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/float.grm new file mode 100644 index 000000000..c608507a7 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/float.grm @@ -0,0 +1,30 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'ru/verbalizer/factorization.grm' as f; +import 'ru/verbalizer/lexical_map.grm' as l; +import 'ru/verbalizer/numbers.grm' as n; + +fractional_part_ungrouped = f.FRACTIONAL_PART_UNGROUPED; +fractional_part_grouped = f.FRACTIONAL_PART_GROUPED; +fractional_part_unparsed = f.FRACTIONAL_PART_UNPARSED; + +__fractional_part__ = fractional_part_unparsed; +__decimal_marker__ = ","; + +export FLOAT = Optimize[ + (n.CARDINAL_NUMBERS + (__decimal_marker__ : " @@DECIMAL_DOT_EXPRESSION@@ ") + __fractional_part__) @ l.LEXICAL_MAP] +; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/g.fst b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/g.fst new file mode 100644 index 000000000..66665f390 Binary files /dev/null and b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/g.fst differ diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/lexical_map.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/lexical_map.grm new file mode 100644 index 000000000..e7bb32b0b --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/lexical_map.grm @@ -0,0 +1,25 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; + +lexical_map = StringFile['ru/verbalizer/lexical_map.tsv']; + +sigma_star = b.kBytes*; + +del_null = CDRewrite["__NULL__" : "", "", "", sigma_star]; + +export LEXICAL_MAP = Optimize[ + CDRewrite[lexical_map, "", "", sigma_star] @ del_null] +; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/lexical_map.tsv b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/lexical_map.tsv new file mode 100644 index 000000000..b78cf73df --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/lexical_map.tsv @@ -0,0 +1,221 @@ +@@CONNECTOR_RANGE@@ до +@@CONNECTOR_RATIO@@ к +@@CONNECTOR_BY@@ на +@@CONNECTOR_CONSECUTIVE_YEAR@@ до +@@JANUARY@@ январь +@@JANUARY@@ январи +@@JANUARY@@ января +@@JANUARY@@ январей +@@JANUARY@@ январю +@@JANUARY@@ январям +@@JANUARY@@ январь +@@JANUARY@@ январи +@@JANUARY@@ январём +@@JANUARY@@ январями +@@JANUARY@@ январе +@@JANUARY@@ январях +@@FEBRUARY@@ февраль +@@FEBRUARY@@ феврали +@@FEBRUARY@@ февраля +@@FEBRUARY@@ февралей +@@FEBRUARY@@ февралю +@@FEBRUARY@@ февралям +@@FEBRUARY@@ февраль +@@FEBRUARY@@ феврали +@@FEBRUARY@@ февралём +@@FEBRUARY@@ февралями +@@FEBRUARY@@ феврале +@@FEBRUARY@@ февралях +@@MARCH@@ март +@@MARCH@@ марты +@@MARCH@@ марта +@@MARCH@@ мартов +@@MARCH@@ марту +@@MARCH@@ мартам +@@MARCH@@ март +@@MARCH@@ марты +@@MARCH@@ мартом +@@MARCH@@ мартами +@@MARCH@@ марте +@@MARCH@@ мартах +@@APRIL@@ апрель +@@APRIL@@ апрели +@@APRIL@@ апреля +@@APRIL@@ апрелей +@@APRIL@@ апрелю +@@APRIL@@ апрелям +@@APRIL@@ апрель +@@APRIL@@ апрели +@@APRIL@@ апрелем +@@APRIL@@ апрелями +@@APRIL@@ апреле +@@APRIL@@ апрелях +@@MAY@@ май +@@MAY@@ маи +@@MAY@@ мая +@@MAY@@ маев +@@MAY@@ маю +@@MAY@@ маям +@@MAY@@ май +@@MAY@@ маи +@@MAY@@ маем +@@MAY@@ маями +@@MAY@@ мае +@@MAY@@ маях +@@JUN@@ июнь +@@JUN@@ июни +@@JUN@@ июня +@@JUN@@ июней +@@JUN@@ июню +@@JUN@@ июням +@@JUN@@ июнь +@@JUN@@ июни +@@JUN@@ июнем +@@JUN@@ июнями +@@JUN@@ июне +@@JUN@@ июнях +@@JUL@@ июль +@@JUL@@ июли +@@JUL@@ июля +@@JUL@@ июлей +@@JUL@@ июлю +@@JUL@@ июлям +@@JUL@@ июль +@@JUL@@ июли +@@JUL@@ июлем +@@JUL@@ июлями +@@JUL@@ июле +@@JUL@@ июлях +@@AUGUST@@ август +@@AUGUST@@ августы +@@AUGUST@@ августа +@@AUGUST@@ августов +@@AUGUST@@ августу +@@AUGUST@@ августам +@@AUGUST@@ август +@@AUGUST@@ августы +@@AUGUST@@ августом +@@AUGUST@@ августами +@@AUGUST@@ августе +@@AUGUST@@ августах +@@SEPTEMBER@@ сентябрь +@@SEPTEMBER@@ сентябри +@@SEPTEMBER@@ сентября +@@SEPTEMBER@@ сентябрей +@@SEPTEMBER@@ сентябрю +@@SEPTEMBER@@ сентябрям +@@SEPTEMBER@@ сентябрь +@@SEPTEMBER@@ сентябри +@@SEPTEMBER@@ сентябрём +@@SEPTEMBER@@ сентябрями +@@SEPTEMBER@@ сентябре +@@SEPTEMBER@@ сентябрях +@@OCTOBER@@ октябрь +@@OCTOBER@@ октябри +@@OCTOBER@@ октября +@@OCTOBER@@ октябрей +@@OCTOBER@@ октябрю +@@OCTOBER@@ октябрям +@@OCTOBER@@ октябрь +@@OCTOBER@@ октябри +@@OCTOBER@@ октябрём +@@OCTOBER@@ октябрями +@@OCTOBER@@ октябре +@@OCTOBER@@ октябрях +@@NOVEMBER@@ ноябрь +@@NOVEMBER@@ ноябри +@@NOVEMBER@@ ноября +@@NOVEMBER@@ ноябрей +@@NOVEMBER@@ ноябрю +@@NOVEMBER@@ ноябрям +@@NOVEMBER@@ ноябрь +@@NOVEMBER@@ ноябри +@@NOVEMBER@@ ноябрём +@@NOVEMBER@@ ноябрями +@@NOVEMBER@@ ноябре +@@NOVEMBER@@ ноябрях +@@DECEMBER@@ декабрь +@@DECEMBER@@ декабри +@@DECEMBER@@ декабря +@@DECEMBER@@ декабрей +@@DECEMBER@@ декабрю +@@DECEMBER@@ декабрям +@@DECEMBER@@ декабрь +@@DECEMBER@@ декабри +@@DECEMBER@@ декабрём +@@DECEMBER@@ декабрями +@@DECEMBER@@ декабре +@@DECEMBER@@ декабрях +@@MINUS@@ минус +@@DECIMAL_DOT_EXPRESSION@@ целая +@@DECIMAL_DOT_EXPRESSION@@ целой +@@DECIMAL_DOT_EXPRESSION@@ целой +@@DECIMAL_DOT_EXPRESSION@@ целую +@@DECIMAL_DOT_EXPRESSION@@ целой +@@DECIMAL_DOT_EXPRESSION@@ целой +@@DECIMAL_DOT_EXPRESSION@@ целым +@@DECIMAL_DOT_EXPRESSION@@ целыми +@@DECIMAL_DOT_EXPRESSION@@ целых +@@DECIMAL_DOT_EXPRESSION@@ целых +@@URL_DOT_EXPRESSION@@ точка +@@PERIOD@@ точка +@@DECIMAL_EXPONENT@@ умножить на десять в степени +@@COLON@@ двоеточие +@@SLASH@@ косая черта +@@PASSWORD@@ пароль +@@AT@@ собака +@@PORT@@ порт +@@QUESTION_MARK@@ вопросительный знак +@@HASH@@ решётка +@@HASH@@ решетка +@@MONEY_AND@@ и +@@AND@@ и +@@PHONE_PLUS@@ плюс +@@ARITHMETIC_PLUS@@ плюс +@@PHONE_EXTENSION@@ добавочный номер +@@TIME_AM@@ утра +@@TIME_PM@@ вечера +@@HOUR@@ час +@@HOUR@@ часа +@@HOUR@@ часам +@@HOUR@@ часами +@@HOUR@@ часах +@@HOUR@@ часе +@@HOUR@@ часов +@@HOUR@@ часом +@@HOUR@@ часу +@@HOUR@@ часы +@@MINUTE@@ минут +@@MINUTE@@ минута +@@MINUTE@@ минутам +@@MINUTE@@ минутами +@@MINUTE@@ минутах +@@MINUTE@@ минуте +@@MINUTE@@ минутой +@@MINUTE@@ минутою +@@MINUTE@@ минуту +@@MINUTE@@ минуты +@@TIME_AFTER@@ __NULL__ +@@TIME_BEFORE_PRE@@ без +@@TIME_QUARTER@@ четверть +@@TIME_QUARTER@@ четверти +@@TIME_HALF@@ половина +@@TIME_HALF@@ половины +@@TIME_HALF@@ половину +@@TIME_HALF@@ половин +@@TIME_HALF@@ половине +@@TIME_HALF@@ половинам +@@TIME_HALF@@ половиной +@@TIME_HALF@@ половинами +@@TIME_HALF@@ половинах +@@PERCENT@@ процент +@@PERCENT@@ процента +@@PERCENT@@ процентам +@@PERCENT@@ процентами +@@PERCENT@@ процентах +@@PERCENT@@ проценте +@@PERCENT@@ процентов +@@PERCENT@@ процентом +@@PERCENT@@ проценту +@@PERCENT@@ проценты +@@PERCENT@@ проценты diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/math.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/math.grm new file mode 100644 index 000000000..061de4a78 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/math.grm @@ -0,0 +1,34 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'ru/verbalizer/float.grm' as f; +import 'ru/verbalizer/lexical_map.grm' as l; +import 'ru/verbalizer/numbers.grm' as n; + +float = f.FLOAT; +card = n.CARDINAL_NUMBERS; +number = card | float; + +plus = "+" : " @@ARITHMETIC_PLUS@@ "; +times = "*" : " @@ARITHMETIC_TIMES@@ "; +minus = "-" : " @@ARITHMETIC_MINUS@@ "; +division = "/" : " @@ARITHMETIC_DIVISION@@ "; + +operator = plus | times | minus | division; + +percent = "%" : " @@PERCENT@@"; + +export ARITHMETIC = + Optimize[((number operator number) | (number percent)) @ l.LEXICAL_MAP] +; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/miscellaneous.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/miscellaneous.grm new file mode 100644 index 000000000..352363106 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/miscellaneous.grm @@ -0,0 +1,78 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'ru/classifier/cyrillic.grm' as c; +import 'ru/verbalizer/extra_numbers.grm' as e; +import 'ru/verbalizer/lexical_map.grm' as l; +import 'ru/verbalizer/numbers.grm' as n; +import 'ru/verbalizer/spelled.grm' as s; + +letter = b.kAlpha | c.kCyrillicAlpha; +dash = "-"; +word = letter+; +possibly_split_word = word (((dash | ".") : " ") word)* n.D["."]?; + +post_word_symbol = + ("+" : ("@@ARITHMETIC_PLUS@@" | "@@POSITIVE@@")) | + ("-" : ("@@ARITHMETIC_MINUS@@" | "@@NEGATIVE@@")) | + ("*" : "@@STAR@@") +; + +pre_word_symbol = + ("@" : "@@AT@@") | + ("/" : "@@SLASH@@") | + ("#" : "@@HASH@@") +; + +post_word = possibly_split_word n.I[" "] post_word_symbol; + +pre_word = pre_word_symbol n.I[" "] possibly_split_word; + +## Number/digit sequence combos, maybe with a dash + +spelled_word = word @ s.SPELLED_NO_LETTER; + +word_number = + (word | spelled_word) + (n.I[" "] | (dash : " ")) + (e.DIGITS | n.CARDINAL_NUMBERS | e.MIXED_NUMBERS) +; + +number_word = + (e.DIGITS | n.CARDINAL_NUMBERS | e.MIXED_NUMBERS) + (n.I[" "] | (dash : " ")) + (word | spelled_word) +; + +## Two-digit year. + +# Note that in this case to be fair we really have to allow ordinals too since +# in some languages that's what you would have. + +two_digit_year = n.D["'"] (b.kDigit{2} @ (n.CARDINAL_NUMBERS | e.DIGITS)); + +dot_com = ("." : "@@URL_DOT_EXPRESSION@@") n.I[" "] "com"; + +miscellaneous = Optimize[ + possibly_split_word + | post_word + | pre_word + | word_number + | number_word + | two_digit_year + | dot_com +]; + +export MISCELLANEOUS = Optimize[miscellaneous @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/money.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/money.grm new file mode 100644 index 000000000..ddea02431 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/money.grm @@ -0,0 +1,44 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'ru/verbalizer/lexical_map.grm' as l; +import 'ru/verbalizer/numbers.grm' as n; + +card = n.CARDINAL_NUMBERS; + +__currency__ = StringFile['ru/verbalizer/money.tsv']; + +d = b.kDigit; +D = d - "0"; + +cents = ((n.D["0"] | D) d) @ card; + +# Only dollar for the verbalizer tests for English. Will need to add other +# currencies. +usd_maj = Project["usd_maj" @ __currency__, 'output']; +usd_min = Project["usd_min" @ __currency__, 'output']; +and = " @@MONEY_AND@@ " | " "; + +dollar1 = + n.D["$"] card n.I[" " usd_maj] n.I[and] n.D["."] cents n.I[" " usd_min] +; + +dollar2 = n.D["$"] card n.I[" " usd_maj] n.D["."] n.D["00"]; + +dollar3 = n.D["$"] card n.I[" " usd_maj]; + +dollar = Optimize[dollar1 | dollar2 | dollar3]; + +export MONEY = Optimize[dollar @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/money.tsv b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/money.tsv new file mode 100644 index 000000000..184ea8fe7 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/money.tsv @@ -0,0 +1,24 @@ +usd_maj доллара +usd_maj долларами +usd_maj долларам +usd_maj долларах +usd_maj долларе +usd_maj долларов +usd_maj долларом +usd_maj доллар +usd_maj доллар +usd_maj доллару +usd_maj доллары +usd_maj доллары +usd_min цент +usd_min цент +usd_min цента +usd_min центам +usd_min центами +usd_min центах +usd_min центе +usd_min центов +usd_min центом +usd_min центу +usd_min центы +usd_min центы diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/nominatives.tsv b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/nominatives.tsv new file mode 100644 index 000000000..fdfb61038 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/nominatives.tsv @@ -0,0 +1,166 @@ +нуль +ноль +один +два +две +три +четыре +пять +шесть +семь +восемь +девять +десять +одиннадцать +двенадцать +тринадцать +четырнадцать +пятнадцать +шестнадцать +семнадцать +восемнадцать +девятнадцать +двадцать +тридцать +сорок +пятьдесят +шестьдесят +семьдесят +восемьдесят +девяносто +сто +двести +триста +четыреста +пятьсот +шестьсот +семьсот +восемьсот +девятьсот +тысячи +тысяч +тысяча +миллионов +миллион +миллиона +миллиардов +миллиард +миллиарда +первая +первого +первое +первый +вторая +второе +второй +третий +третье +третья +четвертая +четвертое +четвертой +пятая +пятое +пятой +шестая +шестое +шестой +седьмая +седьмое +седьмой +восьмая +восьмое +восьмой +девятая +девятое +девятой +десятая +десятое +десятой +одиннадцатая +одиннадцатое +одиннадцатой +двенадцатая +двенадцатое +двенадцатой +тринадцатая +тринадцатое +тринадцатой +четырнадцатая +четырнадцатое +четырнадцатой +пятнадцатая +пятнадцатое +пятнадцатой +шестнадцатая +шестнадцатое +шестнадцатой +семнадцатая +семнадцатое +семнадцатой +восемнадцатая +восемнадцатое +восемнадцатой +девятнадцатая +девятнадцатое +девятнадцатой +двадцатая +двадцатое +двадцатой +тридцатая +тридцатое +тридцатой +сороковая +сороковое +сороковой +пятидесятая +пятидесятое +пятидесятой +шестидесятая +шестидесятое +шестидесятой +семидесятая +семидесятое +семидесятой +восьмидесятая +восьмидесятое +восьмидесятой +девяностая +девяностое +девяностой +сотая +сотое +сотой +двухсотая +двухсотое +двухсотой +трехсотая +трехсотое +трехсотой +четырехсотая +четырехсотое +четырехсотой +пятисотая +пятисотое +пятисотой +шестисотая +шестисотое +шестисотой +семисотая +семисотое +семисотой +восьмисотая +восьмисотое +восьмисотой +девятисотая +девятисотое +девятисотой +тысячная +тысячное +тысячной +миллионная +миллионное +миллионной +миллиардная +миллиардное +миллиардной diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/number_names.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/number_names.grm new file mode 100644 index 000000000..84ac15a25 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/number_names.grm @@ -0,0 +1,48 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Russian minimally supervised number grammar. +# +# Supports cardinals and ordinals in all inflected forms. +# +# The language-specific acceptor G was compiled with digit, teen, decade, +# century, and big power-of-ten preterminals. The lexicon transducer is +# highly ambiguous, but no LM is used. + +import 'util/arithmetic.grm' as a; + +# Intersects the universal factorization transducer (F) with language-specific +# acceptor (G). + +d = a.DELTA_STAR; +f = a.IARITHMETIC_RESTRICTED; +g = LoadFst['ru/verbalizer/g.fst']; +fg = Optimize[d @ Optimize[f @ Optimize[f @ Optimize[f @ g]]]]; +test1 = AssertEqual["230" @ fg, "(+ 200 30 +)"]; + +# Compiles lexicon transducers (L). + +cardinal_name = StringFile['ru/verbalizer/cardinals.tsv']; +cardinal_l = Optimize[(cardinal_name " ")* cardinal_name]; + +ordinal_name = StringFile['ru/verbalizer/ordinals.tsv']; +ordinal_l = Optimize[(cardinal_name " ")* ordinal_name]; + +# Composes L with the leaf transducer (P), then composes that with FG. + +p = a.LEAVES; + +export CARDINAL_NUMBER_NAME = Optimize[fg @ (p @ cardinal_l)]; + +export ORDINAL_NUMBER_NAME = Optimize[fg @ (p @ ordinal_l)]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/numbers.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/numbers.grm new file mode 100644 index 000000000..b25f1fb67 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/numbers.grm @@ -0,0 +1,68 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'ru/verbalizer/number_names.grm' as n; +import 'universal/thousands_punct.grm' as t; +import 'util/byte.grm' as b; + +nominatives = StringFile['ru/verbalizer/nominatives.tsv']; + +sigma_star = b.kBytes*; + +nominative_filter = + CDRewrite[nominatives ("" : "" <-1>), "[BOS]" | " ", " " | "[EOS]", sigma_star] +; + +cardinal = n.CARDINAL_NUMBER_NAME; +ordinal = n.ORDINAL_NUMBER_NAME; + +# Putting these here since this grammar gets incorporated by all the others. + +func I[expr] { + return "" : expr; +} + +func D[expr] { + return expr : ""; +} + +# Since we know this is the default for Russian, it's fair game to set it. +separators = t.dot_thousands | t.no_delimiter; + +export CARDINAL_NUMBERS = Optimize[ + separators + @ cardinal +]; + +export ORDINAL_NUMBERS_UNMARKED = Optimize[ + separators + @ ordinal +]; + + +endings = StringFile['ru/verbalizer/ordinal_endings.tsv']; + +not_dash = (b.kBytes - "-")+; +del_ending = CDRewrite[("-" not_dash) : "", "", "[EOS]", sigma_star]; + +# Needs nominative_filter here if we take out Kyle's models. +export ORDINAL_NUMBERS_MARKED = Optimize[ + Optimize[Optimize[separators @ ordinal] "-" not_dash] + @ Optimize[sigma_star endings] + @ del_ending] +; + +export ORDINAL_NUMBERS = + Optimize[ORDINAL_NUMBERS_MARKED | ORDINAL_NUMBERS_UNMARKED] +; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/numbers_plus.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/numbers_plus.grm new file mode 100644 index 000000000..dd000b3b9 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/numbers_plus.grm @@ -0,0 +1,133 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Grammar for things built mostly on numbers. + +import 'ru/verbalizer/factorization.grm' as f; +import 'ru/verbalizer/lexical_map.grm' as l; +import 'ru/verbalizer/numbers.grm' as n; + +num = n.CARDINAL_NUMBERS; +ord = n.ORDINAL_NUMBERS_UNMARKED; +digits = f.FRACTIONAL_PART_UNGROUPED; + +# Various symbols. + +plus = "+" : "@@ARITHMETIC_PLUS@@"; +minus = "-" : "@@ARITHMETIC_MINUS@@"; +slash = "/" : "@@SLASH@@"; +dot = "." : "@@URL_DOT_EXPRESSION@@"; +dash = "-" : "@@DASH@@"; +equals = "=" : "@@ARITHMETIC_EQUALS@@"; + +degree = "°" : "@@DEGREE@@"; + +division = ("/" | "÷") : "@@ARITHMETIC_DIVISION@@"; + +times = ("x" | "*") : "@@ARITHMETIC_TIMES@@"; + +power = "^" : "@@DECIMAL_EXPONENT@@"; + +square_root = "√" : "@@SQUARE_ROOT@@"; + +percent = "%" : "@@PERCENT@@"; + +# Safe roman numbers. + +# NB: Do not change the formatting here. NO_EDIT must be on the same +# line as the path. +rfile = + 'universal/roman_numerals.tsv' # NO_EDIT +; + +roman = StringFile[rfile]; + +## Main categories. + +cat_dot_number = + num + n.I[" "] dot n.I[" "] num + (n.I[" "] dot n.I[" "] num)+ +; + +cat_slash_number = + num + n.I[" "] slash n.I[" "] num + (n.I[" "] slash n.I[" "] num)* +; + +cat_dash_number = + num + n.I[" "] dash n.I[" "] num + (n.I[" "] dash n.I[" "] num)* +; + +cat_signed_number = ((plus | minus) n.I[" "])? num; + +cat_degree = cat_signed_number n.I[" "] degree; + +cat_country_code = plus n.I[" "] (num | digits); + +cat_math_operations = + plus + | minus + | division + | times + | equals + | percent + | power + | square_root +; + +# Roman numbers are often either cardinals or ordinals in various languages. +cat_roman = roman @ (num | ord); + +# Allow +# +# number:number +# number-number +# +# to just be +# +# number number. + +cat_number_number = + num ((":" | "-") : " ") num +; + +# Some additional readings for these symbols. + +cat_additional_readings = + ("/" : "@@PER@@") | + ("+" : "@@AND@@") | + ("-" : ("@@HYPHEN@@" | "@@CONNECTOR_TO@@")) | + ("*" : "@@STAR@@") | + ("x" : ("x" | "@@CONNECTOR_BY@@")) | + ("@" : "@@AT@@") +; + +numbers_plus = Optimize[ + cat_dot_number + | cat_slash_number + | cat_dash_number + | cat_signed_number + | cat_degree + | cat_country_code + | cat_math_operations + | cat_roman + | cat_number_number + | cat_additional_readings +]; + +export NUMBERS_PLUS = Optimize[numbers_plus @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/ordinal_endings.tsv b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/ordinal_endings.tsv new file mode 100644 index 000000000..6db35e26d --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/ordinal_endings.tsv @@ -0,0 +1,39 @@ +ая-ая +ого-го +ьего-го +ьего-его +ьей-ей +ьему-ему +ьем-ем +ое-е +ые-е +ье-е +ий-ий +ьими-ими +ьим-им +ьих-их +ьи-и +ий-й +ой-й +ый-й +ыми-ми +ьими-ми +ому-му +ьему-му +ого-ого +ое-ое +ой-ой +ом-ом +ому-ому +ую-ую +ых-х +ьих-х +ые-ые +ый-ый +ыми-ыми +ым-ым +ых-ых +ую-ю +ью-ю +ая-я +ья-я diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/ordinals-lex.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/ordinals-lex.grm new file mode 100644 index 000000000..ca4d86d07 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/ordinals-lex.grm @@ -0,0 +1,804 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# AUTOMATICALLY GENERATED: DO NOT EDIT. +import 'util/byte.grm' as b; + +# Utilities for insertion and deletion. + +func I[expr] { + return "" : expr; +} + +func D[expr] { + return expr : ""; +} + +# Powers of base 10. +export POWERS = + "[E15]" + | "[E14]" + | "[E13]" + | "[E12]" + | "[E11]" + | "[E10]" + | "[E9]" + | "[E8]" + | "[E7]" + | "[E6]" + | "[E5]" + | "[E4]" + | "[E3]" + | "[E2]" + | "[E1]" +; + +export SIGMA = b.kBytes | POWERS; + +export SIGMA_STAR = SIGMA*; + +export SIGMA_PLUS = SIGMA+; + +################################################################################ +# BEGIN LANGUAGE SPECIFIC DATA +revaluations = + ("[E4]" : "[E1]") + | ("[E5]" : "[E2]") + | ("[E7]" : "[E1]") + | ("[E8]" : "[E2]") +; + +Ms = "[E3]" | "[E6]" | "[E9]"; + + +func Zero[expr] { + return expr : (""); +} + +space = " "; + +lexset3 = Optimize[ + ("1[E1]+1" : "одиннадцатая@") + | ("1[E1]+1" : "одиннадцати") + | ("1[E1]+1" : "одиннадцатого@") + | ("1[E1]+1" : "одиннадцатое@") + | ("1[E1]+1" : "одиннадцатой@") + | ("1[E1]+1" : "одиннадцатом@") + | ("1[E1]+1" : "одиннадцатому@") + | ("1[E1]+1" : "одиннадцатую@") + | ("1[E1]+1" : "одиннадцатые@") + | ("1[E1]+1" : "одиннадцатый@") + | ("1[E1]+1" : "одиннадцатым@") + | ("1[E1]+1" : "одиннадцатыми@") + | ("1[E1]+1" : "одиннадцатых@") + | ("1[E1]+1" : "одиннадцать") + | ("1[E1]+1" : "одиннадцатью") + | ("1[E1]+2" : "двенадцатая@") + | ("1[E1]+2" : "двенадцати") + | ("1[E1]+2" : "двенадцатого@") + | ("1[E1]+2" : "двенадцатое@") + | ("1[E1]+2" : "двенадцатой@") + | ("1[E1]+2" : "двенадцатом@") + | ("1[E1]+2" : "двенадцатому@") + | ("1[E1]+2" : "двенадцатую@") + | ("1[E1]+2" : "двенадцатые@") + | ("1[E1]+2" : "двенадцатый@") + | ("1[E1]+2" : "двенадцатым@") + | ("1[E1]+2" : "двенадцатыми@") + | ("1[E1]+2" : "двенадцатых@") + | ("1[E1]+2" : "двенадцать") + | ("1[E1]+2" : "двенадцатью") + | ("1[E1]+3" : "тринадцатая@") + | ("1[E1]+3" : "тринадцати") + | ("1[E1]+3" : "тринадцатого@") + | ("1[E1]+3" : "тринадцатое@") + | ("1[E1]+3" : "тринадцатой@") + | ("1[E1]+3" : "тринадцатом@") + | ("1[E1]+3" : "тринадцатому@") + | ("1[E1]+3" : "тринадцатую@") + | ("1[E1]+3" : "тринадцатые@") + | ("1[E1]+3" : "тринадцатый@") + | ("1[E1]+3" : "тринадцатым@") + | ("1[E1]+3" : "тринадцатыми@") + | ("1[E1]+3" : "тринадцатых@") + | ("1[E1]+3" : "тринадцать") + | ("1[E1]+3" : "тринадцатью") + | ("1[E1]+4" : "четырнадцатая@") + | ("1[E1]+4" : "четырнадцати") + | ("1[E1]+4" : "четырнадцатого@") + | ("1[E1]+4" : "четырнадцатое@") + | ("1[E1]+4" : "четырнадцатой@") + | ("1[E1]+4" : "четырнадцатом@") + | ("1[E1]+4" : "четырнадцатому@") + | ("1[E1]+4" : "четырнадцатую@") + | ("1[E1]+4" : "четырнадцатые@") + | ("1[E1]+4" : "четырнадцатый@") + | ("1[E1]+4" : "четырнадцатым@") + | ("1[E1]+4" : "четырнадцатыми@") + | ("1[E1]+4" : "четырнадцатых@") + | ("1[E1]+4" : "четырнадцать") + | ("1[E1]+4" : "четырнадцатью") + | ("1[E1]+5" : "пятнадцатая@") + | ("1[E1]+5" : "пятнадцати") + | ("1[E1]+5" : "пятнадцатого@") + | ("1[E1]+5" : "пятнадцатое@") + | ("1[E1]+5" : "пятнадцатой@") + | ("1[E1]+5" : "пятнадцатом@") + | ("1[E1]+5" : "пятнадцатому@") + | ("1[E1]+5" : "пятнадцатую@") + | ("1[E1]+5" : "пятнадцатые@") + | ("1[E1]+5" : "пятнадцатый@") + | ("1[E1]+5" : "пятнадцатым@") + | ("1[E1]+5" : "пятнадцатыми@") + | ("1[E1]+5" : "пятнадцатых@") + | ("1[E1]+5" : "пятнадцать") + | ("1[E1]+5" : "пятнадцатью") + | ("1[E1]+6" : "шестнадцатая@") + | ("1[E1]+6" : "шестнадцати") + | ("1[E1]+6" : "шестнадцатого@") + | ("1[E1]+6" : "шестнадцатое@") + | ("1[E1]+6" : "шестнадцатой@") + | ("1[E1]+6" : "шестнадцатом@") + | ("1[E1]+6" : "шестнадцатому@") + | ("1[E1]+6" : "шестнадцатую@") + | ("1[E1]+6" : "шестнадцатые@") + | ("1[E1]+6" : "шестнадцатый@") + | ("1[E1]+6" : "шестнадцатым@") + | ("1[E1]+6" : "шестнадцатыми@") + | ("1[E1]+6" : "шестнадцатых@") + | ("1[E1]+6" : "шестнадцать") + | ("1[E1]+6" : "шестнадцатью") + | ("1[E1]+7" : "семнадцатая@") + | ("1[E1]+7" : "семнадцати") + | ("1[E1]+7" : "семнадцатого@") + | ("1[E1]+7" : "семнадцатое@") + | ("1[E1]+7" : "семнадцатой@") + | ("1[E1]+7" : "семнадцатом@") + | ("1[E1]+7" : "семнадцатому@") + | ("1[E1]+7" : "семнадцатую@") + | ("1[E1]+7" : "семнадцатые@") + | ("1[E1]+7" : "семнадцатый@") + | ("1[E1]+7" : "семнадцатым@") + | ("1[E1]+7" : "семнадцатыми@") + | ("1[E1]+7" : "семнадцатых@") + | ("1[E1]+7" : "семнадцать") + | ("1[E1]+7" : "семнадцатью") + | ("1[E1]+8" : "восемнадцатая@") + | ("1[E1]+8" : "восемнадцати") + | ("1[E1]+8" : "восемнадцатого@") + | ("1[E1]+8" : "восемнадцатое@") + | ("1[E1]+8" : "восемнадцатой@") + | ("1[E1]+8" : "восемнадцатом@") + | ("1[E1]+8" : "восемнадцатому@") + | ("1[E1]+8" : "восемнадцатую@") + | ("1[E1]+8" : "восемнадцатые@") + | ("1[E1]+8" : "восемнадцатый@") + | ("1[E1]+8" : "восемнадцатым@") + | ("1[E1]+8" : "восемнадцатыми@") + | ("1[E1]+8" : "восемнадцатых@") + | ("1[E1]+8" : "восемнадцать") + | ("1[E1]+8" : "восемнадцатью") + | ("1[E1]+9" : "девятнадцатая@") + | ("1[E1]+9" : "девятнадцати") + | ("1[E1]+9" : "девятнадцатого@") + | ("1[E1]+9" : "девятнадцатое@") + | ("1[E1]+9" : "девятнадцатой@") + | ("1[E1]+9" : "девятнадцатом@") + | ("1[E1]+9" : "девятнадцатому@") + | ("1[E1]+9" : "девятнадцатую@") + | ("1[E1]+9" : "девятнадцатые@") + | ("1[E1]+9" : "девятнадцатый@") + | ("1[E1]+9" : "девятнадцатым@") + | ("1[E1]+9" : "девятнадцатыми@") + | ("1[E1]+9" : "девятнадцатых@") + | ("1[E1]+9" : "девятнадцать") + | ("1[E1]+9" : "девятнадцатью")] +; + +lex3 = CDRewrite[lexset3 I[space], "", "", SIGMA_STAR]; + +lexset2 = Optimize[ + ("1[E1]" : "десятая@") + | ("1[E1]" : "десяти") + | ("1[E1]" : "десятого@") + | ("1[E1]" : "десятое@") + | ("1[E1]" : "десятой@") + | ("1[E1]" : "десятом@") + | ("1[E1]" : "десятому@") + | ("1[E1]" : "десятую@") + | ("1[E1]" : "десятые@") + | ("1[E1]" : "десятый@") + | ("1[E1]" : "десятым@") + | ("1[E1]" : "десятыми@") + | ("1[E1]" : "десятых@") + | ("1[E1]" : "десять") + | ("1[E1]" : "десятью") + | ("1[E2]" : "сотая@") + | ("1[E2]" : "сотого@") + | ("1[E2]" : "сотое@") + | ("1[E2]" : "сотой@") + | ("1[E2]" : "сотом@") + | ("1[E2]" : "сотому@") + | ("1[E2]" : "сотую@") + | ("1[E2]" : "сотые@") + | ("1[E2]" : "сотый@") + | ("1[E2]" : "сотым@") + | ("1[E2]" : "сотыми@") + | ("1[E2]" : "сотых@") + | ("1[E2]" : "ста") + | ("1[E2]" : "сто") + | ("1[E3]" : "тысячная@") + | ("1[E3]" : "тысячного@") + | ("1[E3]" : "тысячное@") + | ("1[E3]" : "тысячной@") + | ("1[E3]" : "тысячном@") + | ("1[E3]" : "тысячному@") + | ("1[E3]" : "тысячную@") + | ("1[E3]" : "тысячные@") + | ("1[E3]" : "тысячный@") + | ("1[E3]" : "тысячным@") + | ("1[E3]" : "тысячными@") + | ("1[E3]" : "тысячных@") + | ("1[E6]" : "миллионная@") + | ("1[E6]" : "миллионного@") + | ("1[E6]" : "миллионное@") + | ("1[E6]" : "миллионной@") + | ("1[E6]" : "миллионном@") + | ("1[E6]" : "миллионному@") + | ("1[E6]" : "миллионную@") + | ("1[E6]" : "миллионные@") + | ("1[E6]" : "миллионный@") + | ("1[E6]" : "миллионным@") + | ("1[E6]" : "миллионными@") + | ("1[E6]" : "миллионных@") + | ("1[E9]" : "миллиардная@") + | ("1[E9]" : "миллиардного@") + | ("1[E9]" : "миллиардное@") + | ("1[E9]" : "миллиардной@") + | ("1[E9]" : "миллиардном@") + | ("1[E9]" : "миллиардному@") + | ("1[E9]" : "миллиардную@") + | ("1[E9]" : "миллиардные@") + | ("1[E9]" : "миллиардный@") + | ("1[E9]" : "миллиардным@") + | ("1[E9]" : "миллиардными@") + | ("1[E9]" : "миллиардных@") + | ("2[E1]" : "двадцатая@") + | ("2[E1]" : "двадцати") + | ("2[E1]" : "двадцатого@") + | ("2[E1]" : "двадцатое@") + | ("2[E1]" : "двадцатой@") + | ("2[E1]" : "двадцатом@") + | ("2[E1]" : "двадцатому@") + | ("2[E1]" : "двадцатую@") + | ("2[E1]" : "двадцатые@") + | ("2[E1]" : "двадцатый@") + | ("2[E1]" : "двадцатым@") + | ("2[E1]" : "двадцатыми@") + | ("2[E1]" : "двадцатых@") + | ("2[E1]" : "двадцать") + | ("2[E1]" : "двадцатью") + | ("2[E2]" : "двести") + | ("2[E2]" : "двумстам") + | ("2[E2]" : "двумястами") + | ("2[E2]" : "двухсот") + | ("2[E2]" : "двухсотая@") + | ("2[E2]" : "двухсотого@") + | ("2[E2]" : "двухсотое@") + | ("2[E2]" : "двухсотой@") + | ("2[E2]" : "двухсотом@") + | ("2[E2]" : "двухсотому@") + | ("2[E2]" : "двухсотую@") + | ("2[E2]" : "двухсотые@") + | ("2[E2]" : "двухсотый@") + | ("2[E2]" : "двухсотым@") + | ("2[E2]" : "двухсотыми@") + | ("2[E2]" : "двухсотых@") + | ("2[E2]" : "двухстах") + | ("3[E1]" : "тридцатая@") + | ("3[E1]" : "тридцати") + | ("3[E1]" : "тридцатого@") + | ("3[E1]" : "тридцатое@") + | ("3[E1]" : "тридцатой@") + | ("3[E1]" : "тридцатом@") + | ("3[E1]" : "тридцатому@") + | ("3[E1]" : "тридцатую@") + | ("3[E1]" : "тридцатые@") + | ("3[E1]" : "тридцатый@") + | ("3[E1]" : "тридцатым@") + | ("3[E1]" : "тридцатыми@") + | ("3[E1]" : "тридцатых@") + | ("3[E1]" : "тридцать") + | ("3[E1]" : "тридцатью") + | ("3[E2]" : "тремстам") + | ("3[E2]" : "тремястами") + | ("3[E2]" : "трехсот") + | ("3[E2]" : "трехсотая@") + | ("3[E2]" : "трехсотого@") + | ("3[E2]" : "трехсотое@") + | ("3[E2]" : "трехсотой@") + | ("3[E2]" : "трехсотом@") + | ("3[E2]" : "трехсотому@") + | ("3[E2]" : "трехсотую@") + | ("3[E2]" : "трехсотые@") + | ("3[E2]" : "трехсотый@") + | ("3[E2]" : "трехсотым@") + | ("3[E2]" : "трехсотыми@") + | ("3[E2]" : "трехсотых@") + | ("3[E2]" : "трехстах") + | ("3[E2]" : "триста") + | ("4[E1]" : "сорок") + | ("4[E1]" : "сорока") + | ("4[E1]" : "сороковая@") + | ("4[E1]" : "сорокового@") + | ("4[E1]" : "сороковое@") + | ("4[E1]" : "сороковой@") + | ("4[E1]" : "сороковом@") + | ("4[E1]" : "сороковому@") + | ("4[E1]" : "сороковую@") + | ("4[E1]" : "сороковые@") + | ("4[E1]" : "сороковым@") + | ("4[E1]" : "сороковыми@") + | ("4[E1]" : "сороковых@") + | ("4[E2]" : "четыремстам") + | ("4[E2]" : "четыреста") + | ("4[E2]" : "четырехсот") + | ("4[E2]" : "четырехсотая@") + | ("4[E2]" : "четырехсотого@") + | ("4[E2]" : "четырехсотое@") + | ("4[E2]" : "четырехсотой@") + | ("4[E2]" : "четырехсотом@") + | ("4[E2]" : "четырехсотому@") + | ("4[E2]" : "четырехсотую@") + | ("4[E2]" : "четырехсотые@") + | ("4[E2]" : "четырехсотый@") + | ("4[E2]" : "четырехсотым@") + | ("4[E2]" : "четырехсотыми@") + | ("4[E2]" : "четырехсотых@") + | ("4[E2]" : "четырехстах") + | ("4[E2]" : "четырьмястами") + | ("5[E1]" : "пятидесятая@") + | ("5[E1]" : "пятидесяти") + | ("5[E1]" : "пятидесятого@") + | ("5[E1]" : "пятидесятое@") + | ("5[E1]" : "пятидесятой@") + | ("5[E1]" : "пятидесятом@") + | ("5[E1]" : "пятидесятому@") + | ("5[E1]" : "пятидесятую@") + | ("5[E1]" : "пятидесятые@") + | ("5[E1]" : "пятидесятый@") + | ("5[E1]" : "пятидесятым@") + | ("5[E1]" : "пятидесятыми@") + | ("5[E1]" : "пятидесятых@") + | ("5[E1]" : "пятьдесят") + | ("5[E1]" : "пятьюдесятью") + | ("5[E2]" : "пятисот") + | ("5[E2]" : "пятисотая@") + | ("5[E2]" : "пятисотого@") + | ("5[E2]" : "пятисотое@") + | ("5[E2]" : "пятисотой@") + | ("5[E2]" : "пятисотом@") + | ("5[E2]" : "пятисотому@") + | ("5[E2]" : "пятисотую@") + | ("5[E2]" : "пятисотые@") + | ("5[E2]" : "пятисотый@") + | ("5[E2]" : "пятисотым@") + | ("5[E2]" : "пятисотыми@") + | ("5[E2]" : "пятисотых@") + | ("5[E2]" : "пятистам") + | ("5[E2]" : "пятистах") + | ("5[E2]" : "пятьсот") + | ("5[E2]" : "пятьюстами") + | ("6[E1]" : "шестидесятая@") + | ("6[E1]" : "шестидесяти") + | ("6[E1]" : "шестидесятого@") + | ("6[E1]" : "шестидесятое@") + | ("6[E1]" : "шестидесятой@") + | ("6[E1]" : "шестидесятом@") + | ("6[E1]" : "шестидесятому@") + | ("6[E1]" : "шестидесятую@") + | ("6[E1]" : "шестидесятые@") + | ("6[E1]" : "шестидесятый@") + | ("6[E1]" : "шестидесятым@") + | ("6[E1]" : "шестидесятыми@") + | ("6[E1]" : "шестидесятых@") + | ("6[E1]" : "шестьдесят") + | ("6[E1]" : "шестьюдесятью") + | ("6[E2]" : "шестисот") + | ("6[E2]" : "шестисотая@") + | ("6[E2]" : "шестисотого@") + | ("6[E2]" : "шестисотое@") + | ("6[E2]" : "шестисотой@") + | ("6[E2]" : "шестисотом@") + | ("6[E2]" : "шестисотому@") + | ("6[E2]" : "шестисотую@") + | ("6[E2]" : "шестисотые@") + | ("6[E2]" : "шестисотый@") + | ("6[E2]" : "шестисотым@") + | ("6[E2]" : "шестисотыми@") + | ("6[E2]" : "шестисотых@") + | ("6[E2]" : "шестистам") + | ("6[E2]" : "шестистах") + | ("6[E2]" : "шестьсот") + | ("6[E2]" : "шестьюстами") + | ("7[E1]" : "семидесятая@") + | ("7[E1]" : "семидесяти") + | ("7[E1]" : "семидесятого@") + | ("7[E1]" : "семидесятое@") + | ("7[E1]" : "семидесятой@") + | ("7[E1]" : "семидесятом@") + | ("7[E1]" : "семидесятому@") + | ("7[E1]" : "семидесятую@") + | ("7[E1]" : "семидесятые@") + | ("7[E1]" : "семидесятый@") + | ("7[E1]" : "семидесятым@") + | ("7[E1]" : "семидесятыми@") + | ("7[E1]" : "семидесятых@") + | ("7[E1]" : "семьдесят") + | ("7[E1]" : "семьюдесятью") + | ("7[E2]" : "семисот") + | ("7[E2]" : "семисотая@") + | ("7[E2]" : "семисотого@") + | ("7[E2]" : "семисотое@") + | ("7[E2]" : "семисотой@") + | ("7[E2]" : "семисотом@") + | ("7[E2]" : "семисотому@") + | ("7[E2]" : "семисотую@") + | ("7[E2]" : "семисотые@") + | ("7[E2]" : "семисотый@") + | ("7[E2]" : "семисотым@") + | ("7[E2]" : "семисотыми@") + | ("7[E2]" : "семисотых@") + | ("7[E2]" : "семистам") + | ("7[E2]" : "семистах") + | ("7[E2]" : "семьсот") + | ("7[E2]" : "семьюстами") + | ("8[E1]" : "восемьдесят") + | ("8[E1]" : "восьмидесятая@") + | ("8[E1]" : "восьмидесяти") + | ("8[E1]" : "восьмидесятого@") + | ("8[E1]" : "восьмидесятое@") + | ("8[E1]" : "восьмидесятой@") + | ("8[E1]" : "восьмидесятом@") + | ("8[E1]" : "восьмидесятому@") + | ("8[E1]" : "восьмидесятую@") + | ("8[E1]" : "восьмидесятые@") + | ("8[E1]" : "восьмидесятый@") + | ("8[E1]" : "восьмидесятым@") + | ("8[E1]" : "восьмидесятыми@") + | ("8[E1]" : "восьмидесятых@") + | ("8[E1]" : "восьмьюдесятью") + | ("8[E2]" : "восемьсот") + | ("8[E2]" : "восемьюстами") + | ("8[E2]" : "восьмисот") + | ("8[E2]" : "восьмисотая@") + | ("8[E2]" : "восьмисотого@") + | ("8[E2]" : "восьмисотое@") + | ("8[E2]" : "восьмисотой@") + | ("8[E2]" : "восьмисотом@") + | ("8[E2]" : "восьмисотому@") + | ("8[E2]" : "восьмисотую@") + | ("8[E2]" : "восьмисотые@") + | ("8[E2]" : "восьмисотый@") + | ("8[E2]" : "восьмисотым@") + | ("8[E2]" : "восьмисотыми@") + | ("8[E2]" : "восьмисотых@") + | ("8[E2]" : "восьмистам") + | ("8[E2]" : "восьмистах") + | ("8[E2]" : "восьмьюстами") + | ("9[E1]" : "девяноста") + | ("9[E1]" : "девяностая@") + | ("9[E1]" : "девяносто") + | ("9[E1]" : "девяностого@") + | ("9[E1]" : "девяностое@") + | ("9[E1]" : "девяностой@") + | ("9[E1]" : "девяностом@") + | ("9[E1]" : "девяностому@") + | ("9[E1]" : "девяностую@") + | ("9[E1]" : "девяностые@") + | ("9[E1]" : "девяностый@") + | ("9[E1]" : "девяностым@") + | ("9[E1]" : "девяностыми@") + | ("9[E1]" : "девяностых@") + | ("9[E2]" : "девятисот") + | ("9[E2]" : "девятисотая@") + | ("9[E2]" : "девятисотого@") + | ("9[E2]" : "девятисотое@") + | ("9[E2]" : "девятисотой@") + | ("9[E2]" : "девятисотом@") + | ("9[E2]" : "девятисотому@") + | ("9[E2]" : "девятисотую@") + | ("9[E2]" : "девятисотые@") + | ("9[E2]" : "девятисотый@") + | ("9[E2]" : "девятисотым@") + | ("9[E2]" : "девятисотыми@") + | ("9[E2]" : "девятисотых@") + | ("9[E2]" : "девятистам") + | ("9[E2]" : "девятистах") + | ("9[E2]" : "девятьсот") + | ("9[E2]" : "девятьюстами")] +; + +lex2 = CDRewrite[lexset2 I[space], "", "", SIGMA_STAR]; + +lexset1 = Optimize[ + ("+" : "") + | ("1" : "один") + | ("1" : "одна") + | ("1" : "одни") + | ("1" : "одним") + | ("1" : "одними") + | ("1" : "одних") + | ("1" : "одно") + | ("1" : "одного") + | ("1" : "одной") + | ("1" : "одном") + | ("1" : "одному") + | ("1" : "одною") + | ("1" : "одну") + | ("1" : "первая@") + | ("1" : "первого@") + | ("1" : "первое@") + | ("1" : "первой@") + | ("1" : "первом@") + | ("1" : "первому@") + | ("1" : "первую@") + | ("1" : "первые@") + | ("1" : "первый@") + | ("1" : "первым@") + | ("1" : "первыми@") + | ("1" : "первых@") + | ("2" : "вторая@") + | ("2" : "второго@") + | ("2" : "второе@") + | ("2" : "второй@") + | ("2" : "втором@") + | ("2" : "второму@") + | ("2" : "вторую@") + | ("2" : "вторые@") + | ("2" : "вторым@") + | ("2" : "вторыми@") + | ("2" : "вторых@") + | ("2" : "два") + | ("2" : "две") + | ("2" : "двум") + | ("2" : "двумя") + | ("2" : "двух") + | ("3" : "трем") + | ("3" : "тремя") + | ("3" : "третий@") + | ("3" : "третье@") + | ("3" : "третьего@") + | ("3" : "третьей@") + | ("3" : "третьем@") + | ("3" : "третьему@") + | ("3" : "третьи@") + | ("3" : "третьим@") + | ("3" : "третьими@") + | ("3" : "третьих@") + | ("3" : "третью@") + | ("3" : "третья@") + | ("3" : "трех") + | ("3" : "три") + | ("4" : "четвертая@") + | ("4" : "четвертого@") + | ("4" : "четвертое@") + | ("4" : "четвертой@") + | ("4" : "четвертом@") + | ("4" : "четвертому@") + | ("4" : "четвертую@") + | ("4" : "четвертые@") + | ("4" : "четвертый@") + | ("4" : "четвертым@") + | ("4" : "четвертыми@") + | ("4" : "четвертых@") + | ("4" : "четыре") + | ("4" : "четырем") + | ("4" : "четырех") + | ("4" : "четырьмя") + | ("5" : "пятая@") + | ("5" : "пяти") + | ("5" : "пятого@") + | ("5" : "пятое@") + | ("5" : "пятой@") + | ("5" : "пятом@") + | ("5" : "пятому@") + | ("5" : "пятую@") + | ("5" : "пятые@") + | ("5" : "пятый@") + | ("5" : "пятым@") + | ("5" : "пятыми@") + | ("5" : "пятых@") + | ("5" : "пять") + | ("5" : "пятью") + | ("6" : "шестая@") + | ("6" : "шести") + | ("6" : "шестого@") + | ("6" : "шестое@") + | ("6" : "шестой@") + | ("6" : "шестом@") + | ("6" : "шестому@") + | ("6" : "шестую@") + | ("6" : "шестые@") + | ("6" : "шестым@") + | ("6" : "шестыми@") + | ("6" : "шестых@") + | ("6" : "шесть") + | ("6" : "шестью") + | ("7" : "седьмая@") + | ("7" : "седьмого@") + | ("7" : "седьмое@") + | ("7" : "седьмой@") + | ("7" : "седьмом@") + | ("7" : "седьмому@") + | ("7" : "седьмую@") + | ("7" : "седьмые@") + | ("7" : "седьмым@") + | ("7" : "седьмыми@") + | ("7" : "седьмых@") + | ("7" : "семи") + | ("7" : "семь") + | ("7" : "семью") + | ("8" : "восемь") + | ("8" : "восьмая@") + | ("8" : "восьми") + | ("8" : "восьмого@") + | ("8" : "восьмое@") + | ("8" : "восьмой@") + | ("8" : "восьмом@") + | ("8" : "восьмому@") + | ("8" : "восьмую@") + | ("8" : "восьмые@") + | ("8" : "восьмым@") + | ("8" : "восьмыми@") + | ("8" : "восьмых@") + | ("8" : "восьмью") + | ("9" : "девятая@") + | ("9" : "девяти") + | ("9" : "девятого@") + | ("9" : "девятое@") + | ("9" : "девятой@") + | ("9" : "девятом@") + | ("9" : "девятому@") + | ("9" : "девятую@") + | ("9" : "девятые@") + | ("9" : "девятый@") + | ("9" : "девятым@") + | ("9" : "девятыми@") + | ("9" : "девятых@") + | ("9" : "девять") + | ("9" : "девятью") + | ("[E3]" : "тысяч") + | ("[E3]" : "тысяча") + | ("[E3]" : "тысячам") + | ("[E3]" : "тысячами") + | ("[E3]" : "тысячах") + | ("[E3]" : "тысяче") + | ("[E3]" : "тысячей") + | ("[E3]" : "тысячи") + | ("[E3]" : "тысячу") + | ("[E3]" : "тысячью") + | ("[E6]" : "миллион") + | ("[E6]" : "миллиона") + | ("[E6]" : "миллионам") + | ("[E6]" : "миллионами") + | ("[E6]" : "миллионах") + | ("[E6]" : "миллионе") + | ("[E6]" : "миллионов") + | ("[E6]" : "миллионом") + | ("[E6]" : "миллиону") + | ("[E6]" : "миллионы") + | ("[E9]" : "миллиард") + | ("[E9]" : "миллиарда") + | ("[E9]" : "миллиардам") + | ("[E9]" : "миллиардами") + | ("[E9]" : "миллиардах") + | ("[E9]" : "миллиарде") + | ("[E9]" : "миллиардов") + | ("[E9]" : "миллиардом") + | ("[E9]" : "миллиарду") + | ("[E9]" : "миллиарды") + | ("|0|" : "ноле") + | ("|0|" : "нолем") + | ("|0|" : "ноль") + | ("|0|" : "нолю") + | ("|0|" : "ноля") + | ("|0|" : "нуле") + | ("|0|" : "нулем") + | ("|0|" : "нуль") + | ("|0|" : "нулю") + | ("|0|" : "нуля")] +; + +lex1 = CDRewrite[lexset1 I[space], "", "", SIGMA_STAR]; + +export LEX = Optimize[lex3 @ lex2 @ lex1]; + +export INDEPENDENT_EXPONENTS = "[E3]" | "[E6]" | "[E9]"; + +# END LANGUAGE SPECIFIC DATA +################################################################################ +# Inserts a marker after the Ms. +export INSERT_BOUNDARY = CDRewrite["" : "%", Ms, "", SIGMA_STAR]; + +# Deletes all powers and "+". +export DELETE_POWERS = CDRewrite[D[POWERS | "+"], "", "", SIGMA_STAR]; + +# Deletes trailing zeros at the beginning of a number, so that "0003" does not +# get treated as an ordinary number. +export DELETE_INITIAL_ZEROS = + CDRewrite[("0" POWERS "+") : "", "[BOS]", "", SIGMA_STAR] +; + +NonMs = Optimize[POWERS - Ms]; + +# Deletes (usually) zeros before a non-M. E.g., +0[E1] should be +# deleted +export DELETE_INTERMEDIATE_ZEROS1 = + CDRewrite[Zero["+0" NonMs], "", "", SIGMA_STAR] +; + +# Deletes (usually) zeros before an M, if there is no non-zero element between +# that and the previous boundary. Thus, if after the result of the rule above we +# end up with "%+0[E3]", then that gets deleted. Also (really) deletes a final +# zero. +export DELETE_INTERMEDIATE_ZEROS2 = Optimize[ + CDRewrite[Zero["%+0" Ms], "", "", SIGMA_STAR] + @ CDRewrite[D["+0"], "", "[EOS]", SIGMA_STAR]] +; + +# Final clean up of stray zeros. +export DELETE_REMAINING_ZEROS = Optimize[ + CDRewrite[Zero["+0"], "", "", SIGMA_STAR] + @ CDRewrite[Zero["0"], "", "", SIGMA_STAR]] +; + +# Applies the revaluation map. For example in English, change [E4] to [E1] as a +# modifier of [E3] +export REVALUE = CDRewrite[revaluations, "", "", SIGMA_STAR]; + +# Deletes the various marks and powers in the input and output. +export DELETE_MARKS = CDRewrite[D["%" | "+" | POWERS], "", "", SIGMA_STAR]; + +export CLEAN_SPACES = Optimize[ + CDRewrite[" "+ : " ", b.kNotSpace, b.kNotSpace, SIGMA_STAR] + @ CDRewrite[" "* : "", "[BOS]", "", SIGMA_STAR] + @ CDRewrite[" "* : "", "", "[EOS]", SIGMA_STAR]] +; + +d = b.kDigit; + +# Germanic inversion rule. +germanic = + (I["1+"] d "[E1]" D["+1"]) + | (I["2+"] d "[E1]" D["+2"]) + | (I["3+"] d "[E1]" D["+3"]) + | (I["4+"] d "[E1]" D["+4"]) + | (I["5+"] d "[E1]" D["+5"]) + | (I["6+"] d "[E1]" D["+6"]) + | (I["7+"] d "[E1]" D["+7"]) + | (I["8+"] d "[E1]" D["+8"]) + | (I["9+"] d "[E1]" D["+9"]) +; + +germanic_inversion = + CDRewrite[germanic, "", "", SIGMA_STAR, 'ltr', 'opt'] +; + +export GERMANIC_INVERSION = SIGMA_STAR; +export ORDINAL_RESTRICTION = + Optimize[((SIGMA - "@")* "@") @ CDRewrite[D["@"], "", "", SIGMA_STAR]] +; +nondigits = b.kBytes - b.kDigit; +export ORDINAL_SUFFIX = D[nondigits*]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/ordinals.tsv b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/ordinals.tsv new file mode 100644 index 000000000..367e14b11 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/ordinals.tsv @@ -0,0 +1,527 @@ +0 нулевая +0 нулевого +0 нулевое +0 нулевой +0 нулевом +0 нулевому +0 нулевую +0 нулевые +0 нулевым +0 нулевым +0 нулевыми +0 нулевых +1 первая +1 первого +1 первое +1 первой +1 первом +1 первому +1 первую +1 первые +1 первый +1 первым +1 первым +1 первыми +1 первых +2 вторая +2 второго +2 второе +2 второй +2 втором +2 второму +2 вторую +2 вторые +2 вторым +2 вторым +2 вторыми +2 вторых +3 третий +3 третье +3 третьего +3 третьей +3 третьем +3 третьему +3 третьи +3 третьим +3 третьим +3 третьими +3 третьих +3 третью +3 третья +4 четвертая +4 четвертого +4 четвертое +4 четвертой +4 четвертом +4 четвертому +4 четвертую +4 четвертые +4 четвертый +4 четвертым +4 четвертым +4 четвертыми +4 четвертых +4 четвёртая +4 четвёртого +4 четвёртое +4 четвёртой +4 четвёртом +4 четвёртому +4 четвёртую +4 четвёртые +4 четвёртый +4 четвёртым +4 четвёртым +4 четвёртыми +4 четвёртых +5 пятая +5 пятого +5 пятое +5 пятой +5 пятом +5 пятому +5 пятую +5 пятые +5 пятый +5 пятым +5 пятым +5 пятыми +5 пятых +6 шестая +6 шестого +6 шестое +6 шестой +6 шестом +6 шестому +6 шестую +6 шестые +6 шестым +6 шестым +6 шестыми +6 шестых +7 седьмая +7 седьмого +7 седьмое +7 седьмой +7 седьмом +7 седьмому +7 седьмую +7 седьмые +7 седьмым +7 седьмым +7 седьмыми +7 седьмых +8 восьмая +8 восьмого +8 восьмое +8 восьмой +8 восьмом +8 восьмому +8 восьмую +8 восьмые +8 восьмым +8 восьмым +8 восьмыми +8 восьмых +9 девятая +9 девятого +9 девятое +9 девятой +9 девятом +9 девятому +9 девятую +9 девятые +9 девятый +9 девятым +9 девятым +9 девятыми +9 девятых +10 десятая +10 десятого +10 десятое +10 десятой +10 десятом +10 десятому +10 десятую +10 десятые +10 десятый +10 десятым +10 десятым +10 десятыми +10 десятых +11 одиннадцатая +11 одиннадцатого +11 одиннадцатое +11 одиннадцатой +11 одиннадцатом +11 одиннадцатому +11 одиннадцатую +11 одиннадцатые +11 одиннадцатый +11 одиннадцатым +11 одиннадцатым +11 одиннадцатыми +11 одиннадцатых +12 двенадцатая +12 двенадцатого +12 двенадцатое +12 двенадцатой +12 двенадцатом +12 двенадцатому +12 двенадцатую +12 двенадцатые +12 двенадцатый +12 двенадцатым +12 двенадцатым +12 двенадцатыми +12 двенадцатых +13 тринадцатая +13 тринадцатого +13 тринадцатое +13 тринадцатой +13 тринадцатом +13 тринадцатому +13 тринадцатую +13 тринадцатые +13 тринадцатый +13 тринадцатым +13 тринадцатым +13 тринадцатыми +13 тринадцатых +14 четырнадцатая +14 четырнадцатого +14 четырнадцатое +14 четырнадцатой +14 четырнадцатом +14 четырнадцатому +14 четырнадцатую +14 четырнадцатые +14 четырнадцатый +14 четырнадцатым +14 четырнадцатым +14 четырнадцатыми +14 четырнадцатых +15 пятнадцатая +15 пятнадцатого +15 пятнадцатое +15 пятнадцатой +15 пятнадцатом +15 пятнадцатому +15 пятнадцатую +15 пятнадцатые +15 пятнадцатый +15 пятнадцатым +15 пятнадцатым +15 пятнадцатыми +15 пятнадцатых +16 шестнадцатая +16 шестнадцатого +16 шестнадцатое +16 шестнадцатой +16 шестнадцатом +16 шестнадцатому +16 шестнадцатую +16 шестнадцатые +16 шестнадцатый +16 шестнадцатым +16 шестнадцатым +16 шестнадцатыми +16 шестнадцатых +17 семнадцатая +17 семнадцатого +17 семнадцатое +17 семнадцатой +17 семнадцатом +17 семнадцатому +17 семнадцатую +17 семнадцатые +17 семнадцатый +17 семнадцатым +17 семнадцатым +17 семнадцатыми +17 семнадцатых +18 восемнадцатая +18 восемнадцатого +18 восемнадцатое +18 восемнадцатой +18 восемнадцатом +18 восемнадцатому +18 восемнадцатую +18 восемнадцатые +18 восемнадцатый +18 восемнадцатым +18 восемнадцатым +18 восемнадцатыми +18 восемнадцатых +19 девятнадцатая +19 девятнадцатого +19 девятнадцатое +19 девятнадцатой +19 девятнадцатом +19 девятнадцатому +19 девятнадцатую +19 девятнадцатые +19 девятнадцатый +19 девятнадцатым +19 девятнадцатым +19 девятнадцатыми +19 девятнадцатых +20 двадцатая +20 двадцатого +20 двадцатое +20 двадцатой +20 двадцатом +20 двадцатому +20 двадцатую +20 двадцатые +20 двадцатый +20 двадцатым +20 двадцатым +20 двадцатыми +20 двадцатых +30 тридцатая +30 тридцатого +30 тридцатое +30 тридцатой +30 тридцатом +30 тридцатому +30 тридцатую +30 тридцатые +30 тридцатый +30 тридцатым +30 тридцатым +30 тридцатыми +30 тридцатых +40 сороковая +40 сорокового +40 сороковое +40 сороковой +40 сороковом +40 сороковому +40 сороковую +40 сороковые +40 сороковым +40 сороковым +40 сороковыми +40 сороковых +50 пятидесятая +50 пятидесятого +50 пятидесятое +50 пятидесятой +50 пятидесятом +50 пятидесятому +50 пятидесятую +50 пятидесятые +50 пятидесятый +50 пятидесятым +50 пятидесятым +50 пятидесятыми +50 пятидесятых +60 шестидесятая +60 шестидесятого +60 шестидесятое +60 шестидесятой +60 шестидесятом +60 шестидесятому +60 шестидесятую +60 шестидесятые +60 шестидесятый +60 шестидесятым +60 шестидесятым +60 шестидесятыми +60 шестидесятых +70 семидесятая +70 семидесятого +70 семидесятое +70 семидесятой +70 семидесятом +70 семидесятому +70 семидесятую +70 семидесятые +70 семидесятый +70 семидесятым +70 семидесятым +70 семидесятыми +70 семидесятых +80 восьмидесятая +80 восьмидесятого +80 восьмидесятое +80 восьмидесятой +80 восьмидесятом +80 восьмидесятому +80 восьмидесятую +80 восьмидесятые +80 восьмидесятый +80 восьмидесятым +80 восьмидесятым +80 восьмидесятыми +80 восьмидесятых +90 девяностая +90 девяностого +90 девяностое +90 девяностой +90 девяностом +90 девяностому +90 девяностую +90 девяностые +90 девяностый +90 девяностым +90 девяностым +90 девяностыми +90 девяностых +100 сотая +100 сотого +100 сотое +100 сотой +100 сотом +100 сотому +100 сотую +100 сотые +100 сотый +100 сотым +100 сотым +100 сотыми +100 сотых +200 двухсотая +200 двухсотого +200 двухсотое +200 двухсотой +200 двухсотом +200 двухсотому +200 двухсотую +200 двухсотые +200 двухсотый +200 двухсотым +200 двухсотым +200 двухсотыми +200 двухсотых +300 трехсотая +300 трехсотого +300 трехсотое +300 трехсотой +300 трехсотом +300 трехсотому +300 трехсотую +300 трехсотые +300 трехсотый +300 трехсотым +300 трехсотым +300 трехсотыми +300 трехсотых +400 четырехсотая +400 четырехсотого +400 четырехсотое +400 четырехсотой +400 четырехсотом +400 четырехсотому +400 четырехсотую +400 четырехсотые +400 четырехсотый +400 четырехсотым +400 четырехсотым +400 четырехсотыми +400 четырехсотых +500 пятисотая +500 пятисотого +500 пятисотое +500 пятисотой +500 пятисотом +500 пятисотому +500 пятисотую +500 пятисотые +500 пятисотый +500 пятисотым +500 пятисотым +500 пятисотыми +500 пятисотых +600 шестисотая +600 шестисотого +600 шестисотое +600 шестисотой +600 шестисотом +600 шестисотому +600 шестисотую +600 шестисотые +600 шестисотый +600 шестисотым +600 шестисотым +600 шестисотыми +600 шестисотых +700 семисотая +700 семисотого +700 семисотое +700 семисотой +700 семисотом +700 семисотому +700 семисотую +700 семисотые +700 семисотый +700 семисотым +700 семисотым +700 семисотыми +700 семисотых +800 восьмисотая +800 восьмисотого +800 восьмисотое +800 восьмисотой +800 восьмисотом +800 восьмисотому +800 восьмисотую +800 восьмисотые +800 восьмисотый +800 восьмисотым +800 восьмисотым +800 восьмисотыми +800 восьмисотых +900 девятисотая +900 девятисотого +900 девятисотое +900 девятисотой +900 девятисотом +900 девятисотому +900 девятисотую +900 девятисотые +900 девятисотый +900 девятисотым +900 девятисотым +900 девятисотыми +900 девятисотых +1000 тысячная +1000 тысячного +1000 тысячное +1000 тысячной +1000 тысячном +1000 тысячному +1000 тысячную +1000 тысячные +1000 тысячный +1000 тысячным +1000 тысячным +1000 тысячными +1000 тысячных +1000000 миллионная +1000000 миллионного +1000000 миллионное +1000000 миллионной +1000000 миллионном +1000000 миллионному +1000000 миллионную +1000000 миллионные +1000000 миллионный +1000000 миллионным +1000000 миллионным +1000000 миллионными +1000000 миллионных +1000000000 миллиардная +1000000000 миллиардного +1000000000 миллиардное +1000000000 миллиардной +1000000000 миллиардном +1000000000 миллиардному +1000000000 миллиардную +1000000000 миллиардные +1000000000 миллиардный +1000000000 миллиардным +1000000000 миллиардным +1000000000 миллиардными +1000000000 миллиардных diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/spelled.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/spelled.grm new file mode 100644 index 000000000..123759ba9 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/spelled.grm @@ -0,0 +1,77 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# This verbalizer is used whenever there is an LM symbol that consists of +# letters immediately followed by "{spelled}". This strips the "{spelled}" +# suffix. + +import 'util/byte.grm' as b; +import 'ru/classifier/cyrillic.grm' as c; +import 'ru/verbalizer/lexical_map.grm' as l; +import 'ru/verbalizer/numbers.grm' as n; + +digit = b.kDigit @ n.CARDINAL_NUMBERS; + +char_set = (("a" | "A") : "letter-a") + | (("b" | "B") : "letter-b") + | (("c" | "C") : "letter-c") + | (("d" | "D") : "letter-d") + | (("e" | "E") : "letter-e") + | (("f" | "F") : "letter-f") + | (("g" | "G") : "letter-g") + | (("h" | "H") : "letter-h") + | (("i" | "I") : "letter-i") + | (("j" | "J") : "letter-j") + | (("k" | "K") : "letter-k") + | (("l" | "L") : "letter-l") + | (("m" | "M") : "letter-m") + | (("n" | "N") : "letter-n") + | (("o" | "O") : "letter-o") + | (("p" | "P") : "letter-p") + | (("q" | "Q") : "letter-q") + | (("r" | "R") : "letter-r") + | (("s" | "S") : "letter-s") + | (("t" | "T") : "letter-t") + | (("u" | "U") : "letter-u") + | (("v" | "V") : "letter-v") + | (("w" | "W") : "letter-w") + | (("x" | "X") : "letter-x") + | (("y" | "Y") : "letter-y") + | (("z" | "Z") : "letter-z") + | (digit) + | ("&" : "@@AND@@") + | ("." : "") + | ("-" : "") + | ("_" : "") + | ("/" : "") + | (n.I["letter-"] c.kCyrillicAlpha) + ; + +ins_space = "" : " "; + +suffix = "{spelled}" : ""; + +spelled = Optimize[char_set (ins_space char_set)* suffix]; + +export SPELLED = Optimize[spelled @ l.LEXICAL_MAP]; + +sigma_star = b.kBytes*; + +# Gets rid of the letter- prefix since in some cases we don't want it. + +del_letter = CDRewrite[n.D["letter-"], "", "", sigma_star]; + +spelled_no_tag = Optimize[char_set (ins_space char_set)*]; + +export SPELLED_NO_LETTER = Optimize[spelled_no_tag @ del_letter]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/spoken_punct.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/spoken_punct.grm new file mode 100644 index 000000000..26a1bf27f --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/spoken_punct.grm @@ -0,0 +1,24 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'ru/verbalizer/lexical_map.grm' as l; + +punct = + ("." : "@@PERIOD@@") + | ("," : "@@COMMA@@") + | ("!" : "@@EXCLAMATION_MARK@@") + | ("?" : "@@QUESTION_MARK@@") +; + +export SPOKEN_PUNCT = Optimize[punct @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/time.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/time.grm new file mode 100644 index 000000000..a416aba7d --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/time.grm @@ -0,0 +1,108 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/byte.grm' as b; +import 'ru/verbalizer/lexical_map.grm' as l; +import 'ru/verbalizer/numbers.grm' as n; + +# Only handles 24-hour time with quarter-to, half-past and quarter-past. + +increment_hour = + ("0" : "1") + | ("1" : "2") + | ("2" : "3") + | ("3" : "4") + | ("4" : "5") + | ("5" : "6") + | ("6" : "7") + | ("7" : "8") + | ("8" : "9") + | ("9" : "10") + | ("10" : "11") + | ("11" : "12") + | ("12" : "1") # If someone uses 12, we assume 12-hour by default. + | ("13" : "14") + | ("14" : "15") + | ("15" : "16") + | ("16" : "17") + | ("17" : "18") + | ("18" : "19") + | ("19" : "20") + | ("20" : "21") + | ("21" : "22") + | ("22" : "23") + | ("23" : "12") +; + +hours = Project[increment_hour, 'input']; + +d = b.kDigit; +D = d - "0"; + +minutes09 = "0" D; + +minutes = ("1" | "2" | "3" | "4" | "5") d; + +__sep__ = ":"; +sep_space = __sep__ : " "; + +verbalize_hours = hours @ n.CARDINAL_NUMBERS; + +verbalize_minutes = + ("00" : "@@HOUR@@") + | (minutes09 @ (("0" : "@@TIME_ZERO@@") n.I[" "] n.CARDINAL_NUMBERS)) + | (minutes @ n.CARDINAL_NUMBERS) +; + +time_basic = Optimize[verbalize_hours sep_space verbalize_minutes]; + +# Special cases we handle right now. +# TODO: Need to allow for cases like +# +# half twelve (in the UK English sense) +# half twaalf (in the Dutch sense) + +time_quarter_past = + n.I["@@TIME_QUARTER@@ @@TIME_AFTER@@ "] + verbalize_hours + n.D[__sep__ "15"]; + +time_half_past = + n.I["@@TIME_HALF@@ @@TIME_AFTER@@ "] + verbalize_hours + n.D[__sep__ "30"]; + +time_quarter_to = + n.I["@@TIME_QUARTER@@ @@TIME_BEFORE@@ "] + (increment_hour @ verbalize_hours) + n.D[__sep__ "45"]; + +time_extra = Optimize[ + time_quarter_past | time_half_past | time_quarter_to] +; + +# Basic time periods which most languages can be expected to have. +__am__ = "a.m." | "am" | "AM" | "утра"; +__pm__ = "p.m." | "pm" | "PM" | "вечера"; + +period = (__am__ : "@@TIME_AM@@") | (__pm__ : "@@TIME_PM@@"); + +time_variants = time_basic | time_extra; + +time = Optimize[ + (period (" " | n.I[" "]))? time_variants + | time_variants ((" " | n.I[" "]) period)?] +; + +export TIME = Optimize[time @ l.LEXICAL_MAP]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/urls.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/urls.grm new file mode 100644 index 000000000..3039b6521 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/urls.grm @@ -0,0 +1,68 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Rules for URLs and email addresses. + +import 'util/byte.grm' as bytelib; +import 'ru/verbalizer/lexical_map.grm' as l; + +ins_space = "" : " "; +dot = "." : "@@URL_DOT_EXPRESSION@@"; +at = "@" : "@@AT@@"; + +url_suffix = + (".com" : dot ins_space "com") | + (".gov" : dot ins_space "gov") | + (".edu" : dot ins_space "e d u") | + (".org" : dot ins_space "org") | + (".net" : dot ins_space "net") +; + +letter_string = (bytelib.kAlnum)* bytelib.kAlnum; + +letter_string_dot = + ((letter_string ins_space dot ins_space)* letter_string) +; + +# Rules for URLs. +export URL = Optimize[ + ((letter_string_dot) (ins_space) + (url_suffix)) @ l.LEXICAL_MAP +]; + +# Rules for email addresses. +letter_by_letter = ((bytelib.kAlnum ins_space)* bytelib.kAlnum); + +letter_by_letter_dot = + ((letter_by_letter ins_space dot ins_space)* + letter_by_letter) +; + +export EMAIL1 = Optimize[ + ((letter_by_letter) (ins_space) + (at) (ins_space) + (letter_by_letter_dot) (ins_space) + (url_suffix)) @ l.LEXICAL_MAP +]; + +export EMAIL2 = Optimize[ + ((letter_by_letter) (ins_space) + (at) (ins_space) + (letter_string_dot) (ins_space) + (url_suffix)) @ l.LEXICAL_MAP +]; + +export EMAILS = Optimize[ + EMAIL1 | EMAIL2 +]; diff --git a/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/verbalizer.grm b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/verbalizer.grm new file mode 100644 index 000000000..ddd469685 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/ru/verbalizer/verbalizer.grm @@ -0,0 +1,42 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +import 'util/util.grm' as util; +import 'ru/verbalizer/extra_numbers.grm' as e; +import 'ru/verbalizer/float.grm' as f; +import 'ru/verbalizer/math.grm' as ma; +import 'ru/verbalizer/miscellaneous.grm' as mi; +import 'ru/verbalizer/money.grm' as mo; +import 'ru/verbalizer/numbers.grm' as n; +import 'ru/verbalizer/numbers_plus.grm' as np; +import 'ru/verbalizer/spelled.grm' as s; +import 'ru/verbalizer/spoken_punct.grm' as sp; +import 'ru/verbalizer/time.grm' as t; +import 'ru/verbalizer/urls.grm' as u; + +export VERBALIZER = Optimize[RmWeight[ + ( e.MIXED_NUMBERS + | e.DIGITS + | f.FLOAT + | ma.ARITHMETIC + | mi.MISCELLANEOUS + | mo.MONEY + | n.CARDINAL_NUMBERS + | n.ORDINAL_NUMBERS + | np.NUMBERS_PLUS + | s.SPELLED + | sp.SPOKEN_PUNCT + | t.TIME + | u.URL) @ util.CLEAN_SPACES +]]; diff --git a/third_party/chinese_text_normalization/thrax/src/universal/README.md b/third_party/chinese_text_normalization/thrax/src/universal/README.md new file mode 100644 index 000000000..33225f6da --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/universal/README.md @@ -0,0 +1,3 @@ +# Language-universal grammar definitions + +This directory contains various language-universal grammar definitions. diff --git a/third_party/chinese_text_normalization/thrax/src/universal/roman_numerals.tsv b/third_party/chinese_text_normalization/thrax/src/universal/roman_numerals.tsv new file mode 100644 index 000000000..98a8d97d9 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/universal/roman_numerals.tsv @@ -0,0 +1,91 @@ +i 1 +ii 2 +iii 3 +iv 4 +v 5 +vi 6 +vii 7 +viii 8 +ix 9 +x 10 +xi 11 +xii 12 +xiii 13 +xiv 14 +xv 15 +xvi 16 +xvii 17 +xviii 18 +xix 19 +xx 20 +xxi 21 +xxii 22 +xxiii 23 +xxiv 24 +xxv 25 +xxvi 26 +xxvii 27 +xxviii 28 +xxix 29 +xxx 30 +xxxi 31 +xxxii 32 +xxxiii 33 +xxxiv 34 +xxxv 35 +xxxvi 36 +xxxvii 37 +xxxviii 38 +xxxix 39 +xl 40 +xli 41 +xlii 42 +xliii 43 +xliv 44 +xlv 45 +xlvi 46 +xlvii 47 +xlviii 48 +xlix 49 +mcmxciv 1994 +mcmxcv 1995 +mcmxcvi 1996 +mcmxcvii 1997 +mcmxcviii 1998 +mcmxcix 1999 +mm 2000 +mmi 2001 +mmii 2002 +mmiii 2003 +mmiv 2004 +mmv 2005 +mmvi 2006 +mmvii 2007 +mmviii 2008 +mmix 2009 +mmx 2010 +mmxi 2011 +mmxii 2012 +mmxiii 2013 +mmxiv 2014 +mmxv 2015 +mmxvi 2016 +mmxvii 2017 +mmxviii 2018 +mmxix 2019 +mmxx 2020 +mmxxi 2021 +mmxxii 2022 +mmxxiii 2023 +mmxxiv 2024 +mmxxv 2025 +mmxxvi 2026 +mmxxvii 2027 +mmxxviii 2028 +mmxxix 2029 +mmxxx 2030 +mmxxxi 2031 +mmxxxii 2032 +mmxxxiii 2033 +mmxxxiv 2034 +mmxxxv 2035 diff --git a/third_party/chinese_text_normalization/thrax/src/universal/thousands_punct.grm b/third_party/chinese_text_normalization/thrax/src/universal/thousands_punct.grm new file mode 100644 index 000000000..90ce4a115 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/universal/thousands_punct.grm @@ -0,0 +1,126 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Specifies common ways of delimiting thousands in digit strings. + +import 'util/byte.grm' as bytelib; +import 'util/util.grm' as util; + +killcomma = "," : ""; +dot2comma = "." : ","; +spaces2comma = " "+ : ","; + +zero = "0"; + +# no_delimiter = zero | "[1-9][0-9]*"; +export no_delimiter = zero | (util.d1to9 bytelib.kDigit*); + +# delim_map_dot = ("[0-9]" | ("\." : ","))*; +delim_map_dot = (bytelib.kDigit | dot2comma)*; + +# delim_map_space = ("[0-9]" | (" +" : ","))*; +delim_map_space = (bytelib.kDigit | spaces2comma)*; + +## Western systems group thousands. Korean goes this way too. + +# comma_thousands = zero | ("[1-9][0-9]?[0-9]?" (("," : "") "[0-9][0-9][0-9]")*); +export comma_thousands = zero | (util.d1to9 bytelib.kDigit{0,2} (killcomma bytelib.kDigit{3})*); + +# ComposeFst: 1st argument cannot match on output labels and 2nd argument +# cannot match on input labels (sort?). +export dot_thousands = delim_map_dot @ comma_thousands; + +# ComposeFst: 1st argument cannot match on output labels and 2nd argument +# cannot match on input labels (sort?). +export space_thousands = delim_map_space @ comma_thousands; + +## Chinese prefers grouping by fours (by ten-thousands). + +# chinese_comma = +# zero | ("[1-9][0-9]?[0-9]?[0-9]?" (("," : "") "[0-9][0-9][0-9][0-9]")*); +export chinese_comma = zero | (util.d1to9 (bytelib.kDigit{0,3}) (killcomma bytelib.kDigit{4})*); + +## The Indian system is more complex because of the Stravinskian alternation +## between lakhs and crores. +## +## According to Wikipedia: +## +## Indian English Value +## One 1 +## Ten 10 +## Hundred 100 +## Thousand 1,000 +## Lakh 1,00,000 +## Crore 1,00,00,000 +## Arab 1,00,00,00,000 +## Kharab 1,00,00,00,00,000 + +# indian_hundreds = "[1-9][0-9]?[0-9]?"; +indian_hundreds = util.d1to9 bytelib.kDigit{0,2}; + +## Up to 99,999. + +# indian_comma_thousands = "[1-9][0-9]?" ("," : "") "[0-9][0-9][0-9]"; +indian_comma_thousands = util.d1to9 bytelib.kDigit? killcomma bytelib.kDigit{3}; + +## Up to 99,99,999. + +# indian_comma_lakhs = "[1-9][0-9]?" ("," : "") "[0-9][0-9]" ("," : "") "[0-9][0-9][0-9]"; +indian_comma_lakhs = util.d1to9 bytelib.kDigit? killcomma bytelib.kDigit{2} killcomma bytelib.kDigit{3}; + +## Up to 999,99,99,999 + +indian_comma_crores = + util.d1to9 bytelib.kDigit? bytelib.kDigit? killcomma + (bytelib.kDigit{2} killcomma)? + bytelib.kDigit{2} killcomma + bytelib.kDigit{3} +; + +## Up to 99,999,99,99,999. + +indian_comma_thousand_crores = + util.d1to9 bytelib.kDigit? killcomma + bytelib.kDigit{3} killcomma + bytelib.kDigit{2} killcomma + bytelib.kDigit{2} killcomma + bytelib.kDigit{3} +; + +## Up to 999,99,999,99,99,999. + +indian_comma_lakh_crores = + util.d1to9 bytelib.kDigit? bytelib.kDigit? killcomma + bytelib.kDigit{2} killcomma + bytelib.kDigit{3} killcomma + bytelib.kDigit{2} killcomma + bytelib.kDigit{2} killcomma + bytelib.kDigit{3} +; + +export indian_comma = + zero + | indian_hundreds + | indian_comma_thousands + | indian_comma_lakhs + | indian_comma_crores + | indian_comma_thousand_crores + | indian_comma_lakh_crores +; + +# Indian number system with dots. +export indian_dot_number = delim_map_dot @ indian_comma; + +# Indian number system with spaces. +export indian_space_number = delim_map_space @ indian_comma; diff --git a/third_party/chinese_text_normalization/thrax/src/util/README.md b/third_party/chinese_text_normalization/thrax/src/util/README.md new file mode 100644 index 000000000..9df3c8035 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/util/README.md @@ -0,0 +1,3 @@ +# Utility grammar definitions + +This directory contains various utility grammar definitions. diff --git a/third_party/chinese_text_normalization/thrax/src/util/arithmetic.grm b/third_party/chinese_text_normalization/thrax/src/util/arithmetic.grm new file mode 100644 index 000000000..b1396db8d --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/util/arithmetic.grm @@ -0,0 +1,326 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Basic arithmetic on S-expressions. Exported arithmetic transducers may either: +# +# * Support weak vigesimal addition and multiplication... +# +# (+ 20 17 +) -> 37 +# (+ 20 10 7 +) -> 37 +# (* 4 20 *) -> 80 +# +# ...or not. +# +# * Support "Germanic decade flop" addition.... +# +# (+ 8 20 +) -> 28 +# (+ 4 60 +) -> 64 +# +# ...or not. +# +# * Support multiplication where the left-hand side multiplicand is of a higher +# order than the right-hand side multiplicand. +# +# (* 1000 100) -> 100000 +# +# ...or not. +# +# However, modulo these exceptions, arithmetic transducers do not support +# addition that requires "carrying", or multiplication where the right-hand +# side multiplicand is not a power of ten. So this is not a *generic* +# S-expression evaluator. +# +# LEAVES is a transducer that accepts symbols in delta but deletes symbols +# in sigma - delta. So it essentially removes markup. +# +# REPEAT_FILTER is an acceptor which blocks derivations of the form +# +# (+ (* 50 1000 *) (* 4 1000) ...) "fifty thousand four thousand..." +# +# in languages where that is not licensed. + +import 'util/byte.grm' as b; + +# Deleter FST. +func D[expr] { + return expr : ""; +} + +delta = b.kDigit; +sigma = delta | " " | "(" | ")" | "+" | "*"; + +sigmastar = sigma*; +deltastar = delta*; + +rparen = Optimize["+)" | "*)"]; +space_or_rparen = Optimize[" " | rparen]; + +## Multiplication. + +# Generic multiplication where the RHS is a power of ten. + +del_one = Optimize[delta+ D[" 1"] "0"+]; + +test1_1 = AssertEqual["2 10" @ del_one, "20"]; +test1_2 = AssertEqual["20 10" @ del_one, "200"]; +test1_3 = AssertEqual["2 100" @ del_one, "200"]; +test1_4 = AssertEqual["20 100" @ del_one, "2000"]; +test1_5 = AssertEqual["200 100" @ del_one, "20000"]; +test1_6 = AssertEqual["2 1000" @ del_one, "2000"]; +test1_7 = AssertEqual["20 1000" @ del_one, "20000"]; +test1_8 = AssertEqual["200 1000" @ del_one, "200000"]; +test1_9 = AssertEqual["2000 1000" @ del_one, "2000000"]; + +# Generic multiplication where the RHS is a power of ten and the LHS has fewer +# trailing zeros than the RHS. +del_one_restricted = Optimize[ # e.g., "2 x 10", "2 x 100", etc. + delta D[" 1"] "0"+ | + # e.g., "20 x 100", etc. + delta{1,2} D[" 1"] "0" "0"+ | + # e.g., "200" x 1000", etc. + delta{2,3} D[" 1"] "0"{2} "0"+ | + delta{3,4} D[" 1"] "0"{3} "0"+ | + delta{4,5} D[" 1"] "0"{4} "0"+]; + +test2_01 = AssertEqual["2 10" @ del_one_restricted, "20"]; +test2_02 = AssertNull["20 10" @ del_one_restricted]; +test2_03 = AssertEqual["2 100" @ del_one_restricted, "200"]; +test2_04 = AssertEqual["20 100" @ del_one_restricted, "2000"]; +test2_05 = AssertNull[ "200 100" @ del_one_restricted]; +test2_06 = AssertEqual["2 1000" @ del_one_restricted, "2000"]; +test2_07 = AssertEqual["20 1000" @ del_one_restricted, "20000"]; +test2_08 = AssertEqual["200 1000" @ del_one_restricted, "200000"]; +test2_09 = AssertNull["2000 1000" @ del_one_restricted]; +test2_10 = AssertEqual["1000 10000000" @ del_one_restricted, "10000000000"]; + +# Multiplication of vigesimal base for weak vigesimal systems + +vigesimal_times_map = ("1" : "2") | ("2" : "4") | ("3" : "6") | ("4" : "8"); + +del_two = Optimize[vigesimal_times_map D[" 2"] "0"+]; + +test3_1 = AssertEqual["1 20" @ del_two, "20"]; +test3_2 = AssertEqual["2 20" @ del_two, "40"]; +test3_3 = AssertEqual["3 20" @ del_two, "60"]; +test3_4 = AssertEqual["4 20" @ del_two, "80"]; + +# Multiplication of vigesimal base restricted to cases where the LHS is [1-4] +# and the RHS is a power of ten. + +del_two_restricted = Optimize[vigesimal_times_map D[" 2"] "0"+]; + +test4_1 = AssertEqual["1 20" @ del_two_restricted, "20"]; +test4_2 = AssertEqual["2 20" @ del_two_restricted, "40"]; +test4_3 = AssertEqual["3 20" @ del_two_restricted, "60"]; +test4_4 = AssertEqual["4 20" @ del_two_restricted, "80"]; +test4_5 = AssertNull["5 20" @ del_two_restricted]; +test4_6 = AssertNull["10 20" @ del_two_restricted]; + +products = del_one | del_two; +products_restricted = del_one_restricted | del_two_restricted; + +multiplication = CDRewrite[D["(* "] products D[" *)"], "", "", sigmastar]; +multiplication_restricted = CDRewrite[D["(* "] products_restricted D[" *)"], + "", "", sigmastar]; + +test5_1 = AssertEqual["(* 8 100 *)" @ multiplication, "800"]; +test5_2 = AssertEqual["(* 1 100 *)" @ multiplication, "100"]; +test5_3 = AssertEqual["(* 4 20 *)" @ multiplication, "80"]; +test5_4 = AssertEqual["(* 13 1000 *)" @ multiplication, "13000"]; +test5_5 = AssertEqual["(* 13000 10 *)" @ multiplication, "130000"]; +test5_6 = AssertEqual["(* 13000 10 *)" @ multiplication_restricted, + "(* 13000 10 *)"]; # Can't reduce this. + +## Addition. + +insum = "+" (sigma - "(")*; +rcon = insum deltastar; + +# Generic zero deletion up to 12. +del_zero = Optimize[ + # Handles lone zero inside a plus statement. + CDRewrite[D[" 0"], rcon, space_or_rparen, sigmastar] @ + # If we need to go any larger, we probably should switch to a PDT. + CDRewrite[D["0"{12} " "] delta{12}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{11} " "] delta{11}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{10} " "] delta{10}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{9} " "] delta{9}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{8} " "] delta{8}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{7} " "] delta{7}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{6} " "] delta{6}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{5} " "] delta{5}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{4} " "] delta{4}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{3} " "] delta{3}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0"{2} " "] delta{2}, rcon, space_or_rparen, sigmastar] @ + CDRewrite[D["0" " "] delta, rcon, space_or_rparen, sigmastar]]; + +## Weak vigesimal cases involving scores and teens. + +vigesimal_plus_map = Optimize[("20 1" : "3") delta | + ("40 1" : "5") delta | + ("60 1" : "7") delta | + ("80 1" : "9") delta]; + +vigesimal = CDRewrite[vigesimal_plus_map, insum, space_or_rparen, sigmastar]; + +## Germanic decade flop. + +germanic_map = StringFile['util/germanic.tsv']; + +germanic = CDRewrite[germanic_map, insum, space_or_rparen, sigmastar]; + +sums = Optimize[germanic @ vigesimal @ del_zero]; + +# Deletes the surrounding "(+ +)" around a successful reduction. + +del_plus = CDRewrite[D["(+ "] delta+ D[" +)"], "", "", sigmastar]; + +addition = Optimize[sums @ del_plus]; + +test6_1 = AssertEqual["(+ 30 2 +)" @ addition, "32"]; +test6_2 = AssertEqual["(+ 300 20 1 +)" @ addition, "321"]; +test6_3 = AssertEqual["(+ 80 17 +)" @ addition, "97"]; +test6_4 = AssertEqual["(+ 4 50 +)" @ addition, "54"]; +test6_5 = AssertEqual["(+ 3000 80 17 +)" @ addition, "3097"]; +test6_6 = AssertEqual["(+ 3000 4 50 +)" @ addition, "3054"]; +test6_7 = AssertEqual["(+ 0 10 +)" @ addition, "10"]; +test6_8 = AssertEqual["(+ 0 20 +)" @ addition, "20"]; +test6_9 = AssertEqual["(+ 200 (+ 0 20 +) +)" @ addition @ addition, "220"]; + +## Export statements. + +export ARITHMETIC = Optimize[multiplication @ addition]; +export ARITHMETIC_RESTRICTED = Optimize[multiplication_restricted @ addition]; + +# Lightweight versions that lack the vigesimal /vɪˈdʒɛsɪməl/ or Germanic decade +# flop, or both. + +export ARITHMETIC_BASIC = Optimize[multiplication @ del_zero @ del_plus]; +export ARITHMETIC_BASIC_RESTRICTED = Optimize[multiplication_restricted @ + del_zero @ del_plus]; + +export ARITHMETIC_GERMANIC = Optimize[multiplication @ germanic @ del_zero @ + del_plus]; + +export ARITHMETIC_GERMANIC_RESTRICTED = Optimize[multiplication_restricted @ + germanic @ del_zero @ + del_plus]; + +export ARITHMETIC_VIGESIMAL = Optimize[multiplication @ vigesimal @ del_zero @ + del_plus]; +export ARITHMETIC_VIGESIMAL_RESTRICTED = Optimize[multiplication_restricted @ + vigesimal @ del_zero @ + del_plus]; + +## LEAVES transducer. + +nonterm = "+" | "*"; +export LEAVES = Optimize[CDRewrite["(" nonterm " " | " " nonterm ")" : "", + "", "", sigmastar]]; + +test7 = AssertEqual["(* (+ (* 4 20 *) 10 7 +) 1000 *)" @ LEAVES, + "4 20 10 7 1000"]; + +## Optional filter for repeated large powers of ten, to be applied to leaves. + +func Filter[expr, sigstar] { + return Optimize[sigstar - (sigstar expr sigstar)]; +} + +func FilterMoreThanOne[expr, sigstar] { + return Filter[expr " " (sigstar " ")? expr, sigstar]; +} + +filter_sigstar = (delta | " ")*; + +export REPEAT_FILTER = + Optimize[FilterMoreThanOne["1000", filter_sigstar] @ + FilterMoreThanOne["10000", filter_sigstar] @ + FilterMoreThanOne["100000", filter_sigstar] @ + FilterMoreThanOne["1000000", filter_sigstar] @ + FilterMoreThanOne["1000000000", filter_sigstar] @ + FilterMoreThanOne["1000000000000", filter_sigstar]]; + +test8_1 = AssertNull["50 1000 4 1000" @ REPEAT_FILTER]; +test8_2 = AssertNull["50 1000000 4 1000000" @ REPEAT_FILTER]; +test8_3 = AssertEqual["50 100 1000" @ REPEAT_FILTER, "50 100 1000"]; +test8_4 = AssertNull["20 1000 1000 20" @ REPEAT_FILTER]; +test8_5 = AssertEqual[ + "70 1000000 400 0 70 0 7 1000 100 0 70" @ REPEAT_FILTER, + "70 1000000 400 0 70 0 7 1000 100 0 70" @ REPEAT_FILTER]; +test8_6 = AssertNull[ + "70 1000000 400 0 70 1000 0 7 1000 100 0 70" @ REPEAT_FILTER]; + +# Filters to force the output of *inverting* the arithmetic as applied to a +# digit string to be a well-formed sexpr: + +not_space = b.kNotSpace; + +# Things like (+ 1 +)(+ 9 +). + +bad_parens = + sigmastar ")" not_space sigmastar + | sigmastar not_space "(" sigmastar +; + +no_bad_parens = sigmastar - bad_parens; + +# Things like (+ 1 +) or (* 3 *). + +spurious_operators = + sigmastar "(+ " delta+ " +)" sigmastar + | sigmastar "(* " delta+ " *)" sigmastar +; + +no_spurious_operators = sigmastar - spurious_operators; + +no_strings_of_zeros = + sigmastar - (sigmastar " " "0"+ " " "0"+ " " sigmastar) +; + +no_bad_sequences = + Optimize[no_bad_parens @ no_strings_of_zeros] +; + +export SEXP_FILTER = Optimize[ + ( delta+ + | "(* " no_bad_sequences " *)" + | "(+ " no_bad_sequences " +)") @ no_spurious_operators] +; + +# For convenience adds inverses of the arithmetic rules: + +export IARITHMETIC = Invert[ARITHMETIC]; + +export IARITHMETIC_RESTRICTED = Invert[ARITHMETIC_RESTRICTED]; + +export IARITHMETIC_BASIC = Invert[ARITHMETIC_BASIC]; + +export IARITHMETIC_BASIC_RESTRICTED = Invert[ARITHMETIC_BASIC_RESTRICTED]; + +export IARITHMETIC_GERMANIC = Invert[ARITHMETIC_GERMANIC]; + +export IARITHMETIC_GERMANIC_RESTRICTED = + Invert[ARITHMETIC_GERMANIC_RESTRICTED] +; + +export IARITHMETIC_VIGESIMAL = Invert[ARITHMETIC_VIGESIMAL]; + +export IARITHMETIC_VIGESIMAL_RESTRICTED = + Invert[ARITHMETIC_VIGESIMAL_RESTRICTED] +; + +## This should be applied on the lefthand side of FG to ensure that the only +## digit input nis permitted. +export DELTA_STAR = deltastar; diff --git a/third_party/chinese_text_normalization/thrax/src/util/byte.grm b/third_party/chinese_text_normalization/thrax/src/util/byte.grm new file mode 100644 index 000000000..32e6ead75 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/util/byte.grm @@ -0,0 +1,75 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Standard constants for ASCII (byte) based strings. This mirrors the +# functions provided by C/C++'s ctype.h library. + +# Note that [0] is missing; matching the string-termination character is kinda weird. +export kBytes = Optimize[ + "[1]" | "[2]" | "[3]" | "[4]" | "[5]" | "[6]" | "[7]" | "[8]" | "[9]" | "[10]" | + "[11]" | "[12]" | "[13]" | "[14]" | "[15]" | "[16]" | "[17]" | "[18]" | "[19]" | "[20]" | + "[21]" | "[22]" | "[23]" | "[24]" | "[25]" | "[26]" | "[27]" | "[28]" | "[29]" | "[30]" | + "[31]" | "[32]" | "[33]" | "[34]" | "[35]" | "[36]" | "[37]" | "[38]" | "[39]" | "[40]" | + "[41]" | "[42]" | "[43]" | "[44]" | "[45]" | "[46]" | "[47]" | "[48]" | "[49]" | "[50]" | + "[51]" | "[52]" | "[53]" | "[54]" | "[55]" | "[56]" | "[57]" | "[58]" | "[59]" | "[60]" | + "[61]" | "[62]" | "[63]" | "[64]" | "[65]" | "[66]" | "[67]" | "[68]" | "[69]" | "[70]" | + "[71]" | "[72]" | "[73]" | "[74]" | "[75]" | "[76]" | "[77]" | "[78]" | "[79]" | "[80]" | + "[81]" | "[82]" | "[83]" | "[84]" | "[85]" | "[86]" | "[87]" | "[88]" | "[89]" | "[90]" | + "[91]" | "[92]" | "[93]" | "[94]" | "[95]" | "[96]" | "[97]" | "[98]" | "[99]" | "[100]" | +"[101]" | "[102]" | "[103]" | "[104]" | "[105]" | "[106]" | "[107]" | "[108]" | "[109]" | "[110]" | +"[111]" | "[112]" | "[113]" | "[114]" | "[115]" | "[116]" | "[117]" | "[118]" | "[119]" | "[120]" | +"[121]" | "[122]" | "[123]" | "[124]" | "[125]" | "[126]" | "[127]" | "[128]" | "[129]" | "[130]" | +"[131]" | "[132]" | "[133]" | "[134]" | "[135]" | "[136]" | "[137]" | "[138]" | "[139]" | "[140]" | +"[141]" | "[142]" | "[143]" | "[144]" | "[145]" | "[146]" | "[147]" | "[148]" | "[149]" | "[150]" | +"[151]" | "[152]" | "[153]" | "[154]" | "[155]" | "[156]" | "[157]" | "[158]" | "[159]" | "[160]" | +"[161]" | "[162]" | "[163]" | "[164]" | "[165]" | "[166]" | "[167]" | "[168]" | "[169]" | "[170]" | +"[171]" | "[172]" | "[173]" | "[174]" | "[175]" | "[176]" | "[177]" | "[178]" | "[179]" | "[180]" | +"[181]" | "[182]" | "[183]" | "[184]" | "[185]" | "[186]" | "[187]" | "[188]" | "[189]" | "[190]" | +"[191]" | "[192]" | "[193]" | "[194]" | "[195]" | "[196]" | "[197]" | "[198]" | "[199]" | "[200]" | +"[201]" | "[202]" | "[203]" | "[204]" | "[205]" | "[206]" | "[207]" | "[208]" | "[209]" | "[210]" | +"[211]" | "[212]" | "[213]" | "[214]" | "[215]" | "[216]" | "[217]" | "[218]" | "[219]" | "[220]" | +"[221]" | "[222]" | "[223]" | "[224]" | "[225]" | "[226]" | "[227]" | "[228]" | "[229]" | "[230]" | +"[231]" | "[232]" | "[233]" | "[234]" | "[235]" | "[236]" | "[237]" | "[238]" | "[239]" | "[240]" | +"[241]" | "[242]" | "[243]" | "[244]" | "[245]" | "[246]" | "[247]" | "[248]" | "[249]" | "[250]" | +"[251]" | "[252]" | "[253]" | "[254]" | "[255]" +]; + +export kDigit = Optimize[ + "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" +]; + +export kLower = Optimize[ + "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | + "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z" +]; +export kUpper = Optimize[ + "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | + "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" +]; +export kAlpha = Optimize[kLower | kUpper]; + +export kAlnum = Optimize[kDigit | kAlpha]; + +export kSpace = Optimize[ + " " | "\t" | "\n" | "\r" +]; +export kNotSpace = Optimize[kBytes - kSpace]; + +export kPunct = Optimize[ + "!" | "\"" | "#" | "$" | "%" | "&" | "'" | "(" | ")" | "*" | "+" | "," | + "-" | "." | "/" | ":" | ";" | "<" | "=" | ">" | "?" | "@" | "\[" | "\\" | + "\]" | "^" | "_" | "`" | "{" | "|" | "}" | "~" +]; + +export kGraph = Optimize[kAlnum | kPunct]; diff --git a/third_party/chinese_text_normalization/thrax/src/util/case.grm b/third_party/chinese_text_normalization/thrax/src/util/case.grm new file mode 100644 index 000000000..ff10354b7 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/util/case.grm @@ -0,0 +1,3383 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Case-conversion functions. + +import 'util/byte.grm' as bytelib; + +export UPPER = + "A" + | "B" + | "C" + | "D" + | "E" + | "F" + | "G" + | "H" + | "I" + | "J" + | "K" + | "L" + | "M" + | "N" + | "O" + | "P" + | "Q" + | "R" + | "S" + | "T" + | "U" + | "V" + | "W" + | "X" + | "Y" + | "Z" + | "À" + | "Á" + | "Â" + | "Ã" + | "Ä" + | "Å" + | "Æ" + | "Ç" + | "È" + | "É" + | "Ê" + | "Ë" + | "Ì" + | "Í" + | "Î" + | "Ï" + | "Ð" + | "Ñ" + | "Ò" + | "Ó" + | "Ô" + | "Õ" + | "Ö" + | "Ø" + | "Ù" + | "Ú" + | "Û" + | "Ü" + | "Ý" + | "Þ" + | "Ā" + | "Ă" + | "Ą" + | "Ć" + | "Ĉ" + | "Ċ" + | "Č" + | "Ď" + | "Đ" + | "Ē" + | "Ĕ" + | "Ė" + | "Ę" + | "Ě" + | "Ĝ" + | "Ğ" + | "Ġ" + | "Ģ" + | "Ĥ" + | "Ħ" + | "Ĩ" + | "Ī" + | "Ĭ" + | "Į" + | "İ" + | "IJ" + | "Ĵ" + | "Ķ" + | "Ĺ" + | "Ļ" + | "Ľ" + | "Ŀ" + | "Ł" + | "Ń" + | "Ņ" + | "Ň" + | "Ŋ" + | "Ō" + | "Ŏ" + | "Ő" + | "Œ" + | "Ŕ" + | "Ŗ" + | "Ř" + | "Ś" + | "Ŝ" + | "Ş" + | "Š" + | "Ţ" + | "Ť" + | "Ŧ" + | "Ũ" + | "Ū" + | "Ŭ" + | "Ů" + | "Ű" + | "Ų" + | "Ŵ" + | "Ŷ" + | "Ÿ" + | "Ź" + | "Ż" + | "Ž" + | "Ɓ" + | "Ƃ" + | "Ƅ" + | "Ɔ" + | "Ƈ" + | "Ɖ" + | "Ɗ" + | "Ƌ" + | "Ǝ" + | "Ə" + | "Ɛ" + | "Ƒ" + | "Ɠ" + | "Ɣ" + | "Ɩ" + | "Ɨ" + | "Ƙ" + | "Ɯ" + | "Ɲ" + | "Ɵ" + | "Ơ" + | "Ƣ" + | "Ƥ" + | "Ƨ" + | "Ʃ" + | "Ƭ" + | "Ʈ" + | "Ư" + | "Ʊ" + | "Ʋ" + | "Ƴ" + | "Ƶ" + | "Ʒ" + | "Ƹ" + | "Ƽ" + | "DŽ" + | "Dž" + | "LJ" + | "Lj" + | "NJ" + | "Nj" + | "Ǎ" + | "Ǐ" + | "Ǒ" + | "Ǔ" + | "Ǖ" + | "Ǘ" + | "Ǚ" + | "Ǜ" + | "Ǟ" + | "Ǡ" + | "Ǣ" + | "Ǥ" + | "Ǧ" + | "Ǩ" + | "Ǫ" + | "Ǭ" + | "Ǯ" + | "DZ" + | "Dz" + | "Ǵ" + | "Ƕ" + | "Ƿ" + | "Ǹ" + | "Ǻ" + | "Ǽ" + | "Ǿ" + | "Ȁ" + | "Ȃ" + | "Ȅ" + | "Ȇ" + | "Ȉ" + | "Ȋ" + | "Ȍ" + | "Ȏ" + | "Ȑ" + | "Ȓ" + | "Ȕ" + | "Ȗ" + | "Ș" + | "Ț" + | "Ȝ" + | "Ȟ" + | "Ƞ" + | "Ȣ" + | "Ȥ" + | "Ȧ" + | "Ȩ" + | "Ȫ" + | "Ȭ" + | "Ȯ" + | "Ȱ" + | "Ȳ" + | "Ȼ" + | "Ƚ" + | "Ɂ" + | "Ά" + | "Έ" + | "Ή" + | "Ί" + | "Ό" + | "Ύ" + | "Ώ" + | "Α" + | "Β" + | "Γ" + | "Δ" + | "Ε" + | "Ζ" + | "Η" + | "Θ" + | "Ι" + | "Κ" + | "Λ" + | "Μ" + | "Ν" + | "Ξ" + | "Ο" + | "Π" + | "Ρ" + | "Σ" + | "Τ" + | "Υ" + | "Φ" + | "Χ" + | "Ψ" + | "Ω" + | "Ϊ" + | "Ϋ" + | "Ϣ" + | "Ϥ" + | "Ϧ" + | "Ϩ" + | "Ϫ" + | "Ϭ" + | "Ϯ" + | "ϴ" + | "Ϸ" + | "Ϲ" + | "Ϻ" + | "Ѐ" + | "Ё" + | "Ђ" + | "Ѓ" + | "Є" + | "Ѕ" + | "І" + | "Ї" + | "Ј" + | "Љ" + | "Њ" + | "Ћ" + | "Ќ" + | "Ѝ" + | "Ў" + | "Џ" + | "А" + | "Б" + | "В" + | "Г" + | "Д" + | "Е" + | "Ж" + | "З" + | "И" + | "Й" + | "К" + | "Л" + | "М" + | "Н" + | "О" + | "П" + | "Р" + | "С" + | "Т" + | "У" + | "Ф" + | "Х" + | "Ц" + | "Ч" + | "Ш" + | "Щ" + | "Ъ" + | "Ы" + | "Ь" + | "Э" + | "Ю" + | "Я" + | "Ѡ" + | "Ѣ" + | "Ѥ" + | "Ѧ" + | "Ѩ" + | "Ѫ" + | "Ѭ" + | "Ѯ" + | "Ѱ" + | "Ѳ" + | "Ѵ" + | "Ѷ" + | "Ѹ" + | "Ѻ" + | "Ѽ" + | "Ѿ" + | "Ҁ" + | "Ҋ" + | "Ҍ" + | "Ҏ" + | "Ґ" + | "Ғ" + | "Ҕ" + | "Җ" + | "Ҙ" + | "Қ" + | "Ҝ" + | "Ҟ" + | "Ҡ" + | "Ң" + | "Ҥ" + | "Ҧ" + | "Ҩ" + | "Ҫ" + | "Ҭ" + | "Ү" + | "Ұ" + | "Ҳ" + | "Ҵ" + | "Ҷ" + | "Ҹ" + | "Һ" + | "Ҽ" + | "Ҿ" + | "Ӂ" + | "Ӄ" + | "Ӆ" + | "Ӈ" + | "Ӊ" + | "Ӌ" + | "Ӎ" + | "Ӑ" + | "Ӓ" + | "Ӕ" + | "Ӗ" + | "Ә" + | "Ӛ" + | "Ӝ" + | "Ӟ" + | "Ӡ" + | "Ӣ" + | "Ӥ" + | "Ӧ" + | "Ө" + | "Ӫ" + | "Ӭ" + | "Ӯ" + | "Ӱ" + | "Ӳ" + | "Ӵ" + | "Ӷ" + | "Ӹ" + | "Ԁ" + | "Ԃ" + | "Ԅ" + | "Ԇ" + | "Ԉ" + | "Ԋ" + | "Ԍ" + | "Ԏ" + | "Ա" + | "Բ" + | "Գ" + | "Դ" + | "Ե" + | "Զ" + | "Է" + | "Ը" + | "Թ" + | "Ժ" + | "Ի" + | "Լ" + | "Խ" + | "Ծ" + | "Կ" + | "Հ" + | "Ձ" + | "Ղ" + | "Ճ" + | "Մ" + | "Յ" + | "Ն" + | "Շ" + | "Ո" + | "Չ" + | "Պ" + | "Ջ" + | "Ռ" + | "Ս" + | "Վ" + | "Տ" + | "Ր" + | "Ց" + | "Ւ" + | "Փ" + | "Ք" + | "Օ" + | "Ֆ" + | "Ⴀ" + | "Ⴁ" + | "Ⴂ" + | "Ⴃ" + | "Ⴄ" + | "Ⴅ" + | "Ⴆ" + | "Ⴇ" + | "Ⴈ" + | "Ⴉ" + | "Ⴊ" + | "Ⴋ" + | "Ⴌ" + | "Ⴍ" + | "Ⴎ" + | "Ⴏ" + | "Ⴐ" + | "Ⴑ" + | "Ⴒ" + | "Ⴓ" + | "Ⴔ" + | "Ⴕ" + | "Ⴖ" + | "Ⴗ" + | "Ⴘ" + | "Ⴙ" + | "Ⴚ" + | "Ⴛ" + | "Ⴜ" + | "Ⴝ" + | "Ⴞ" + | "Ⴟ" + | "Ⴠ" + | "Ⴡ" + | "Ⴢ" + | "Ⴣ" + | "Ⴤ" + | "Ⴥ" + | "Ḁ" + | "Ḃ" + | "Ḅ" + | "Ḇ" + | "Ḉ" + | "Ḋ" + | "Ḍ" + | "Ḏ" + | "Ḑ" + | "Ḓ" + | "Ḕ" + | "Ḗ" + | "Ḙ" + | "Ḛ" + | "Ḝ" + | "Ḟ" + | "Ḡ" + | "Ḣ" + | "Ḥ" + | "Ḧ" + | "Ḩ" + | "Ḫ" + | "Ḭ" + | "Ḯ" + | "Ḱ" + | "Ḳ" + | "Ḵ" + | "Ḷ" + | "Ḹ" + | "Ḻ" + | "Ḽ" + | "Ḿ" + | "Ṁ" + | "Ṃ" + | "Ṅ" + | "Ṇ" + | "Ṉ" + | "Ṋ" + | "Ṍ" + | "Ṏ" + | "Ṑ" + | "Ṓ" + | "Ṕ" + | "Ṗ" + | "Ṙ" + | "Ṛ" + | "Ṝ" + | "Ṟ" + | "Ṡ" + | "Ṣ" + | "Ṥ" + | "Ṧ" + | "Ṩ" + | "Ṫ" + | "Ṭ" + | "Ṯ" + | "Ṱ" + | "Ṳ" + | "Ṵ" + | "Ṷ" + | "Ṹ" + | "Ṻ" + | "Ṽ" + | "Ṿ" + | "Ẁ" + | "Ẃ" + | "Ẅ" + | "Ẇ" + | "Ẉ" + | "Ẋ" + | "Ẍ" + | "Ẏ" + | "Ẑ" + | "Ẓ" + | "Ẕ" + | "Ạ" + | "Ả" + | "Ấ" + | "Ầ" + | "Ẩ" + | "Ẫ" + | "Ậ" + | "Ắ" + | "Ằ" + | "Ẳ" + | "Ẵ" + | "Ặ" + | "Ẹ" + | "Ẻ" + | "Ẽ" + | "Ế" + | "Ề" + | "Ể" + | "Ễ" + | "Ệ" + | "Ỉ" + | "Ị" + | "Ọ" + | "Ỏ" + | "Ố" + | "Ồ" + | "Ổ" + | "Ỗ" + | "Ộ" + | "Ớ" + | "Ờ" + | "Ở" + | "Ỡ" + | "Ợ" + | "Ụ" + | "Ủ" + | "Ứ" + | "Ừ" + | "Ử" + | "Ữ" + | "Ự" + | "Ỳ" + | "Ỵ" + | "Ỷ" + | "Ỹ" + | "Ἀ" + | "Ἁ" + | "Ἂ" + | "Ἃ" + | "Ἄ" + | "Ἅ" + | "Ἆ" + | "Ἇ" + | "Ἐ" + | "Ἑ" + | "Ἒ" + | "Ἓ" + | "Ἔ" + | "Ἕ" + | "Ἠ" + | "Ἡ" + | "Ἢ" + | "Ἣ" + | "Ἤ" + | "Ἥ" + | "Ἦ" + | "Ἧ" + | "Ἰ" + | "Ἱ" + | "Ἲ" + | "Ἳ" + | "Ἴ" + | "Ἵ" + | "Ἶ" + | "Ἷ" + | "Ὀ" + | "Ὁ" + | "Ὂ" + | "Ὃ" + | "Ὄ" + | "Ὅ" + | "Ὑ" + | "Ὓ" + | "Ὕ" + | "Ὗ" + | "Ὠ" + | "Ὡ" + | "Ὢ" + | "Ὣ" + | "Ὤ" + | "Ὥ" + | "Ὦ" + | "Ὧ" + | "ᾈ" + | "ᾉ" + | "ᾊ" + | "ᾋ" + | "ᾌ" + | "ᾍ" + | "ᾎ" + | "ᾏ" + | "ᾘ" + | "ᾙ" + | "ᾚ" + | "ᾛ" + | "ᾜ" + | "ᾝ" + | "ᾞ" + | "ᾟ" + | "ᾨ" + | "ᾩ" + | "ᾪ" + | "ᾫ" + | "ᾬ" + | "ᾭ" + | "ᾮ" + | "ᾯ" + | "Ᾰ" + | "Ᾱ" + | "Ὰ" + | "Ά" + | "ᾼ" + | "Ὲ" + | "Έ" + | "Ὴ" + | "Ή" + | "ῌ" + | "Ῐ" + | "Ῑ" + | "Ὶ" + | "Ί" + | "Ῠ" + | "Ῡ" + | "Ὺ" + | "Ύ" + | "Ῥ" + | "Ὸ" + | "Ό" + | "Ὼ" + | "Ώ" + | "ῼ" + | "Ⓐ" + | "Ⓑ" + | "Ⓒ" + | "Ⓓ" + | "Ⓔ" + | "Ⓕ" + | "Ⓖ" + | "Ⓗ" + | "Ⓘ" + | "Ⓙ" + | "Ⓚ" + | "Ⓛ" + | "Ⓜ" + | "Ⓝ" + | "Ⓞ" + | "Ⓟ" + | "Ⓠ" + | "Ⓡ" + | "Ⓢ" + | "Ⓣ" + | "Ⓤ" + | "Ⓥ" + | "Ⓦ" + | "Ⓧ" + | "Ⓨ" + | "Ⓩ" + | "Ⰰ" + | "Ⰱ" + | "Ⰲ" + | "Ⰳ" + | "Ⰴ" + | "Ⰵ" + | "Ⰶ" + | "Ⰷ" + | "Ⰸ" + | "Ⰹ" + | "Ⰺ" + | "Ⰻ" + | "Ⰼ" + | "Ⰽ" + | "Ⰾ" + | "Ⰿ" + | "Ⱀ" + | "Ⱁ" + | "Ⱂ" + | "Ⱃ" + | "Ⱄ" + | "Ⱅ" + | "Ⱆ" + | "Ⱇ" + | "Ⱈ" + | "Ⱉ" + | "Ⱊ" + | "Ⱋ" + | "Ⱌ" + | "Ⱍ" + | "Ⱎ" + | "Ⱏ" + | "Ⱐ" + | "Ⱑ" + | "Ⱒ" + | "Ⱓ" + | "Ⱔ" + | "Ⱕ" + | "Ⱖ" + | "Ⱗ" + | "Ⱘ" + | "Ⱙ" + | "Ⱚ" + | "Ⱛ" + | "Ⱜ" + | "Ⱝ" + | "Ⱞ" + | "Ⲁ" + | "Ⲃ" + | "Ⲅ" + | "Ⲇ" + | "Ⲉ" + | "Ⲋ" + | "Ⲍ" + | "Ⲏ" + | "Ⲑ" + | "Ⲓ" + | "Ⲕ" + | "Ⲗ" + | "Ⲙ" + | "Ⲛ" + | "Ⲝ" + | "Ⲟ" + | "Ⲡ" + | "Ⲣ" + | "Ⲥ" + | "Ⲧ" + | "Ⲩ" + | "Ⲫ" + | "Ⲭ" + | "Ⲯ" + | "Ⲱ" + | "Ⲳ" + | "Ⲵ" + | "Ⲷ" + | "Ⲹ" + | "Ⲻ" + | "Ⲽ" + | "Ⲿ" + | "Ⳁ" + | "Ⳃ" + | "Ⳅ" + | "Ⳇ" + | "Ⳉ" + | "Ⳋ" + | "Ⳍ" + | "Ⳏ" + | "Ⳑ" + | "Ⳓ" + | "Ⳕ" + | "Ⳗ" + | "Ⳙ" + | "Ⳛ" + | "Ⳝ" + | "Ⳟ" + | "Ⳡ" + | "Ⳣ" + | "A" + | "B" + | "C" + | "D" + | "E" + | "F" + | "G" + | "H" + | "I" + | "J" + | "K" + | "L" + | "M" + | "N" + | "O" + | "P" + | "Q" + | "R" + | "S" + | "T" + | "U" + | "V" + | "W" + | "X" + | "Y" + | "Z" +; + +export LOWER = + "a" + | "b" + | "c" + | "d" + | "e" + | "f" + | "g" + | "h" + | "i" + | "j" + | "k" + | "l" + | "m" + | "n" + | "o" + | "p" + | "q" + | "r" + | "s" + | "t" + | "u" + | "v" + | "w" + | "x" + | "y" + | "z" + | "à" + | "á" + | "â" + | "ã" + | "ä" + | "å" + | "æ" + | "ç" + | "è" + | "é" + | "ê" + | "ë" + | "ì" + | "í" + | "î" + | "ï" + | "ð" + | "ñ" + | "ò" + | "ó" + | "ô" + | "õ" + | "ö" + | "ø" + | "ù" + | "ú" + | "û" + | "ü" + | "ý" + | "þ" + | "ā" + | "ă" + | "ą" + | "ć" + | "ĉ" + | "ċ" + | "č" + | "ď" + | "đ" + | "ē" + | "ĕ" + | "ė" + | "ę" + | "ě" + | "ĝ" + | "ğ" + | "ġ" + | "ģ" + | "ĥ" + | "ħ" + | "ĩ" + | "ī" + | "ĭ" + | "į" + | "i" + | "ij" + | "ĵ" + | "ķ" + | "ĺ" + | "ļ" + | "ľ" + | "ŀ" + | "ł" + | "ń" + | "ņ" + | "ň" + | "ŋ" + | "ō" + | "ŏ" + | "ő" + | "œ" + | "ŕ" + | "ŗ" + | "ř" + | "ś" + | "ŝ" + | "ş" + | "ß" + | "š" + | "ţ" + | "ť" + | "ŧ" + | "ũ" + | "ū" + | "ŭ" + | "ů" + | "ű" + | "ų" + | "ŵ" + | "ŷ" + | "ÿ" + | "ź" + | "ż" + | "ž" + | "ɓ" + | "ƃ" + | "ƅ" + | "ɔ" + | "ƈ" + | "ɖ" + | "ɗ" + | "ƌ" + | "ǝ" + | "ə" + | "ɛ" + | "ƒ" + | "ɠ" + | "ɣ" + | "ɩ" + | "ɨ" + | "ƙ" + | "ɯ" + | "ɲ" + | "ɵ" + | "ơ" + | "ƣ" + | "ƥ" + | "ƨ" + | "ʃ" + | "ƭ" + | "ʈ" + | "ư" + | "ʊ" + | "ʋ" + | "ƴ" + | "ƶ" + | "ʒ" + | "ƹ" + | "ƽ" + | "dž" + | "dž" + | "lj" + | "lj" + | "nj" + | "nj" + | "ǎ" + | "ǐ" + | "ǒ" + | "ǔ" + | "ǖ" + | "ǘ" + | "ǚ" + | "ǜ" + | "ǟ" + | "ǡ" + | "ǣ" + | "ǥ" + | "ǧ" + | "ǩ" + | "ǫ" + | "ǭ" + | "ǯ" + | "dz" + | "dz" + | "ǵ" + | "ƕ" + | "ƿ" + | "ǹ" + | "ǻ" + | "ǽ" + | "ǿ" + | "ȁ" + | "ȃ" + | "ȅ" + | "ȇ" + | "ȉ" + | "ȋ" + | "ȍ" + | "ȏ" + | "ȑ" + | "ȓ" + | "ȕ" + | "ȗ" + | "ș" + | "ț" + | "ȝ" + | "ȟ" + | "ƞ" + | "ȣ" + | "ȥ" + | "ȧ" + | "ȩ" + | "ȫ" + | "ȭ" + | "ȯ" + | "ȱ" + | "ȳ" + | "ȼ" + | "ƚ" + | "ʔ" + | "ά" + | "έ" + | "ή" + | "ί" + | "ό" + | "ύ" + | "ώ" + | "α" + | "β" + | "γ" + | "δ" + | "ε" + | "ζ" + | "η" + | "θ" + | "ι" + | "κ" + | "λ" + | "μ" + | "ν" + | "ξ" + | "ο" + | "π" + | "ρ" + | "σ" + | "ς" + | "τ" + | "υ" + | "φ" + | "χ" + | "ψ" + | "ω" + | "ϊ" + | "ϋ" + | "ϣ" + | "ϥ" + | "ϧ" + | "ϩ" + | "ϫ" + | "ϭ" + | "ϯ" + | "θ" + | "ϸ" + | "ϲ" + | "ϻ" + | "ѐ" + | "ё" + | "ђ" + | "ѓ" + | "є" + | "ѕ" + | "і" + | "ї" + | "ј" + | "љ" + | "њ" + | "ћ" + | "ќ" + | "ѝ" + | "ў" + | "џ" + | "а" + | "б" + | "в" + | "г" + | "д" + | "е" + | "ж" + | "з" + | "и" + | "й" + | "к" + | "л" + | "м" + | "н" + | "о" + | "п" + | "р" + | "с" + | "т" + | "у" + | "ф" + | "х" + | "ц" + | "ч" + | "ш" + | "щ" + | "ъ" + | "ы" + | "ь" + | "э" + | "ю" + | "я" + | "ѡ" + | "ѣ" + | "ѥ" + | "ѧ" + | "ѩ" + | "ѫ" + | "ѭ" + | "ѯ" + | "ѱ" + | "ѳ" + | "ѵ" + | "ѷ" + | "ѹ" + | "ѻ" + | "ѽ" + | "ѿ" + | "ҁ" + | "ҋ" + | "ҍ" + | "ҏ" + | "ґ" + | "ғ" + | "ҕ" + | "җ" + | "ҙ" + | "қ" + | "ҝ" + | "ҟ" + | "ҡ" + | "ң" + | "ҥ" + | "ҧ" + | "ҩ" + | "ҫ" + | "ҭ" + | "ү" + | "ұ" + | "ҳ" + | "ҵ" + | "ҷ" + | "ҹ" + | "һ" + | "ҽ" + | "ҿ" + | "ӂ" + | "ӄ" + | "ӆ" + | "ӈ" + | "ӊ" + | "ӌ" + | "ӎ" + | "ӑ" + | "ӓ" + | "ӕ" + | "ӗ" + | "ә" + | "ӛ" + | "ӝ" + | "ӟ" + | "ӡ" + | "ӣ" + | "ӥ" + | "ӧ" + | "ө" + | "ӫ" + | "ӭ" + | "ӯ" + | "ӱ" + | "ӳ" + | "ӵ" + | "ӷ" + | "ӹ" + | "ԁ" + | "ԃ" + | "ԅ" + | "ԇ" + | "ԉ" + | "ԋ" + | "ԍ" + | "ԏ" + | "ա" + | "բ" + | "գ" + | "դ" + | "ե" + | "զ" + | "է" + | "ը" + | "թ" + | "ժ" + | "ի" + | "լ" + | "խ" + | "ծ" + | "կ" + | "հ" + | "ձ" + | "ղ" + | "ճ" + | "մ" + | "յ" + | "ն" + | "շ" + | "ո" + | "չ" + | "պ" + | "ջ" + | "ռ" + | "ս" + | "վ" + | "տ" + | "ր" + | "ց" + | "ւ" + | "փ" + | "ք" + | "օ" + | "ֆ" + | "ⴀ" + | "ⴁ" + | "ⴂ" + | "ⴃ" + | "ⴄ" + | "ⴅ" + | "ⴆ" + | "ⴇ" + | "ⴈ" + | "ⴉ" + | "ⴊ" + | "ⴋ" + | "ⴌ" + | "ⴍ" + | "ⴎ" + | "ⴏ" + | "ⴐ" + | "ⴑ" + | "ⴒ" + | "ⴓ" + | "ⴔ" + | "ⴕ" + | "ⴖ" + | "ⴗ" + | "ⴘ" + | "ⴙ" + | "ⴚ" + | "ⴛ" + | "ⴜ" + | "ⴝ" + | "ⴞ" + | "ⴟ" + | "ⴠ" + | "ⴡ" + | "ⴢ" + | "ⴣ" + | "ⴤ" + | "ⴥ" + | "ḁ" + | "ḃ" + | "ḅ" + | "ḇ" + | "ḉ" + | "ḋ" + | "ḍ" + | "ḏ" + | "ḑ" + | "ḓ" + | "ḕ" + | "ḗ" + | "ḙ" + | "ḛ" + | "ḝ" + | "ḟ" + | "ḡ" + | "ḣ" + | "ḥ" + | "ḧ" + | "ḩ" + | "ḫ" + | "ḭ" + | "ḯ" + | "ḱ" + | "ḳ" + | "ḵ" + | "ḷ" + | "ḹ" + | "ḻ" + | "ḽ" + | "ḿ" + | "ṁ" + | "ṃ" + | "ṅ" + | "ṇ" + | "ṉ" + | "ṋ" + | "ṍ" + | "ṏ" + | "ṑ" + | "ṓ" + | "ṕ" + | "ṗ" + | "ṙ" + | "ṛ" + | "ṝ" + | "ṟ" + | "ṡ" + | "ṣ" + | "ṥ" + | "ṧ" + | "ṩ" + | "ṫ" + | "ṭ" + | "ṯ" + | "ṱ" + | "ṳ" + | "ṵ" + | "ṷ" + | "ṹ" + | "ṻ" + | "ṽ" + | "ṿ" + | "ẁ" + | "ẃ" + | "ẅ" + | "ẇ" + | "ẉ" + | "ẋ" + | "ẍ" + | "ẏ" + | "ẑ" + | "ẓ" + | "ẕ" + | "ạ" + | "ả" + | "ấ" + | "ầ" + | "ẩ" + | "ẫ" + | "ậ" + | "ắ" + | "ằ" + | "ẳ" + | "ẵ" + | "ặ" + | "ẹ" + | "ẻ" + | "ẽ" + | "ế" + | "ề" + | "ể" + | "ễ" + | "ệ" + | "ỉ" + | "ị" + | "ọ" + | "ỏ" + | "ố" + | "ồ" + | "ổ" + | "ỗ" + | "ộ" + | "ớ" + | "ờ" + | "ở" + | "ỡ" + | "ợ" + | "ụ" + | "ủ" + | "ứ" + | "ừ" + | "ử" + | "ữ" + | "ự" + | "ỳ" + | "ỵ" + | "ỷ" + | "ỹ" + | "ἀ" + | "ἁ" + | "ἂ" + | "ἃ" + | "ἄ" + | "ἅ" + | "ἆ" + | "ἇ" + | "ἐ" + | "ἑ" + | "ἒ" + | "ἓ" + | "ἔ" + | "ἕ" + | "ἠ" + | "ἡ" + | "ἢ" + | "ἣ" + | "ἤ" + | "ἥ" + | "ἦ" + | "ἧ" + | "ἰ" + | "ἱ" + | "ἲ" + | "ἳ" + | "ἴ" + | "ἵ" + | "ἶ" + | "ἷ" + | "ὀ" + | "ὁ" + | "ὂ" + | "ὃ" + | "ὄ" + | "ὅ" + | "ὑ" + | "ὓ" + | "ὕ" + | "ὗ" + | "ὠ" + | "ὡ" + | "ὢ" + | "ὣ" + | "ὤ" + | "ὥ" + | "ὦ" + | "ὧ" + | "ᾀ" + | "ᾁ" + | "ᾂ" + | "ᾃ" + | "ᾄ" + | "ᾅ" + | "ᾆ" + | "ᾇ" + | "ᾐ" + | "ᾑ" + | "ᾒ" + | "ᾓ" + | "ᾔ" + | "ᾕ" + | "ᾖ" + | "ᾗ" + | "ᾠ" + | "ᾡ" + | "ᾢ" + | "ᾣ" + | "ᾤ" + | "ᾥ" + | "ᾦ" + | "ᾧ" + | "ᾰ" + | "ᾱ" + | "ὰ" + | "ά" + | "ᾳ" + | "ὲ" + | "έ" + | "ὴ" + | "ή" + | "ῃ" + | "ῐ" + | "ῑ" + | "ὶ" + | "ί" + | "ῠ" + | "ῡ" + | "ὺ" + | "ύ" + | "ῥ" + | "ὸ" + | "ό" + | "ὼ" + | "ώ" + | "ῳ" + | "ⓐ" + | "ⓑ" + | "ⓒ" + | "ⓓ" + | "ⓔ" + | "ⓕ" + | "ⓖ" + | "ⓗ" + | "ⓘ" + | "ⓙ" + | "ⓚ" + | "ⓛ" + | "ⓜ" + | "ⓝ" + | "ⓞ" + | "ⓟ" + | "ⓠ" + | "ⓡ" + | "ⓢ" + | "ⓣ" + | "ⓤ" + | "ⓥ" + | "ⓦ" + | "ⓧ" + | "ⓨ" + | "ⓩ" + | "ⰰ" + | "ⰱ" + | "ⰲ" + | "ⰳ" + | "ⰴ" + | "ⰵ" + | "ⰶ" + | "ⰷ" + | "ⰸ" + | "ⰹ" + | "ⰺ" + | "ⰻ" + | "ⰼ" + | "ⰽ" + | "ⰾ" + | "ⰿ" + | "ⱀ" + | "ⱁ" + | "ⱂ" + | "ⱃ" + | "ⱄ" + | "ⱅ" + | "ⱆ" + | "ⱇ" + | "ⱈ" + | "ⱉ" + | "ⱊ" + | "ⱋ" + | "ⱌ" + | "ⱍ" + | "ⱎ" + | "ⱏ" + | "ⱐ" + | "ⱑ" + | "ⱒ" + | "ⱓ" + | "ⱔ" + | "ⱕ" + | "ⱖ" + | "ⱗ" + | "ⱘ" + | "ⱙ" + | "ⱚ" + | "ⱛ" + | "ⱜ" + | "ⱝ" + | "ⱞ" + | "ⲁ" + | "ⲃ" + | "ⲅ" + | "ⲇ" + | "ⲉ" + | "ⲋ" + | "ⲍ" + | "ⲏ" + | "ⲑ" + | "ⲓ" + | "ⲕ" + | "ⲗ" + | "ⲙ" + | "ⲛ" + | "ⲝ" + | "ⲟ" + | "ⲡ" + | "ⲣ" + | "ⲥ" + | "ⲧ" + | "ⲩ" + | "ⲫ" + | "ⲭ" + | "ⲯ" + | "ⲱ" + | "ⲳ" + | "ⲵ" + | "ⲷ" + | "ⲹ" + | "ⲻ" + | "ⲽ" + | "ⲿ" + | "ⳁ" + | "ⳃ" + | "ⳅ" + | "ⳇ" + | "ⳉ" + | "ⳋ" + | "ⳍ" + | "ⳏ" + | "ⳑ" + | "ⳓ" + | "ⳕ" + | "ⳗ" + | "ⳙ" + | "ⳛ" + | "ⳝ" + | "ⳟ" + | "ⳡ" + | "ⳣ" + | "a" + | "b" + | "c" + | "d" + | "e" + | "f" + | "g" + | "h" + | "i" + | "j" + | "k" + | "l" + | "m" + | "n" + | "o" + | "p" + | "q" + | "r" + | "s" + | "t" + | "u" + | "v" + | "w" + | "x" + | "y" + | "z" +; + +export toupper_deterministic = Determinize[ + ("a" : "A") + | ("b" : "B") + | ("c" : "C") + | ("d" : "D") + | ("e" : "E") + | ("f" : "F") + | ("g" : "G") + | ("h" : "H") + | ("i" : "I") + | ("j" : "J") + | ("k" : "K") + | ("l" : "L") + | ("m" : "M") + | ("n" : "N") + | ("o" : "O") + | ("p" : "P") + | ("q" : "Q") + | ("r" : "R") + | ("s" : "S") + | ("t" : "T") + | ("u" : "U") + | ("v" : "V") + | ("w" : "W") + | ("x" : "X") + | ("y" : "Y") + | ("z" : "Z") + | ("à" : "À") + | ("á" : "Á") + | ("â" : "Â") + | ("ã" : "Ã") + | ("ä" : "Ä") + | ("å" : "Å") + | ("æ" : "Æ") + | ("ç" : "Ç") + | ("è" : "È") + | ("é" : "É") + | ("ê" : "Ê") + | ("ë" : "Ë") + | ("ì" : "Ì") + | ("í" : "Í") + | ("î" : "Î") + | ("ï" : "Ï") + | ("ð" : "Ð") + | ("ñ" : "Ñ") + | ("ò" : "Ò") + | ("ó" : "Ó") + | ("ô" : "Ô") + | ("õ" : "Õ") + | ("ö" : "Ö") + | ("ø" : "Ø") + | ("ù" : "Ù") + | ("ú" : "Ú") + | ("û" : "Û") + | ("ü" : "Ü") + | ("ý" : "Ý") + | ("þ" : "Þ") + | ("ā" : "Ā") + | ("ă" : "Ă") + | ("ą" : "Ą") + | ("ć" : "Ć") + | ("ĉ" : "Ĉ") + | ("ċ" : "Ċ") + | ("č" : "Č") + | ("ď" : "Ď") + | ("đ" : "Đ") + | ("ē" : "Ē") + | ("ĕ" : "Ĕ") + | ("ė" : "Ė") + | ("ę" : "Ę") + | ("ě" : "Ě") + | ("ĝ" : "Ĝ") + | ("ğ" : "Ğ") + | ("ġ" : "Ġ") + | ("ģ" : "Ģ") + | ("ĥ" : "Ĥ") + | ("ħ" : "Ħ") + | ("ĩ" : "Ĩ") + | ("ī" : "Ī") + | ("ĭ" : "Ĭ") + | ("į" : "Į") + | ("ij" : "IJ") + | ("ĵ" : "Ĵ") + | ("ķ" : "Ķ") + | ("ĺ" : "Ĺ") + | ("ļ" : "Ļ") + | ("ľ" : "Ľ") + | ("ŀ" : "Ŀ") + | ("ł" : "Ł") + | ("ń" : "Ń") + | ("ņ" : "Ņ") + | ("ň" : "Ň") + | ("ŋ" : "Ŋ") + | ("ō" : "Ō") + | ("ŏ" : "Ŏ") + | ("ő" : "Ő") + | ("œ" : "Œ") + | ("ŕ" : "Ŕ") + | ("ŗ" : "Ŗ") + | ("ř" : "Ř") + | ("ś" : "Ś") + | ("ŝ" : "Ŝ") + | ("ş" : "Ş") + | ("š" : "Š") + | ("ţ" : "Ţ") + | ("ť" : "Ť") + | ("ŧ" : "Ŧ") + | ("ũ" : "Ũ") + | ("ū" : "Ū") + | ("ŭ" : "Ŭ") + | ("ů" : "Ů") + | ("ű" : "Ű") + | ("ų" : "Ų") + | ("ŵ" : "Ŵ") + | ("ŷ" : "Ŷ") + | ("ÿ" : "Ÿ") + | ("ź" : "Ź") + | ("ż" : "Ż") + | ("ž" : "Ž") + | ("ɓ" : "Ɓ") + | ("ƃ" : "Ƃ") + | ("ƅ" : "Ƅ") + | ("ɔ" : "Ɔ") + | ("ƈ" : "Ƈ") + | ("ɖ" : "Ɖ") + | ("ɗ" : "Ɗ") + | ("ƌ" : "Ƌ") + | ("ǝ" : "Ǝ") + | ("ə" : "Ə") + | ("ɛ" : "Ɛ") + | ("ƒ" : "Ƒ") + | ("ɠ" : "Ɠ") + | ("ɣ" : "Ɣ") + | ("ɩ" : "Ɩ") + | ("ɨ" : "Ɨ") + | ("ƙ" : "Ƙ") + | ("ɯ" : "Ɯ") + | ("ɲ" : "Ɲ") + | ("ɵ" : "Ɵ") + | ("ơ" : "Ơ") + | ("ƣ" : "Ƣ") + | ("ƥ" : "Ƥ") + | ("ƨ" : "Ƨ") + | ("ʃ" : "Ʃ") + | ("ƭ" : "Ƭ") + | ("ʈ" : "Ʈ") + | ("ư" : "Ư") + | ("ʊ" : "Ʊ") + | ("ʋ" : "Ʋ") + | ("ƴ" : "Ƴ") + | ("ƶ" : "Ƶ") + | ("ʒ" : "Ʒ") + | ("ƹ" : "Ƹ") + | ("ƽ" : "Ƽ") + | ("dž" : "DŽ") + | ("lj" : "LJ") + | ("nj" : "NJ") + | ("ǎ" : "Ǎ") + | ("ǐ" : "Ǐ") + | ("ǒ" : "Ǒ") + | ("ǔ" : "Ǔ") + | ("ǖ" : "Ǖ") + | ("ǘ" : "Ǘ") + | ("ǚ" : "Ǚ") + | ("ǜ" : "Ǜ") + | ("ǟ" : "Ǟ") + | ("ǡ" : "Ǡ") + | ("ǣ" : "Ǣ") + | ("ǥ" : "Ǥ") + | ("ǧ" : "Ǧ") + | ("ǩ" : "Ǩ") + | ("ǫ" : "Ǫ") + | ("ǭ" : "Ǭ") + | ("ǯ" : "Ǯ") + | ("dz" : "DZ") + | ("ǵ" : "Ǵ") + | ("ƕ" : "Ƕ") + | ("ƿ" : "Ƿ") + | ("ǹ" : "Ǹ") + | ("ǻ" : "Ǻ") + | ("ǽ" : "Ǽ") + | ("ǿ" : "Ǿ") + | ("ȁ" : "Ȁ") + | ("ȃ" : "Ȃ") + | ("ȅ" : "Ȅ") + | ("ȇ" : "Ȇ") + | ("ȉ" : "Ȉ") + | ("ȋ" : "Ȋ") + | ("ȍ" : "Ȍ") + | ("ȏ" : "Ȏ") + | ("ȑ" : "Ȑ") + | ("ȓ" : "Ȓ") + | ("ȕ" : "Ȕ") + | ("ȗ" : "Ȗ") + | ("ș" : "Ș") + | ("ț" : "Ț") + | ("ȝ" : "Ȝ") + | ("ȟ" : "Ȟ") + | ("ƞ" : "Ƞ") + | ("ȣ" : "Ȣ") + | ("ȥ" : "Ȥ") + | ("ȧ" : "Ȧ") + | ("ȩ" : "Ȩ") + | ("ȫ" : "Ȫ") + | ("ȭ" : "Ȭ") + | ("ȯ" : "Ȯ") + | ("ȱ" : "Ȱ") + | ("ȳ" : "Ȳ") + | ("ȼ" : "Ȼ") + | ("ƚ" : "Ƚ") + | ("ʔ" : "Ɂ") + | ("ά" : "Ά") + | ("έ" : "Έ") + | ("ή" : "Ή") + | ("ί" : "Ί") + | ("ό" : "Ό") + | ("ύ" : "Ύ") + | ("ώ" : "Ώ") + | ("α" : "Α") + | ("β" : "Β") + | ("γ" : "Γ") + | ("δ" : "Δ") + | ("ε" : "Ε") + | ("ζ" : "Ζ") + | ("η" : "Η") + | ("θ" : "Θ") + | ("ι" : "Ι") + | ("κ" : "Κ") + | ("λ" : "Λ") + | ("μ" : "Μ") + | ("ν" : "Ν") + | ("ξ" : "Ξ") + | ("ο" : "Ο") + | ("π" : "Π") + | ("ρ" : "Ρ") + | ("σ" : "Σ") + | ("τ" : "Τ") + | ("υ" : "Υ") + | ("φ" : "Φ") + | ("χ" : "Χ") + | ("ψ" : "Ψ") + | ("ω" : "Ω") + | ("ϊ" : "Ϊ") + | ("ϋ" : "Ϋ") + | ("ϣ" : "Ϣ") + | ("ϥ" : "Ϥ") + | ("ϧ" : "Ϧ") + | ("ϩ" : "Ϩ") + | ("ϫ" : "Ϫ") + | ("ϭ" : "Ϭ") + | ("ϯ" : "Ϯ") + | ("ϸ" : "Ϸ") + | ("ϲ" : "Ϲ") + | ("ϻ" : "Ϻ") + | ("ѐ" : "Ѐ") + | ("ё" : "Ё") + | ("ђ" : "Ђ") + | ("ѓ" : "Ѓ") + | ("є" : "Є") + | ("ѕ" : "Ѕ") + | ("і" : "І") + | ("ї" : "Ї") + | ("ј" : "Ј") + | ("љ" : "Љ") + | ("њ" : "Њ") + | ("ћ" : "Ћ") + | ("ќ" : "Ќ") + | ("ѝ" : "Ѝ") + | ("ў" : "Ў") + | ("џ" : "Џ") + | ("а" : "А") + | ("б" : "Б") + | ("в" : "В") + | ("г" : "Г") + | ("д" : "Д") + | ("е" : "Е") + | ("ж" : "Ж") + | ("з" : "З") + | ("и" : "И") + | ("й" : "Й") + | ("к" : "К") + | ("л" : "Л") + | ("м" : "М") + | ("н" : "Н") + | ("о" : "О") + | ("п" : "П") + | ("р" : "Р") + | ("с" : "С") + | ("т" : "Т") + | ("у" : "У") + | ("ф" : "Ф") + | ("х" : "Х") + | ("ц" : "Ц") + | ("ч" : "Ч") + | ("ш" : "Ш") + | ("щ" : "Щ") + | ("ъ" : "Ъ") + | ("ы" : "Ы") + | ("ь" : "Ь") + | ("э" : "Э") + | ("ю" : "Ю") + | ("я" : "Я") + | ("ѡ" : "Ѡ") + | ("ѣ" : "Ѣ") + | ("ѥ" : "Ѥ") + | ("ѧ" : "Ѧ") + | ("ѩ" : "Ѩ") + | ("ѫ" : "Ѫ") + | ("ѭ" : "Ѭ") + | ("ѯ" : "Ѯ") + | ("ѱ" : "Ѱ") + | ("ѳ" : "Ѳ") + | ("ѵ" : "Ѵ") + | ("ѷ" : "Ѷ") + | ("ѹ" : "Ѹ") + | ("ѻ" : "Ѻ") + | ("ѽ" : "Ѽ") + | ("ѿ" : "Ѿ") + | ("ҁ" : "Ҁ") + | ("ҋ" : "Ҋ") + | ("ҍ" : "Ҍ") + | ("ҏ" : "Ҏ") + | ("ґ" : "Ґ") + | ("ғ" : "Ғ") + | ("ҕ" : "Ҕ") + | ("җ" : "Җ") + | ("ҙ" : "Ҙ") + | ("қ" : "Қ") + | ("ҝ" : "Ҝ") + | ("ҟ" : "Ҟ") + | ("ҡ" : "Ҡ") + | ("ң" : "Ң") + | ("ҥ" : "Ҥ") + | ("ҧ" : "Ҧ") + | ("ҩ" : "Ҩ") + | ("ҫ" : "Ҫ") + | ("ҭ" : "Ҭ") + | ("ү" : "Ү") + | ("ұ" : "Ұ") + | ("ҳ" : "Ҳ") + | ("ҵ" : "Ҵ") + | ("ҷ" : "Ҷ") + | ("ҹ" : "Ҹ") + | ("һ" : "Һ") + | ("ҽ" : "Ҽ") + | ("ҿ" : "Ҿ") + | ("ӂ" : "Ӂ") + | ("ӄ" : "Ӄ") + | ("ӆ" : "Ӆ") + | ("ӈ" : "Ӈ") + | ("ӊ" : "Ӊ") + | ("ӌ" : "Ӌ") + | ("ӎ" : "Ӎ") + | ("ӑ" : "Ӑ") + | ("ӓ" : "Ӓ") + | ("ӕ" : "Ӕ") + | ("ӗ" : "Ӗ") + | ("ә" : "Ә") + | ("ӛ" : "Ӛ") + | ("ӝ" : "Ӝ") + | ("ӟ" : "Ӟ") + | ("ӡ" : "Ӡ") + | ("ӣ" : "Ӣ") + | ("ӥ" : "Ӥ") + | ("ӧ" : "Ӧ") + | ("ө" : "Ө") + | ("ӫ" : "Ӫ") + | ("ӭ" : "Ӭ") + | ("ӯ" : "Ӯ") + | ("ӱ" : "Ӱ") + | ("ӳ" : "Ӳ") + | ("ӵ" : "Ӵ") + | ("ӷ" : "Ӷ") + | ("ӹ" : "Ӹ") + | ("ԁ" : "Ԁ") + | ("ԃ" : "Ԃ") + | ("ԅ" : "Ԅ") + | ("ԇ" : "Ԇ") + | ("ԉ" : "Ԉ") + | ("ԋ" : "Ԋ") + | ("ԍ" : "Ԍ") + | ("ԏ" : "Ԏ") + | ("ա" : "Ա") + | ("բ" : "Բ") + | ("գ" : "Գ") + | ("դ" : "Դ") + | ("ե" : "Ե") + | ("զ" : "Զ") + | ("է" : "Է") + | ("ը" : "Ը") + | ("թ" : "Թ") + | ("ժ" : "Ժ") + | ("ի" : "Ի") + | ("լ" : "Լ") + | ("խ" : "Խ") + | ("ծ" : "Ծ") + | ("կ" : "Կ") + | ("հ" : "Հ") + | ("ձ" : "Ձ") + | ("ղ" : "Ղ") + | ("ճ" : "Ճ") + | ("մ" : "Մ") + | ("յ" : "Յ") + | ("ն" : "Ն") + | ("շ" : "Շ") + | ("ո" : "Ո") + | ("չ" : "Չ") + | ("պ" : "Պ") + | ("ջ" : "Ջ") + | ("ռ" : "Ռ") + | ("ս" : "Ս") + | ("վ" : "Վ") + | ("տ" : "Տ") + | ("ր" : "Ր") + | ("ց" : "Ց") + | ("ւ" : "Ւ") + | ("փ" : "Փ") + | ("ք" : "Ք") + | ("օ" : "Օ") + | ("ֆ" : "Ֆ") + | ("ⴀ" : "Ⴀ") + | ("ⴁ" : "Ⴁ") + | ("ⴂ" : "Ⴂ") + | ("ⴃ" : "Ⴃ") + | ("ⴄ" : "Ⴄ") + | ("ⴅ" : "Ⴅ") + | ("ⴆ" : "Ⴆ") + | ("ⴇ" : "Ⴇ") + | ("ⴈ" : "Ⴈ") + | ("ⴉ" : "Ⴉ") + | ("ⴊ" : "Ⴊ") + | ("ⴋ" : "Ⴋ") + | ("ⴌ" : "Ⴌ") + | ("ⴍ" : "Ⴍ") + | ("ⴎ" : "Ⴎ") + | ("ⴏ" : "Ⴏ") + | ("ⴐ" : "Ⴐ") + | ("ⴑ" : "Ⴑ") + | ("ⴒ" : "Ⴒ") + | ("ⴓ" : "Ⴓ") + | ("ⴔ" : "Ⴔ") + | ("ⴕ" : "Ⴕ") + | ("ⴖ" : "Ⴖ") + | ("ⴗ" : "Ⴗ") + | ("ⴘ" : "Ⴘ") + | ("ⴙ" : "Ⴙ") + | ("ⴚ" : "Ⴚ") + | ("ⴛ" : "Ⴛ") + | ("ⴜ" : "Ⴜ") + | ("ⴝ" : "Ⴝ") + | ("ⴞ" : "Ⴞ") + | ("ⴟ" : "Ⴟ") + | ("ⴠ" : "Ⴠ") + | ("ⴡ" : "Ⴡ") + | ("ⴢ" : "Ⴢ") + | ("ⴣ" : "Ⴣ") + | ("ⴤ" : "Ⴤ") + | ("ⴥ" : "Ⴥ") + | ("ḁ" : "Ḁ") + | ("ḃ" : "Ḃ") + | ("ḅ" : "Ḅ") + | ("ḇ" : "Ḇ") + | ("ḉ" : "Ḉ") + | ("ḋ" : "Ḋ") + | ("ḍ" : "Ḍ") + | ("ḏ" : "Ḏ") + | ("ḑ" : "Ḑ") + | ("ḓ" : "Ḓ") + | ("ḕ" : "Ḕ") + | ("ḗ" : "Ḗ") + | ("ḙ" : "Ḙ") + | ("ḛ" : "Ḛ") + | ("ḝ" : "Ḝ") + | ("ḟ" : "Ḟ") + | ("ḡ" : "Ḡ") + | ("ḣ" : "Ḣ") + | ("ḥ" : "Ḥ") + | ("ḧ" : "Ḧ") + | ("ḩ" : "Ḩ") + | ("ḫ" : "Ḫ") + | ("ḭ" : "Ḭ") + | ("ḯ" : "Ḯ") + | ("ḱ" : "Ḱ") + | ("ḳ" : "Ḳ") + | ("ḵ" : "Ḵ") + | ("ḷ" : "Ḷ") + | ("ḹ" : "Ḹ") + | ("ḻ" : "Ḻ") + | ("ḽ" : "Ḽ") + | ("ḿ" : "Ḿ") + | ("ṁ" : "Ṁ") + | ("ṃ" : "Ṃ") + | ("ṅ" : "Ṅ") + | ("ṇ" : "Ṇ") + | ("ṉ" : "Ṉ") + | ("ṋ" : "Ṋ") + | ("ṍ" : "Ṍ") + | ("ṏ" : "Ṏ") + | ("ṑ" : "Ṑ") + | ("ṓ" : "Ṓ") + | ("ṕ" : "Ṕ") + | ("ṗ" : "Ṗ") + | ("ṙ" : "Ṙ") + | ("ṛ" : "Ṛ") + | ("ṝ" : "Ṝ") + | ("ṟ" : "Ṟ") + | ("ṡ" : "Ṡ") + | ("ṣ" : "Ṣ") + | ("ṥ" : "Ṥ") + | ("ṧ" : "Ṧ") + | ("ṩ" : "Ṩ") + | ("ṫ" : "Ṫ") + | ("ṭ" : "Ṭ") + | ("ṯ" : "Ṯ") + | ("ṱ" : "Ṱ") + | ("ṳ" : "Ṳ") + | ("ṵ" : "Ṵ") + | ("ṷ" : "Ṷ") + | ("ṹ" : "Ṹ") + | ("ṻ" : "Ṻ") + | ("ṽ" : "Ṽ") + | ("ṿ" : "Ṿ") + | ("ẁ" : "Ẁ") + | ("ẃ" : "Ẃ") + | ("ẅ" : "Ẅ") + | ("ẇ" : "Ẇ") + | ("ẉ" : "Ẉ") + | ("ẋ" : "Ẋ") + | ("ẍ" : "Ẍ") + | ("ẏ" : "Ẏ") + | ("ẑ" : "Ẑ") + | ("ẓ" : "Ẓ") + | ("ẕ" : "Ẕ") + | ("ạ" : "Ạ") + | ("ả" : "Ả") + | ("ấ" : "Ấ") + | ("ầ" : "Ầ") + | ("ẩ" : "Ẩ") + | ("ẫ" : "Ẫ") + | ("ậ" : "Ậ") + | ("ắ" : "Ắ") + | ("ằ" : "Ằ") + | ("ẳ" : "Ẳ") + | ("ẵ" : "Ẵ") + | ("ặ" : "Ặ") + | ("ẹ" : "Ẹ") + | ("ẻ" : "Ẻ") + | ("ẽ" : "Ẽ") + | ("ế" : "Ế") + | ("ề" : "Ề") + | ("ể" : "Ể") + | ("ễ" : "Ễ") + | ("ệ" : "Ệ") + | ("ỉ" : "Ỉ") + | ("ị" : "Ị") + | ("ọ" : "Ọ") + | ("ỏ" : "Ỏ") + | ("ố" : "Ố") + | ("ồ" : "Ồ") + | ("ổ" : "Ổ") + | ("ỗ" : "Ỗ") + | ("ộ" : "Ộ") + | ("ớ" : "Ớ") + | ("ờ" : "Ờ") + | ("ở" : "Ở") + | ("ỡ" : "Ỡ") + | ("ợ" : "Ợ") + | ("ụ" : "Ụ") + | ("ủ" : "Ủ") + | ("ứ" : "Ứ") + | ("ừ" : "Ừ") + | ("ử" : "Ử") + | ("ữ" : "Ữ") + | ("ự" : "Ự") + | ("ỳ" : "Ỳ") + | ("ỵ" : "Ỵ") + | ("ỷ" : "Ỷ") + | ("ỹ" : "Ỹ") + | ("ἀ" : "Ἀ") + | ("ἁ" : "Ἁ") + | ("ἂ" : "Ἂ") + | ("ἃ" : "Ἃ") + | ("ἄ" : "Ἄ") + | ("ἅ" : "Ἅ") + | ("ἆ" : "Ἆ") + | ("ἇ" : "Ἇ") + | ("ἐ" : "Ἐ") + | ("ἑ" : "Ἑ") + | ("ἒ" : "Ἒ") + | ("ἓ" : "Ἓ") + | ("ἔ" : "Ἔ") + | ("ἕ" : "Ἕ") + | ("ἠ" : "Ἠ") + | ("ἡ" : "Ἡ") + | ("ἢ" : "Ἢ") + | ("ἣ" : "Ἣ") + | ("ἤ" : "Ἤ") + | ("ἥ" : "Ἥ") + | ("ἦ" : "Ἦ") + | ("ἧ" : "Ἧ") + | ("ἰ" : "Ἰ") + | ("ἱ" : "Ἱ") + | ("ἲ" : "Ἲ") + | ("ἳ" : "Ἳ") + | ("ἴ" : "Ἴ") + | ("ἵ" : "Ἵ") + | ("ἶ" : "Ἶ") + | ("ἷ" : "Ἷ") + | ("ὀ" : "Ὀ") + | ("ὁ" : "Ὁ") + | ("ὂ" : "Ὂ") + | ("ὃ" : "Ὃ") + | ("ὄ" : "Ὄ") + | ("ὅ" : "Ὅ") + | ("ὑ" : "Ὑ") + | ("ὓ" : "Ὓ") + | ("ὕ" : "Ὕ") + | ("ὗ" : "Ὗ") + | ("ὠ" : "Ὠ") + | ("ὡ" : "Ὡ") + | ("ὢ" : "Ὢ") + | ("ὣ" : "Ὣ") + | ("ὤ" : "Ὤ") + | ("ὥ" : "Ὥ") + | ("ὦ" : "Ὦ") + | ("ὧ" : "Ὧ") + | ("ᾀ" : "ᾈ") + | ("ᾁ" : "ᾉ") + | ("ᾂ" : "ᾊ") + | ("ᾃ" : "ᾋ") + | ("ᾄ" : "ᾌ") + | ("ᾅ" : "ᾍ") + | ("ᾆ" : "ᾎ") + | ("ᾇ" : "ᾏ") + | ("ᾐ" : "ᾘ") + | ("ᾑ" : "ᾙ") + | ("ᾒ" : "ᾚ") + | ("ᾓ" : "ᾛ") + | ("ᾔ" : "ᾜ") + | ("ᾕ" : "ᾝ") + | ("ᾖ" : "ᾞ") + | ("ᾗ" : "ᾟ") + | ("ᾠ" : "ᾨ") + | ("ᾡ" : "ᾩ") + | ("ᾢ" : "ᾪ") + | ("ᾣ" : "ᾫ") + | ("ᾤ" : "ᾬ") + | ("ᾥ" : "ᾭ") + | ("ᾦ" : "ᾮ") + | ("ᾧ" : "ᾯ") + | ("ᾰ" : "Ᾰ") + | ("ᾱ" : "Ᾱ") + | ("ὰ" : "Ὰ") + | ("ά" : "Ά") + | ("ᾳ" : "ᾼ") + | ("ὲ" : "Ὲ") + | ("έ" : "Έ") + | ("ὴ" : "Ὴ") + | ("ή" : "Ή") + | ("ῃ" : "ῌ") + | ("ῐ" : "Ῐ") + | ("ῑ" : "Ῑ") + | ("ὶ" : "Ὶ") + | ("ί" : "Ί") + | ("ῠ" : "Ῠ") + | ("ῡ" : "Ῡ") + | ("ὺ" : "Ὺ") + | ("ύ" : "Ύ") + | ("ῥ" : "Ῥ") + | ("ὸ" : "Ὸ") + | ("ό" : "Ό") + | ("ὼ" : "Ὼ") + | ("ώ" : "Ώ") + | ("ῳ" : "ῼ") + | ("ⓐ" : "Ⓐ") + | ("ⓑ" : "Ⓑ") + | ("ⓒ" : "Ⓒ") + | ("ⓓ" : "Ⓓ") + | ("ⓔ" : "Ⓔ") + | ("ⓕ" : "Ⓕ") + | ("ⓖ" : "Ⓖ") + | ("ⓗ" : "Ⓗ") + | ("ⓘ" : "Ⓘ") + | ("ⓙ" : "Ⓙ") + | ("ⓚ" : "Ⓚ") + | ("ⓛ" : "Ⓛ") + | ("ⓜ" : "Ⓜ") + | ("ⓝ" : "Ⓝ") + | ("ⓞ" : "Ⓞ") + | ("ⓟ" : "Ⓟ") + | ("ⓠ" : "Ⓠ") + | ("ⓡ" : "Ⓡ") + | ("ⓢ" : "Ⓢ") + | ("ⓣ" : "Ⓣ") + | ("ⓤ" : "Ⓤ") + | ("ⓥ" : "Ⓥ") + | ("ⓦ" : "Ⓦ") + | ("ⓧ" : "Ⓧ") + | ("ⓨ" : "Ⓨ") + | ("ⓩ" : "Ⓩ") + | ("ⰰ" : "Ⰰ") + | ("ⰱ" : "Ⰱ") + | ("ⰲ" : "Ⰲ") + | ("ⰳ" : "Ⰳ") + | ("ⰴ" : "Ⰴ") + | ("ⰵ" : "Ⰵ") + | ("ⰶ" : "Ⰶ") + | ("ⰷ" : "Ⰷ") + | ("ⰸ" : "Ⰸ") + | ("ⰹ" : "Ⰹ") + | ("ⰺ" : "Ⰺ") + | ("ⰻ" : "Ⰻ") + | ("ⰼ" : "Ⰼ") + | ("ⰽ" : "Ⰽ") + | ("ⰾ" : "Ⰾ") + | ("ⰿ" : "Ⰿ") + | ("ⱀ" : "Ⱀ") + | ("ⱁ" : "Ⱁ") + | ("ⱂ" : "Ⱂ") + | ("ⱃ" : "Ⱃ") + | ("ⱄ" : "Ⱄ") + | ("ⱅ" : "Ⱅ") + | ("ⱆ" : "Ⱆ") + | ("ⱇ" : "Ⱇ") + | ("ⱈ" : "Ⱈ") + | ("ⱉ" : "Ⱉ") + | ("ⱊ" : "Ⱊ") + | ("ⱋ" : "Ⱋ") + | ("ⱌ" : "Ⱌ") + | ("ⱍ" : "Ⱍ") + | ("ⱎ" : "Ⱎ") + | ("ⱏ" : "Ⱏ") + | ("ⱐ" : "Ⱐ") + | ("ⱑ" : "Ⱑ") + | ("ⱒ" : "Ⱒ") + | ("ⱓ" : "Ⱓ") + | ("ⱔ" : "Ⱔ") + | ("ⱕ" : "Ⱕ") + | ("ⱖ" : "Ⱖ") + | ("ⱗ" : "Ⱗ") + | ("ⱘ" : "Ⱘ") + | ("ⱙ" : "Ⱙ") + | ("ⱚ" : "Ⱚ") + | ("ⱛ" : "Ⱛ") + | ("ⱜ" : "Ⱜ") + | ("ⱝ" : "Ⱝ") + | ("ⱞ" : "Ⱞ") + | ("ⲁ" : "Ⲁ") + | ("ⲃ" : "Ⲃ") + | ("ⲅ" : "Ⲅ") + | ("ⲇ" : "Ⲇ") + | ("ⲉ" : "Ⲉ") + | ("ⲋ" : "Ⲋ") + | ("ⲍ" : "Ⲍ") + | ("ⲏ" : "Ⲏ") + | ("ⲑ" : "Ⲑ") + | ("ⲓ" : "Ⲓ") + | ("ⲕ" : "Ⲕ") + | ("ⲗ" : "Ⲗ") + | ("ⲙ" : "Ⲙ") + | ("ⲛ" : "Ⲛ") + | ("ⲝ" : "Ⲝ") + | ("ⲟ" : "Ⲟ") + | ("ⲡ" : "Ⲡ") + | ("ⲣ" : "Ⲣ") + | ("ⲥ" : "Ⲥ") + | ("ⲧ" : "Ⲧ") + | ("ⲩ" : "Ⲩ") + | ("ⲫ" : "Ⲫ") + | ("ⲭ" : "Ⲭ") + | ("ⲯ" : "Ⲯ") + | ("ⲱ" : "Ⲱ") + | ("ⲳ" : "Ⲳ") + | ("ⲵ" : "Ⲵ") + | ("ⲷ" : "Ⲷ") + | ("ⲹ" : "Ⲹ") + | ("ⲻ" : "Ⲻ") + | ("ⲽ" : "Ⲽ") + | ("ⲿ" : "Ⲿ") + | ("ⳁ" : "Ⳁ") + | ("ⳃ" : "Ⳃ") + | ("ⳅ" : "Ⳅ") + | ("ⳇ" : "Ⳇ") + | ("ⳉ" : "Ⳉ") + | ("ⳋ" : "Ⳋ") + | ("ⳍ" : "Ⳍ") + | ("ⳏ" : "Ⳏ") + | ("ⳑ" : "Ⳑ") + | ("ⳓ" : "Ⳓ") + | ("ⳕ" : "Ⳕ") + | ("ⳗ" : "Ⳗ") + | ("ⳙ" : "Ⳙ") + | ("ⳛ" : "Ⳛ") + | ("ⳝ" : "Ⳝ") + | ("ⳟ" : "Ⳟ") + | ("ⳡ" : "Ⳡ") + | ("ⳣ" : "Ⳣ") + | ("a" : "A") + | ("b" : "B") + | ("c" : "C") + | ("d" : "D") + | ("e" : "E") + | ("f" : "F") + | ("g" : "G") + | ("h" : "H") + | ("i" : "I") + | ("j" : "J") + | ("k" : "K") + | ("l" : "L") + | ("m" : "M") + | ("n" : "N") + | ("o" : "O") + | ("p" : "P") + | ("q" : "Q") + | ("r" : "R") + | ("s" : "S") + | ("t" : "T") + | ("u" : "U") + | ("v" : "V") + | ("w" : "W") + | ("x" : "X") + | ("y" : "Y") + | ("z" : "Z") +]; + +export toupper = toupper_deterministic + | ("i" : "İ") + | ("dž" : "Dž") + | ("lj" : "Lj") + | ("nj" : "Nj") + | ("dz" : "Dz") + | ("θ" : "ϴ"); + +export tolower = + ("A" : "a") + | ("B" : "b") + | ("C" : "c") + | ("D" : "d") + | ("E" : "e") + | ("F" : "f") + | ("G" : "g") + | ("H" : "h") + | ("I" : "i") + | ("J" : "j") + | ("K" : "k") + | ("L" : "l") + | ("M" : "m") + | ("N" : "n") + | ("O" : "o") + | ("P" : "p") + | ("Q" : "q") + | ("R" : "r") + | ("S" : "s") + | ("T" : "t") + | ("U" : "u") + | ("V" : "v") + | ("W" : "w") + | ("X" : "x") + | ("Y" : "y") + | ("Z" : "z") + | ("À" : "à") + | ("Á" : "á") + | ("Â" : "â") + | ("Ã" : "ã") + | ("Ä" : "ä") + | ("Å" : "å") + | ("Æ" : "æ") + | ("Ç" : "ç") + | ("È" : "è") + | ("É" : "é") + | ("Ê" : "ê") + | ("Ë" : "ë") + | ("Ì" : "ì") + | ("Í" : "í") + | ("Î" : "î") + | ("Ï" : "ï") + | ("Ð" : "ð") + | ("Ñ" : "ñ") + | ("Ò" : "ò") + | ("Ó" : "ó") + | ("Ô" : "ô") + | ("Õ" : "õ") + | ("Ö" : "ö") + | ("Ø" : "ø") + | ("Ù" : "ù") + | ("Ú" : "ú") + | ("Û" : "û") + | ("Ü" : "ü") + | ("Ý" : "ý") + | ("Þ" : "þ") + | ("Ā" : "ā") + | ("Ă" : "ă") + | ("Ą" : "ą") + | ("Ć" : "ć") + | ("Ĉ" : "ĉ") + | ("Ċ" : "ċ") + | ("Č" : "č") + | ("Ď" : "ď") + | ("Đ" : "đ") + | ("Ē" : "ē") + | ("Ĕ" : "ĕ") + | ("Ė" : "ė") + | ("Ę" : "ę") + | ("Ě" : "ě") + | ("Ĝ" : "ĝ") + | ("Ğ" : "ğ") + | ("Ġ" : "ġ") + | ("Ģ" : "ģ") + | ("Ĥ" : "ĥ") + | ("Ħ" : "ħ") + | ("Ĩ" : "ĩ") + | ("Ī" : "ī") + | ("Ĭ" : "ĭ") + | ("Į" : "į") + | ("İ" : "i") + | ("IJ" : "ij") + | ("Ĵ" : "ĵ") + | ("Ķ" : "ķ") + | ("Ĺ" : "ĺ") + | ("Ļ" : "ļ") + | ("Ľ" : "ľ") + | ("Ŀ" : "ŀ") + | ("Ł" : "ł") + | ("Ń" : "ń") + | ("Ņ" : "ņ") + | ("Ň" : "ň") + | ("Ŋ" : "ŋ") + | ("Ō" : "ō") + | ("Ŏ" : "ŏ") + | ("Ő" : "ő") + | ("Œ" : "œ") + | ("Ŕ" : "ŕ") + | ("Ŗ" : "ŗ") + | ("Ř" : "ř") + | ("Ś" : "ś") + | ("Ŝ" : "ŝ") + | ("Ş" : "ş") + | ("Š" : "š") + | ("Ţ" : "ţ") + | ("Ť" : "ť") + | ("Ŧ" : "ŧ") + | ("Ũ" : "ũ") + | ("Ū" : "ū") + | ("Ŭ" : "ŭ") + | ("Ů" : "ů") + | ("Ű" : "ű") + | ("Ų" : "ų") + | ("Ŵ" : "ŵ") + | ("Ŷ" : "ŷ") + | ("Ÿ" : "ÿ") + | ("Ź" : "ź") + | ("Ż" : "ż") + | ("Ž" : "ž") + | ("Ɓ" : "ɓ") + | ("Ƃ" : "ƃ") + | ("Ƅ" : "ƅ") + | ("Ɔ" : "ɔ") + | ("Ƈ" : "ƈ") + | ("Ɖ" : "ɖ") + | ("Ɗ" : "ɗ") + | ("Ƌ" : "ƌ") + | ("Ǝ" : "ǝ") + | ("Ə" : "ə") + | ("Ɛ" : "ɛ") + | ("Ƒ" : "ƒ") + | ("Ɠ" : "ɠ") + | ("Ɣ" : "ɣ") + | ("Ɩ" : "ɩ") + | ("Ɨ" : "ɨ") + | ("Ƙ" : "ƙ") + | ("Ɯ" : "ɯ") + | ("Ɲ" : "ɲ") + | ("Ɵ" : "ɵ") + | ("Ơ" : "ơ") + | ("Ƣ" : "ƣ") + | ("Ƥ" : "ƥ") + | ("Ƨ" : "ƨ") + | ("Ʃ" : "ʃ") + | ("Ƭ" : "ƭ") + | ("Ʈ" : "ʈ") + | ("Ư" : "ư") + | ("Ʊ" : "ʊ") + | ("Ʋ" : "ʋ") + | ("Ƴ" : "ƴ") + | ("Ƶ" : "ƶ") + | ("Ʒ" : "ʒ") + | ("Ƹ" : "ƹ") + | ("Ƽ" : "ƽ") + | ("DŽ" : "dž") + | ("Dž" : "dž") + | ("LJ" : "lj") + | ("Lj" : "lj") + | ("NJ" : "nj") + | ("Nj" : "nj") + | ("Ǎ" : "ǎ") + | ("Ǐ" : "ǐ") + | ("Ǒ" : "ǒ") + | ("Ǔ" : "ǔ") + | ("Ǖ" : "ǖ") + | ("Ǘ" : "ǘ") + | ("Ǚ" : "ǚ") + | ("Ǜ" : "ǜ") + | ("Ǟ" : "ǟ") + | ("Ǡ" : "ǡ") + | ("Ǣ" : "ǣ") + | ("Ǥ" : "ǥ") + | ("Ǧ" : "ǧ") + | ("Ǩ" : "ǩ") + | ("Ǫ" : "ǫ") + | ("Ǭ" : "ǭ") + | ("Ǯ" : "ǯ") + | ("DZ" : "dz") + | ("Dz" : "dz") + | ("Ǵ" : "ǵ") + | ("Ƕ" : "ƕ") + | ("Ƿ" : "ƿ") + | ("Ǹ" : "ǹ") + | ("Ǻ" : "ǻ") + | ("Ǽ" : "ǽ") + | ("Ǿ" : "ǿ") + | ("Ȁ" : "ȁ") + | ("Ȃ" : "ȃ") + | ("Ȅ" : "ȅ") + | ("Ȇ" : "ȇ") + | ("Ȉ" : "ȉ") + | ("Ȋ" : "ȋ") + | ("Ȍ" : "ȍ") + | ("Ȏ" : "ȏ") + | ("Ȑ" : "ȑ") + | ("Ȓ" : "ȓ") + | ("Ȕ" : "ȕ") + | ("Ȗ" : "ȗ") + | ("Ș" : "ș") + | ("Ț" : "ț") + | ("Ȝ" : "ȝ") + | ("Ȟ" : "ȟ") + | ("Ƞ" : "ƞ") + | ("Ȣ" : "ȣ") + | ("Ȥ" : "ȥ") + | ("Ȧ" : "ȧ") + | ("Ȩ" : "ȩ") + | ("Ȫ" : "ȫ") + | ("Ȭ" : "ȭ") + | ("Ȯ" : "ȯ") + | ("Ȱ" : "ȱ") + | ("Ȳ" : "ȳ") + | ("Ȼ" : "ȼ") + | ("Ƚ" : "ƚ") + | ("Ɂ" : "ʔ") + | ("Ά" : "ά") + | ("Έ" : "έ") + | ("Ή" : "ή") + | ("Ί" : "ί") + | ("Ό" : "ό") + | ("Ύ" : "ύ") + | ("Ώ" : "ώ") + | ("Α" : "α") + | ("Β" : "β") + | ("Γ" : "γ") + | ("Δ" : "δ") + | ("Ε" : "ε") + | ("Ζ" : "ζ") + | ("Η" : "η") + | ("Θ" : "θ") + | ("Ι" : "ι") + | ("Κ" : "κ") + | ("Λ" : "λ") + | ("Μ" : "μ") + | ("Ν" : "ν") + | ("Ξ" : "ξ") + | ("Ο" : "ο") + | ("Π" : "π") + | ("Ρ" : "ρ") + | ("Σ" : "σ") + | ("Τ" : "τ") + | ("Υ" : "υ") + | ("Φ" : "φ") + | ("Χ" : "χ") + | ("Ψ" : "ψ") + | ("Ω" : "ω") + | ("Ϊ" : "ϊ") + | ("Ϋ" : "ϋ") + | ("Ϣ" : "ϣ") + | ("Ϥ" : "ϥ") + | ("Ϧ" : "ϧ") + | ("Ϩ" : "ϩ") + | ("Ϫ" : "ϫ") + | ("Ϭ" : "ϭ") + | ("Ϯ" : "ϯ") + | ("ϴ" : "θ") + | ("Ϸ" : "ϸ") + | ("Ϲ" : "ϲ") + | ("Ϻ" : "ϻ") + | ("Ѐ" : "ѐ") + | ("Ё" : "ё") + | ("Ђ" : "ђ") + | ("Ѓ" : "ѓ") + | ("Є" : "є") + | ("Ѕ" : "ѕ") + | ("І" : "і") + | ("Ї" : "ї") + | ("Ј" : "ј") + | ("Љ" : "љ") + | ("Њ" : "њ") + | ("Ћ" : "ћ") + | ("Ќ" : "ќ") + | ("Ѝ" : "ѝ") + | ("Ў" : "ў") + | ("Џ" : "џ") + | ("А" : "а") + | ("Б" : "б") + | ("В" : "в") + | ("Г" : "г") + | ("Д" : "д") + | ("Е" : "е") + | ("Ж" : "ж") + | ("З" : "з") + | ("И" : "и") + | ("Й" : "й") + | ("К" : "к") + | ("Л" : "л") + | ("М" : "м") + | ("Н" : "н") + | ("О" : "о") + | ("П" : "п") + | ("Р" : "р") + | ("С" : "с") + | ("Т" : "т") + | ("У" : "у") + | ("Ф" : "ф") + | ("Х" : "х") + | ("Ц" : "ц") + | ("Ч" : "ч") + | ("Ш" : "ш") + | ("Щ" : "щ") + | ("Ъ" : "ъ") + | ("Ы" : "ы") + | ("Ь" : "ь") + | ("Э" : "э") + | ("Ю" : "ю") + | ("Я" : "я") + | ("Ѡ" : "ѡ") + | ("Ѣ" : "ѣ") + | ("Ѥ" : "ѥ") + | ("Ѧ" : "ѧ") + | ("Ѩ" : "ѩ") + | ("Ѫ" : "ѫ") + | ("Ѭ" : "ѭ") + | ("Ѯ" : "ѯ") + | ("Ѱ" : "ѱ") + | ("Ѳ" : "ѳ") + | ("Ѵ" : "ѵ") + | ("Ѷ" : "ѷ") + | ("Ѹ" : "ѹ") + | ("Ѻ" : "ѻ") + | ("Ѽ" : "ѽ") + | ("Ѿ" : "ѿ") + | ("Ҁ" : "ҁ") + | ("Ҋ" : "ҋ") + | ("Ҍ" : "ҍ") + | ("Ҏ" : "ҏ") + | ("Ґ" : "ґ") + | ("Ғ" : "ғ") + | ("Ҕ" : "ҕ") + | ("Җ" : "җ") + | ("Ҙ" : "ҙ") + | ("Қ" : "қ") + | ("Ҝ" : "ҝ") + | ("Ҟ" : "ҟ") + | ("Ҡ" : "ҡ") + | ("Ң" : "ң") + | ("Ҥ" : "ҥ") + | ("Ҧ" : "ҧ") + | ("Ҩ" : "ҩ") + | ("Ҫ" : "ҫ") + | ("Ҭ" : "ҭ") + | ("Ү" : "ү") + | ("Ұ" : "ұ") + | ("Ҳ" : "ҳ") + | ("Ҵ" : "ҵ") + | ("Ҷ" : "ҷ") + | ("Ҹ" : "ҹ") + | ("Һ" : "һ") + | ("Ҽ" : "ҽ") + | ("Ҿ" : "ҿ") + | ("Ӂ" : "ӂ") + | ("Ӄ" : "ӄ") + | ("Ӆ" : "ӆ") + | ("Ӈ" : "ӈ") + | ("Ӊ" : "ӊ") + | ("Ӌ" : "ӌ") + | ("Ӎ" : "ӎ") + | ("Ӑ" : "ӑ") + | ("Ӓ" : "ӓ") + | ("Ӕ" : "ӕ") + | ("Ӗ" : "ӗ") + | ("Ә" : "ә") + | ("Ӛ" : "ӛ") + | ("Ӝ" : "ӝ") + | ("Ӟ" : "ӟ") + | ("Ӡ" : "ӡ") + | ("Ӣ" : "ӣ") + | ("Ӥ" : "ӥ") + | ("Ӧ" : "ӧ") + | ("Ө" : "ө") + | ("Ӫ" : "ӫ") + | ("Ӭ" : "ӭ") + | ("Ӯ" : "ӯ") + | ("Ӱ" : "ӱ") + | ("Ӳ" : "ӳ") + | ("Ӵ" : "ӵ") + | ("Ӷ" : "ӷ") + | ("Ӹ" : "ӹ") + | ("Ԁ" : "ԁ") + | ("Ԃ" : "ԃ") + | ("Ԅ" : "ԅ") + | ("Ԇ" : "ԇ") + | ("Ԉ" : "ԉ") + | ("Ԋ" : "ԋ") + | ("Ԍ" : "ԍ") + | ("Ԏ" : "ԏ") + | ("Ա" : "ա") + | ("Բ" : "բ") + | ("Գ" : "գ") + | ("Դ" : "դ") + | ("Ե" : "ե") + | ("Զ" : "զ") + | ("Է" : "է") + | ("Ը" : "ը") + | ("Թ" : "թ") + | ("Ժ" : "ժ") + | ("Ի" : "ի") + | ("Լ" : "լ") + | ("Խ" : "խ") + | ("Ծ" : "ծ") + | ("Կ" : "կ") + | ("Հ" : "հ") + | ("Ձ" : "ձ") + | ("Ղ" : "ղ") + | ("Ճ" : "ճ") + | ("Մ" : "մ") + | ("Յ" : "յ") + | ("Ն" : "ն") + | ("Շ" : "շ") + | ("Ո" : "ո") + | ("Չ" : "չ") + | ("Պ" : "պ") + | ("Ջ" : "ջ") + | ("Ռ" : "ռ") + | ("Ս" : "ս") + | ("Վ" : "վ") + | ("Տ" : "տ") + | ("Ր" : "ր") + | ("Ց" : "ց") + | ("Ւ" : "ւ") + | ("Փ" : "փ") + | ("Ք" : "ք") + | ("Օ" : "օ") + | ("Ֆ" : "ֆ") + | ("Ⴀ" : "ⴀ") + | ("Ⴁ" : "ⴁ") + | ("Ⴂ" : "ⴂ") + | ("Ⴃ" : "ⴃ") + | ("Ⴄ" : "ⴄ") + | ("Ⴅ" : "ⴅ") + | ("Ⴆ" : "ⴆ") + | ("Ⴇ" : "ⴇ") + | ("Ⴈ" : "ⴈ") + | ("Ⴉ" : "ⴉ") + | ("Ⴊ" : "ⴊ") + | ("Ⴋ" : "ⴋ") + | ("Ⴌ" : "ⴌ") + | ("Ⴍ" : "ⴍ") + | ("Ⴎ" : "ⴎ") + | ("Ⴏ" : "ⴏ") + | ("Ⴐ" : "ⴐ") + | ("Ⴑ" : "ⴑ") + | ("Ⴒ" : "ⴒ") + | ("Ⴓ" : "ⴓ") + | ("Ⴔ" : "ⴔ") + | ("Ⴕ" : "ⴕ") + | ("Ⴖ" : "ⴖ") + | ("Ⴗ" : "ⴗ") + | ("Ⴘ" : "ⴘ") + | ("Ⴙ" : "ⴙ") + | ("Ⴚ" : "ⴚ") + | ("Ⴛ" : "ⴛ") + | ("Ⴜ" : "ⴜ") + | ("Ⴝ" : "ⴝ") + | ("Ⴞ" : "ⴞ") + | ("Ⴟ" : "ⴟ") + | ("Ⴠ" : "ⴠ") + | ("Ⴡ" : "ⴡ") + | ("Ⴢ" : "ⴢ") + | ("Ⴣ" : "ⴣ") + | ("Ⴤ" : "ⴤ") + | ("Ⴥ" : "ⴥ") + | ("Ḁ" : "ḁ") + | ("Ḃ" : "ḃ") + | ("Ḅ" : "ḅ") + | ("Ḇ" : "ḇ") + | ("Ḉ" : "ḉ") + | ("Ḋ" : "ḋ") + | ("Ḍ" : "ḍ") + | ("Ḏ" : "ḏ") + | ("Ḑ" : "ḑ") + | ("Ḓ" : "ḓ") + | ("Ḕ" : "ḕ") + | ("Ḗ" : "ḗ") + | ("Ḙ" : "ḙ") + | ("Ḛ" : "ḛ") + | ("Ḝ" : "ḝ") + | ("Ḟ" : "ḟ") + | ("Ḡ" : "ḡ") + | ("Ḣ" : "ḣ") + | ("Ḥ" : "ḥ") + | ("Ḧ" : "ḧ") + | ("Ḩ" : "ḩ") + | ("Ḫ" : "ḫ") + | ("Ḭ" : "ḭ") + | ("Ḯ" : "ḯ") + | ("Ḱ" : "ḱ") + | ("Ḳ" : "ḳ") + | ("Ḵ" : "ḵ") + | ("Ḷ" : "ḷ") + | ("Ḹ" : "ḹ") + | ("Ḻ" : "ḻ") + | ("Ḽ" : "ḽ") + | ("Ḿ" : "ḿ") + | ("Ṁ" : "ṁ") + | ("Ṃ" : "ṃ") + | ("Ṅ" : "ṅ") + | ("Ṇ" : "ṇ") + | ("Ṉ" : "ṉ") + | ("Ṋ" : "ṋ") + | ("Ṍ" : "ṍ") + | ("Ṏ" : "ṏ") + | ("Ṑ" : "ṑ") + | ("Ṓ" : "ṓ") + | ("Ṕ" : "ṕ") + | ("Ṗ" : "ṗ") + | ("Ṙ" : "ṙ") + | ("Ṛ" : "ṛ") + | ("Ṝ" : "ṝ") + | ("Ṟ" : "ṟ") + | ("Ṡ" : "ṡ") + | ("Ṣ" : "ṣ") + | ("Ṥ" : "ṥ") + | ("Ṧ" : "ṧ") + | ("Ṩ" : "ṩ") + | ("Ṫ" : "ṫ") + | ("Ṭ" : "ṭ") + | ("Ṯ" : "ṯ") + | ("Ṱ" : "ṱ") + | ("Ṳ" : "ṳ") + | ("Ṵ" : "ṵ") + | ("Ṷ" : "ṷ") + | ("Ṹ" : "ṹ") + | ("Ṻ" : "ṻ") + | ("Ṽ" : "ṽ") + | ("Ṿ" : "ṿ") + | ("Ẁ" : "ẁ") + | ("Ẃ" : "ẃ") + | ("Ẅ" : "ẅ") + | ("Ẇ" : "ẇ") + | ("Ẉ" : "ẉ") + | ("Ẋ" : "ẋ") + | ("Ẍ" : "ẍ") + | ("Ẏ" : "ẏ") + | ("Ẑ" : "ẑ") + | ("Ẓ" : "ẓ") + | ("Ẕ" : "ẕ") + | ("Ạ" : "ạ") + | ("Ả" : "ả") + | ("Ấ" : "ấ") + | ("Ầ" : "ầ") + | ("Ẩ" : "ẩ") + | ("Ẫ" : "ẫ") + | ("Ậ" : "ậ") + | ("Ắ" : "ắ") + | ("Ằ" : "ằ") + | ("Ẳ" : "ẳ") + | ("Ẵ" : "ẵ") + | ("Ặ" : "ặ") + | ("Ẹ" : "ẹ") + | ("Ẻ" : "ẻ") + | ("Ẽ" : "ẽ") + | ("Ế" : "ế") + | ("Ề" : "ề") + | ("Ể" : "ể") + | ("Ễ" : "ễ") + | ("Ệ" : "ệ") + | ("Ỉ" : "ỉ") + | ("Ị" : "ị") + | ("Ọ" : "ọ") + | ("Ỏ" : "ỏ") + | ("Ố" : "ố") + | ("Ồ" : "ồ") + | ("Ổ" : "ổ") + | ("Ỗ" : "ỗ") + | ("Ộ" : "ộ") + | ("Ớ" : "ớ") + | ("Ờ" : "ờ") + | ("Ở" : "ở") + | ("Ỡ" : "ỡ") + | ("Ợ" : "ợ") + | ("Ụ" : "ụ") + | ("Ủ" : "ủ") + | ("Ứ" : "ứ") + | ("Ừ" : "ừ") + | ("Ử" : "ử") + | ("Ữ" : "ữ") + | ("Ự" : "ự") + | ("Ỳ" : "ỳ") + | ("Ỵ" : "ỵ") + | ("Ỷ" : "ỷ") + | ("Ỹ" : "ỹ") + | ("Ἀ" : "ἀ") + | ("Ἁ" : "ἁ") + | ("Ἂ" : "ἂ") + | ("Ἃ" : "ἃ") + | ("Ἄ" : "ἄ") + | ("Ἅ" : "ἅ") + | ("Ἆ" : "ἆ") + | ("Ἇ" : "ἇ") + | ("Ἐ" : "ἐ") + | ("Ἑ" : "ἑ") + | ("Ἒ" : "ἒ") + | ("Ἓ" : "ἓ") + | ("Ἔ" : "ἔ") + | ("Ἕ" : "ἕ") + | ("Ἠ" : "ἠ") + | ("Ἡ" : "ἡ") + | ("Ἢ" : "ἢ") + | ("Ἣ" : "ἣ") + | ("Ἤ" : "ἤ") + | ("Ἥ" : "ἥ") + | ("Ἦ" : "ἦ") + | ("Ἧ" : "ἧ") + | ("Ἰ" : "ἰ") + | ("Ἱ" : "ἱ") + | ("Ἲ" : "ἲ") + | ("Ἳ" : "ἳ") + | ("Ἴ" : "ἴ") + | ("Ἵ" : "ἵ") + | ("Ἶ" : "ἶ") + | ("Ἷ" : "ἷ") + | ("Ὀ" : "ὀ") + | ("Ὁ" : "ὁ") + | ("Ὂ" : "ὂ") + | ("Ὃ" : "ὃ") + | ("Ὄ" : "ὄ") + | ("Ὅ" : "ὅ") + | ("Ὑ" : "ὑ") + | ("Ὓ" : "ὓ") + | ("Ὕ" : "ὕ") + | ("Ὗ" : "ὗ") + | ("Ὠ" : "ὠ") + | ("Ὡ" : "ὡ") + | ("Ὢ" : "ὢ") + | ("Ὣ" : "ὣ") + | ("Ὤ" : "ὤ") + | ("Ὥ" : "ὥ") + | ("Ὦ" : "ὦ") + | ("Ὧ" : "ὧ") + | ("ᾈ" : "ᾀ") + | ("ᾉ" : "ᾁ") + | ("ᾊ" : "ᾂ") + | ("ᾋ" : "ᾃ") + | ("ᾌ" : "ᾄ") + | ("ᾍ" : "ᾅ") + | ("ᾎ" : "ᾆ") + | ("ᾏ" : "ᾇ") + | ("ᾘ" : "ᾐ") + | ("ᾙ" : "ᾑ") + | ("ᾚ" : "ᾒ") + | ("ᾛ" : "ᾓ") + | ("ᾜ" : "ᾔ") + | ("ᾝ" : "ᾕ") + | ("ᾞ" : "ᾖ") + | ("ᾟ" : "ᾗ") + | ("ᾨ" : "ᾠ") + | ("ᾩ" : "ᾡ") + | ("ᾪ" : "ᾢ") + | ("ᾫ" : "ᾣ") + | ("ᾬ" : "ᾤ") + | ("ᾭ" : "ᾥ") + | ("ᾮ" : "ᾦ") + | ("ᾯ" : "ᾧ") + | ("Ᾰ" : "ᾰ") + | ("Ᾱ" : "ᾱ") + | ("Ὰ" : "ὰ") + | ("Ά" : "ά") + | ("ᾼ" : "ᾳ") + | ("Ὲ" : "ὲ") + | ("Έ" : "έ") + | ("Ὴ" : "ὴ") + | ("Ή" : "ή") + | ("ῌ" : "ῃ") + | ("Ῐ" : "ῐ") + | ("Ῑ" : "ῑ") + | ("Ὶ" : "ὶ") + | ("Ί" : "ί") + | ("Ῠ" : "ῠ") + | ("Ῡ" : "ῡ") + | ("Ὺ" : "ὺ") + | ("Ύ" : "ύ") + | ("Ῥ" : "ῥ") + | ("Ὸ" : "ὸ") + | ("Ό" : "ό") + | ("Ὼ" : "ὼ") + | ("Ώ" : "ώ") + | ("ῼ" : "ῳ") + | ("Ⓐ" : "ⓐ") + | ("Ⓑ" : "ⓑ") + | ("Ⓒ" : "ⓒ") + | ("Ⓓ" : "ⓓ") + | ("Ⓔ" : "ⓔ") + | ("Ⓕ" : "ⓕ") + | ("Ⓖ" : "ⓖ") + | ("Ⓗ" : "ⓗ") + | ("Ⓘ" : "ⓘ") + | ("Ⓙ" : "ⓙ") + | ("Ⓚ" : "ⓚ") + | ("Ⓛ" : "ⓛ") + | ("Ⓜ" : "ⓜ") + | ("Ⓝ" : "ⓝ") + | ("Ⓞ" : "ⓞ") + | ("Ⓟ" : "ⓟ") + | ("Ⓠ" : "ⓠ") + | ("Ⓡ" : "ⓡ") + | ("Ⓢ" : "ⓢ") + | ("Ⓣ" : "ⓣ") + | ("Ⓤ" : "ⓤ") + | ("Ⓥ" : "ⓥ") + | ("Ⓦ" : "ⓦ") + | ("Ⓧ" : "ⓧ") + | ("Ⓨ" : "ⓨ") + | ("Ⓩ" : "ⓩ") + | ("Ⰰ" : "ⰰ") + | ("Ⰱ" : "ⰱ") + | ("Ⰲ" : "ⰲ") + | ("Ⰳ" : "ⰳ") + | ("Ⰴ" : "ⰴ") + | ("Ⰵ" : "ⰵ") + | ("Ⰶ" : "ⰶ") + | ("Ⰷ" : "ⰷ") + | ("Ⰸ" : "ⰸ") + | ("Ⰹ" : "ⰹ") + | ("Ⰺ" : "ⰺ") + | ("Ⰻ" : "ⰻ") + | ("Ⰼ" : "ⰼ") + | ("Ⰽ" : "ⰽ") + | ("Ⰾ" : "ⰾ") + | ("Ⰿ" : "ⰿ") + | ("Ⱀ" : "ⱀ") + | ("Ⱁ" : "ⱁ") + | ("Ⱂ" : "ⱂ") + | ("Ⱃ" : "ⱃ") + | ("Ⱄ" : "ⱄ") + | ("Ⱅ" : "ⱅ") + | ("Ⱆ" : "ⱆ") + | ("Ⱇ" : "ⱇ") + | ("Ⱈ" : "ⱈ") + | ("Ⱉ" : "ⱉ") + | ("Ⱊ" : "ⱊ") + | ("Ⱋ" : "ⱋ") + | ("Ⱌ" : "ⱌ") + | ("Ⱍ" : "ⱍ") + | ("Ⱎ" : "ⱎ") + | ("Ⱏ" : "ⱏ") + | ("Ⱐ" : "ⱐ") + | ("Ⱑ" : "ⱑ") + | ("Ⱒ" : "ⱒ") + | ("Ⱓ" : "ⱓ") + | ("Ⱔ" : "ⱔ") + | ("Ⱕ" : "ⱕ") + | ("Ⱖ" : "ⱖ") + | ("Ⱗ" : "ⱗ") + | ("Ⱘ" : "ⱘ") + | ("Ⱙ" : "ⱙ") + | ("Ⱚ" : "ⱚ") + | ("Ⱛ" : "ⱛ") + | ("Ⱜ" : "ⱜ") + | ("Ⱝ" : "ⱝ") + | ("Ⱞ" : "ⱞ") + | ("Ⲁ" : "ⲁ") + | ("Ⲃ" : "ⲃ") + | ("Ⲅ" : "ⲅ") + | ("Ⲇ" : "ⲇ") + | ("Ⲉ" : "ⲉ") + | ("Ⲋ" : "ⲋ") + | ("Ⲍ" : "ⲍ") + | ("Ⲏ" : "ⲏ") + | ("Ⲑ" : "ⲑ") + | ("Ⲓ" : "ⲓ") + | ("Ⲕ" : "ⲕ") + | ("Ⲗ" : "ⲗ") + | ("Ⲙ" : "ⲙ") + | ("Ⲛ" : "ⲛ") + | ("Ⲝ" : "ⲝ") + | ("Ⲟ" : "ⲟ") + | ("Ⲡ" : "ⲡ") + | ("Ⲣ" : "ⲣ") + | ("Ⲥ" : "ⲥ") + | ("Ⲧ" : "ⲧ") + | ("Ⲩ" : "ⲩ") + | ("Ⲫ" : "ⲫ") + | ("Ⲭ" : "ⲭ") + | ("Ⲯ" : "ⲯ") + | ("Ⲱ" : "ⲱ") + | ("Ⲳ" : "ⲳ") + | ("Ⲵ" : "ⲵ") + | ("Ⲷ" : "ⲷ") + | ("Ⲹ" : "ⲹ") + | ("Ⲻ" : "ⲻ") + | ("Ⲽ" : "ⲽ") + | ("Ⲿ" : "ⲿ") + | ("Ⳁ" : "ⳁ") + | ("Ⳃ" : "ⳃ") + | ("Ⳅ" : "ⳅ") + | ("Ⳇ" : "ⳇ") + | ("Ⳉ" : "ⳉ") + | ("Ⳋ" : "ⳋ") + | ("Ⳍ" : "ⳍ") + | ("Ⳏ" : "ⳏ") + | ("Ⳑ" : "ⳑ") + | ("Ⳓ" : "ⳓ") + | ("Ⳕ" : "ⳕ") + | ("Ⳗ" : "ⳗ") + | ("Ⳙ" : "ⳙ") + | ("Ⳛ" : "ⳛ") + | ("Ⳝ" : "ⳝ") + | ("Ⳟ" : "ⳟ") + | ("Ⳡ" : "ⳡ") + | ("Ⳣ" : "ⳣ") + | ("A" : "a") + | ("B" : "b") + | ("C" : "c") + | ("D" : "d") + | ("E" : "e") + | ("F" : "f") + | ("G" : "g") + | ("H" : "h") + | ("I" : "i") + | ("J" : "j") + | ("K" : "k") + | ("L" : "l") + | ("M" : "m") + | ("N" : "n") + | ("O" : "o") + | ("P" : "p") + | ("Q" : "q") + | ("R" : "r") + | ("S" : "s") + | ("T" : "t") + | ("U" : "u") + | ("V" : "v") + | ("W" : "w") + | ("X" : "x") + | ("Y" : "y") + | ("Z" : "z") +; + +sigma_star = Optimize[bytelib.kBytes*] ; + +export TOUPPER_DETERMINISTIC = + CDRewrite[toupper_deterministic, "", "", sigma_star, 'ltr', 'obl'] ; + +export TOUPPER = CDRewrite[toupper, "", "", sigma_star, 'ltr', 'obl'] ; + +export TOLOWER = CDRewrite[tolower, "", "", sigma_star, 'ltr', 'obl'] ; + +export a_through_z_toupper = Optimize[ + ("a" : "A") + | ("b" : "B") + | ("c" : "C") + | ("d" : "D") + | ("e" : "E") + | ("f" : "F") + | ("g" : "G") + | ("h" : "H") + | ("i" : "I") + | ("j" : "J") + | ("k" : "K") + | ("l" : "L") + | ("m" : "M") + | ("n" : "N") + | ("o" : "O") + | ("p" : "P") + | ("q" : "Q") + | ("r" : "R") + | ("s" : "S") + | ("t" : "T") + | ("u" : "U") + | ("v" : "V") + | ("w" : "W") + | ("x" : "X") + | ("y" : "Y") + | ("z" : "Z") +]; diff --git a/third_party/chinese_text_normalization/thrax/src/util/germanic.tsv b/third_party/chinese_text_normalization/thrax/src/util/germanic.tsv new file mode 100644 index 000000000..6285e0106 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/util/germanic.tsv @@ -0,0 +1,81 @@ +1 10 11 +2 10 12 +3 10 13 +4 10 14 +5 10 15 +6 10 16 +7 10 17 +8 10 18 +9 10 19 +1 20 21 +2 20 22 +3 20 23 +4 20 24 +5 20 25 +6 20 26 +7 20 27 +8 20 28 +9 20 29 +1 30 31 +2 30 32 +3 30 33 +4 30 34 +5 30 35 +6 30 36 +7 30 37 +8 30 38 +9 30 39 +1 40 41 +2 40 42 +3 40 43 +4 40 44 +5 40 45 +6 40 46 +7 40 47 +8 40 48 +9 40 49 +1 50 51 +2 50 52 +3 50 53 +4 50 54 +5 50 55 +6 50 56 +7 50 57 +8 50 58 +9 50 59 +1 60 61 +2 60 62 +3 60 63 +4 60 64 +5 60 65 +6 60 66 +7 60 67 +8 60 68 +9 60 69 +1 70 71 +2 70 72 +3 70 73 +4 70 74 +5 70 75 +6 70 76 +7 70 77 +8 70 78 +9 70 79 +1 80 81 +2 80 82 +3 80 83 +4 80 84 +5 80 85 +6 80 86 +7 80 87 +8 80 88 +9 80 89 +1 90 91 +2 90 92 +3 90 93 +4 90 94 +5 90 95 +6 90 96 +7 90 97 +8 90 98 +9 90 99 diff --git a/third_party/chinese_text_normalization/thrax/src/util/util.grm b/third_party/chinese_text_normalization/thrax/src/util/util.grm new file mode 100644 index 000000000..bb559235f --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/src/util/util.grm @@ -0,0 +1,528 @@ +# Copyright 2017 Google Inc. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Utility functions. + +import 'util/byte.grm' as bytelib; +import 'util/case.grm' as case; + +# A simplification helper function that encapsulates the left-to-right and +# obligatory options. +func CDR[t, l, r, s] { + return CDRewrite[t, l, r, s, 'ltr', 'obl']; +} + +# Useful insertion and deletion functions. + +func I[expr] { + return "" : expr; +} + +func D[expr] { + return expr : ""; +} + +# A machine that accepts nothing. +export NULL = Optimize["" - ""]; + +export d1to9 = Optimize[bytelib.kDigit - "0"]; +export d02to9 = Optimize[bytelib.kDigit - "1"]; +export d2to9 = Optimize[d02to9 - "0"]; +# Any number that isn't zero. May have leading zeroes. +export non_zero_number = Optimize["0"* d1to9 bytelib.kDigit*]; +# Any number, allowing for factorization markers. +export factorized_number = Optimize[(bytelib.kDigit | "\[" | "E" | "\]")*]; +export non_zero_factorized_number = Optimize["0"* d1to9 factorized_number]; + +export ins_space = "" : " "; +export ins_sil = "" : " sil "; +export ins_short_sil = "" : " sil|short "; +export ins_quote = "" : "\""; + +# Caveat: pass_anything does not pass stuff like "[~~]". +export pass_anything = bytelib.kBytes*; +export pass_any_word = bytelib.kNotSpace+; + +export pass_space_plus = bytelib.kSpace+; +export pass_space_star = bytelib.kSpace*; + +export clear_space = bytelib.kSpace : ""; +export clear_space_plus = bytelib.kSpace+ : ""; +export clear_space_star = bytelib.kSpace* : ""; + +export space_to_underscore = (bytelib.kAlnum | (" " : "_"))*; +export one_space = clear_space_star ins_space; + +export CLEAN_SPACES = Optimize[ + "" | (clear_space_star + (pass_any_word (bytelib.kSpace+ : " "))* + pass_any_word clear_space_star)] +; + +export del_space_star = " "* : ""; +export del_space_plus = " "+ : ""; + +export sigma_star = Optimize[pass_anything]; + +export DELETE_SPACES = + CDRewrite[clear_space_plus, "", "", sigma_star]; + +export REMOVE_LEADING_SPACES = + CDRewrite[clear_space_plus, "[BOS]", "", sigma_star]; + +export REMOVE_FINAL_SPACES = + CDRewrite[clear_space_plus, "", "[EOS]", sigma_star]; + +export REMOVE_BOUNDARY_SPACES = REMOVE_LEADING_SPACES @ REMOVE_FINAL_SPACES; + +export delete_initial_zero = + CDRewrite["0" : "", "[BOS]", bytelib.kDigit, sigma_star]; + +export lower_case_letter = Optimize[case.tolower | case.LOWER | bytelib.kLower]; +export lower_case = Optimize[lower_case_letter+]; +export lower_case_anything = case.TOLOWER; + +export upper_case_letter = Optimize[case.toupper | case.UPPER | bytelib.kUpper]; +export upper_case = Optimize[upper_case_letter+]; +export upper_case_anything = case.TOUPPER; + +export opening_brace = del_space_star ("{" : "") del_space_star; +export closing_brace = del_space_star ("}" : "") del_space_star; + +export quote = del_space_star ("\"" : "") del_space_star; +export double_quote = del_space_star ("\"\"" : "") del_space_star; + +export VOWELS = Optimize["a" | "e" | "i" | "o" | "u"]; +export VOWELS_Y = Optimize["a" | "e" | "i" | "o" | "u" | "y"]; +export VOWELS_INSENSITIVE = Optimize[VOWELS_Y | "A" | "E" | "I" + | "O" | "U" | "Y"]; +export CONSONANTS = Optimize[bytelib.kLower - VOWELS]; +export CONSONANTS_INSENSITIVE = Optimize[bytelib.kAlpha - VOWELS_INSENSITIVE]; + +# LSEQs that can be used for URL verbalization for all languages; +# mainly protocol names & file extensions. +export URL_LSEQS = Optimize["www" | "edu" | "ftp" | "htm" | "html" | "imdb" | + "php" | "asp" | "aspx" | "bbc" | "cgi" | "xhtml" | + "shtml" | "jsp"]; + +# Rule for swapping cardinal to decimal; useful for measures where +# both can appear in the proto but may be handled similarly. +export CARDINAL_TO_DECIMAL = Optimize[ + CDRewrite["cardinal" : "decimal", "", "", sigma_star] @ + CDRewrite["integer:" : "integer_part:", "", "", sigma_star] +]; + +export escape_quotes_and_backslashes = + ((bytelib.kBytes - "\"" - "\\") | ("\"" : "\\\"") | ("\\" : "\\\\"))* +; + +## Generally useful definition: + +export hours = + "0" + | "1" + | "2" + | "3" + | "4" + | "5" + | "6" + | "7" + | "8" + | "9" + | "10" + | "11" + | "12" + | "13" + | "14" + | "15" + | "16" + | "17" + | "18" + | "19" + | "20" + | "21" + | "22" + | "23" + | "24" +; + +export hours_shift = + ("0" : "1") + | ("1" : "2") + | ("2" : "3") + | ("3" : "4") + | ("4" : "5") + | ("5" : "6") + | ("6" : "7") + | ("7" : "8") + | ("8" : "9") + | ("9" : "10") + | ("10" : "11") + | ("11" : "12") + | ("12" : "13") + | ("13" : "14") + | ("14" : "15") + | ("15" : "16") + | ("16" : "17") + | ("17" : "18") + | ("18" : "19") + | ("19" : "20") + | ("20" : "21") + | ("21" : "22") + | ("22" : "23") + | ("23" : "24") + | ("24" : "1") +; + +export hours_24_to_12 = + ("0" : "12") + | "1" + | "2" + | "3" + | "4" + | "5" + | "6" + | "7" + | "8" + | "9" + | "10" + | "11" + | "12" + | ("13" : "1") + | ("14" : "2") + | ("15" : "3") + | ("16" : "4") + | ("17" : "5") + | ("18" : "6") + | ("19" : "7") + | ("20" : "8") + | ("21" : "9") + | ("22" : "10") + | ("23" : "11") + | ("24" : "12") +; + +export hours_24_to_12_next = + ("0" : "1") + | ("1" : "2") + | ("2" : "3") + | ("3" : "4") + | ("4" : "5") + | ("5" : "6") + | ("6" : "7") + | ("7" : "8") + | ("8" : "9") + | ("9" : "10") + | ("10" : "11") + | ("11" : "12") + | ("12" : "1") + | ("13" : "2") + | ("14" : "3") + | ("15" : "4") + | ("16" : "5") + | ("17" : "6") + | ("18" : "7") + | ("19" : "8") + | ("20" : "9") + | ("21" : "10") + | ("22" : "11") + | ("23" : "12") + | ("24" : "1") +; + +export minutes = + "0" + | "1" + | "2" + | "3" + | "4" + | "5" + | "6" + | "7" + | "8" + | "9" + | "10" + | "11" + | "12" + | "13" + | "14" + | "15" + | "16" + | "17" + | "18" + | "19" + | "20" + | "21" + | "22" + | "23" + | "24" + | "25" + | "26" + | "27" + | "28" + | "29" + | "30" + | "31" + | "32" + | "33" + | "34" + | "35" + | "36" + | "37" + | "38" + | "39" + | "40" + | "41" + | "42" + | "43" + | "44" + | "45" + | "46" + | "47" + | "48" + | "49" + | "50" + | "51" + | "52" + | "53" + | "54" + | "55" + | "56" + | "57" + | "58" + | "59" +; + +export round_minutes = + ("1" : "0") + | ("2" : "0") + | ("3" : "5") + | ("4" : "5") + | ("6" : "5") + | ("7" : "5") + | ("8" : "10") + | ("9" : "10") + | ("11" : "10") + | ("12" : "10") + | ("13" : "15") + | ("14" : "15") + | ("16" : "15") + | ("17" : "15") + | ("18" : "20") + | ("19" : "20") + | ("21" : "20") + | ("22" : "20") + | ("23" : "25") + | ("24" : "25") + | ("26" : "25") + | ("27" : "25") + | ("28" : "30") + | ("29" : "30") + | ("31" : "30") + | ("32" : "30") + | ("33" : "35") + | ("34" : "35") + | ("36" : "35") + | ("37" : "35") + | ("38" : "40") + | ("39" : "40") + | ("41" : "40") + | ("42" : "40") + | ("43" : "45") + | ("44" : "45") + | ("46" : "45") + | ("47" : "45") + | ("48" : "50") + | ("49" : "50") + | ("51" : "50") + | ("52" : "50") + | ("53" : "55") + | ("54" : "55") + | ("56" : "55") + | ("57" : "55") +; + +export unrounded_minutes = + ("0" : "0") + | ("5" : "5") + | ("10" : "10") + | ("15" : "15") + | ("20" : "20") + | ("25" : "25") + | ("30" : "30") + | ("35" : "35") + | ("40" : "40") + | ("45" : "45") + | ("50" : "50") + | ("55" : "55") +; + +export round_minutes_next_hour = + ("58" : "0") + | ("59" : "0") +; + +export subtract_from_60 = + "30" + | ("31" : "29" ) + | ("32" : "28" ) + | ("33" : "27" ) + | ("34" : "26" ) + | ("35" : "25" ) + | ("36" : "24" ) + | ("37" : "23" ) + | ("38" : "22" ) + | ("39" : "21" ) + | ("40" : "20" ) + | ("41" : "19" ) + | ("42" : "18" ) + | ("43" : "17" ) + | ("44" : "16" ) + | ("45" : "15" ) + | ("46" : "14" ) + | ("47" : "13" ) + | ("48" : "12" ) + | ("49" : "11" ) + | ("50" : "10" ) + | ("51" : "9" ) + | ("52" : "8" ) + | ("53" : "7" ) + | ("54" : "6" ) + | ("55" : "5" ) + | ("56" : "4" ) + | ("57" : "3" ) + | ("58" : "2" ) + | ("59" : "1" ) +; + +export any_month = + (("0" : "")? + ( + "1" + | "2" + | "3" + | "4" + | "5" + | "6" + | "7" + | "8" + | "9" + )) + | "10" + | "11" + | "12" +; + +export any_day = + (("0" : "")? + ( + "1" + | "2" + | "3" + | "4" + | "5" + | "6" + | "7" + | "8" + | "9" + )) + | "10" + | "11" + | "12" + | "13" + | "14" + | "15" + | "16" + | "17" + | "18" + | "19" + | "20" + | "21" + | "22" + | "23" + | "24" + | "25" + | "26" + | "27" + | "28" + | "29" + | "30" + | "31" +; + +## TODO: These rules need to be coordinated with the markup since that may +## change. + +export approximately = "[~~]"; + +## Rounded: say "approximately". + +approx1 = Optimize[ + "minutes:" + ("" : approximately) (minutes @ round_minutes) + "|" + "hours:" + hours + "|" + pass_anything] +; + +## Rounded to next hour. + +approx2 = Optimize[ + "minutes:" + ("" : approximately) round_minutes_next_hour + "|" + "hours:" + hours_shift + "|" + pass_anything] +; + +## Not rounded: don't say "approximately". + +approx3 = Optimize[ + "minutes:" + (minutes @ unrounded_minutes) + "|" + "hours:" + hours + "|" + pass_anything] +; + +export approx = Optimize[ + approx1 | approx2 | approx3 +]; + +# "|" and "\" are escaped in the new serialization scheme using a backslash, so +# we need to adjust these in the verbatim mappings. + +func EscapedMappings[raw_mappings] { + escapes = ("\\\\" : "\\") | ("\\|" : "|"); + return Optimize[ + ((Project[raw_mappings, 'input'] - Project[escapes, 'output']) | escapes) + @ raw_mappings + ]; +} + +# Allows verbatim grammars to be more permissive by accepting all inputs, it +# simply consumes the input if it is not present in the raw mappings. + +func ConsumeUnmapped[raw_mappings] { + unmapped = bytelib.kBytes - Project[raw_mappings, 'input']; + return Optimize[ + D[unmapped]<20> + ]; +} diff --git a/third_party/chinese_text_normalization/thrax/testcase_cn.txt b/third_party/chinese_text_normalization/thrax/testcase_cn.txt new file mode 100644 index 000000000..688456e86 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/testcase_cn.txt @@ -0,0 +1,54 @@ +一 +二 +三 +四 +五 +六 +七 +八 +九 +十 +十九 +二十 +二十八 +三十 +三十七 +四十 +四十六 +五十 +五十五 +六十 +六十四 +七十 +七十三 +八十 +八十二 +九十 +九十一 +一百 +一百零一 +一百一 +一百一十二 +两百二 +二百二十三 +三百三 +三百三十四 +四百四 +四百四十五 +一千五百五 +两千五百五十六 +三千六百六 +四千六百六十七 +五千七百七 +六千七百七十八 +七千八百八 +八千八百八十九 +九千九百九 +九千九百九十一 +二零一九年九月十二日 +两千零五年八月五号 +八五年二月二十七日 +公元一六三年 +零六年一月二号 +有百分之六十二的人认为 + diff --git a/third_party/chinese_text_normalization/thrax/testcase_en.txt b/third_party/chinese_text_normalization/thrax/testcase_en.txt new file mode 100644 index 000000000..b5c1312b3 --- /dev/null +++ b/third_party/chinese_text_normalization/thrax/testcase_en.txt @@ -0,0 +1,14 @@ +23,000 +1980 +8:35 +14.5 +1/4 +2/15 +5% +$10086 +www.google.com +5:50 a.m. +4:30 PM +www.interspeech.edu +www.iccasp.net +jiayu@gmail.com