
Noritaka OSAWA and Toshitsugu YUBA
Graduate School of Information Systems
The University of Electro-Communications
1-5-1 Chofugaoka, Chofu-shi, Tokyo 182, Japan
{osawa,yuba}@is.uec.ac.jp
This paper proposes and evaluates a character (symbol) code system called EPICS for the internationalization of the WWW. EPICS integrates a variable-length coding system using 16-bit units with a smart virtual machine that executes its input as instructions and is dynamically customizable. EPICS enhances the interchangeability of data. The variable-length coding system provides a huge code space, which can include not only standardized code sets but also user-specific codes. The smart virtual machine allows instructions to be defined and modified at run time. This customization makes it possible for a sender to express his intentions in data and for a receiver to process the data according to his needs. It also enables compressed data and decompression programs to be sent incrementally and efficiently, without predefined decompression algorithms. An English document encoded in EPICS is shorter than the same document in UCS-2, and a Japanese and English document encoded in EPICS is shorter than the same document in UTF-8.
Use of the World Wide Web (WWW) is becoming widespread. The WWW is used by people in many countries, and the number of WWW users is growing rapidly, so multilingual processing has become more important. In addition to scientists and engineers, many other people use it as a medium for exchanging information. Business users use the WWW not only on the Internet but also on intranets. On intranets, company-specific or personal symbols are needed in order to communicate efficiently, and it is desirable that those symbols can also be exchanged with people outside the intranet. Several problems must be solved before this is possible.
Unicode[16] and ISO 10646[6] are expected to promote the handling of the many characters that have been standardized. However, we think that static character code sets like Unicode are not sufficient for the internationalization of the WWW and for a multilingual WWW. Existing character code standards intentionally avoid specifying how private or personal characters or symbols are handled; they specify only the code regions reserved for private characters. Thus existing standards do not promote the international circulation of data in the humanities and in interdisciplinary studies, which use user-specific symbols, even though more and more researchers in those fields are using the WWW. Since standardization of user-specific symbols is impractical, a new framework is needed in which user-specific symbols can be processed and exchanged easily. The framework should not require centralized registration. We chose a method that decreases the possibility of overlapping code points by using a huge code space.
This paper proposes a dynamic symbol (character) code system capable of handling general symbols in addition to currently used characters. It is called EPICS (Efficient, Programmable and Interchangeable Code System). EPICS is a programmable, universal symbol code system that enables us to exchange data efficiently and flexibly. The programmability of EPICS enables us to exchange compressed WWW data without a special decompression program. It will be shown that EPICS can be more efficient than UCS-2 for English text and can be as efficient as UTF-8 for text that includes Japanese and English. Not only characters in plain text but also tags in rich text can be included in EPICS. In this paper, the terms "character" and "symbol" are used interchangeably.
EPICS is a symbol (or character) code system that integrates a variable-length (multi-byte) code system called EPIC (Extensible Process-Internal Code)[12], whose unit is 16 bits, and a smart virtual machine[14] called EpicVM.
EPIC was originally designed for use in an easy-to-use programming language that handles multilingual characters. When the interpreter for that language was developed, 16-bit wide characters were not yet popular, so EPIC was designed for internal use only. However, 16-bit wide characters are now becoming popular because of the wide character type (wchar_t) in the C programming language [7] and because of Unicode. Although a symbol in EPICS is a multi-unit character, the encoding design of symbols allows EPICS to be used efficiently not only as a code for exchange but also for internal processing.
EpicVM is a smart virtual machine whose instructions are dynamically customizable. When we proposed PivotVM[14], we classified it as a smart virtual machine. "Smart virtual machine" is a generic term and does not denote a specific virtual machine.
EPICS provides a framework where not only standardized character code sets but also symbols for research and user-specific symbols can be included without overlapping code points. Various types of symbol processing like sorting and searching can be done using a general software tool in the framework. For example, if one writes a text searching program for EPICS, the program can handle both standardized symbols and user-specific symbols. Special tools for ancient and user-specific symbols are not needed. EPICS reduces the work necessary for making software tools for symbol processing.
EPICS pays serious attention to both intentions of an information sender and requirements of a receiver. The sender can use arbitrary symbols and specify alternatives for these arbitrary symbols in EPICS. In other words, a sender can send his intentions to a receiver. The receiver can normalize data depending on his needs. The receiver may use alternative symbols that are specified by a sender, or may ignore the alternatives and map them to other symbols. We think normalization depends on the users' requirements. A single canonical mapping as in Unicode is not suitable in all situations.
EPICS allows a user to define a code sequence at a code point. When a symbol is input, the code sequence specified for it is invoked. For example, if a user specifies normalization of external user-specific symbols, the external symbols are converted to normalized symbols as they are input. Not only the mapping of one symbol to one symbol but also the mapping of one symbol to a string is possible. If a routine that generates a string is specified at a code point, this function naturally accomplishes the expansion of data compressed by dictionary-based coding such as the LZ78 algorithm[17]. Because EpicVM is a virtual machine, it can not only expand a symbol code into a string but also support more general programming. By utilizing EpicVM, symbol images, font images and so on can be defined and transferred.
The unit of EPICS is a 16-bit wide character. Wide characters in the C programming language and in Unicode are becoming more and more popular, so processing 16-bit characters is no longer a problem.
We refer to a 16-bit unit as an EPICU. The most significant bit of an EPICU is BIT 16 and the least significant bit is BIT 1. The two most significant bits of a unit indicate whether the unit is the head or the tail of a symbol. If BIT 16 of an EPICU is 0, the EPICU is the tail (last unit) of a symbol. An EPICU whose BIT 15 is 0 is the head (first unit) of a symbol. If both BIT 16 and BIT 15 of an EPICU are 0, the EPICU is a complete symbol by itself; if both are 1, it is an intermediate unit. This coding makes locating the boundaries of a symbol easy and efficient. We show the format of EPICU in Table 1. Table 2 shows character formats composed of between 1 and 3 units. Figure 1 also shows extension methods of EPICS.
| | MSB | | | | | | | | | | | | | | | LSB |
| BIT position | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 |
| Tail EPICU | 0 | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
| Head EPICU | X | 0 | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
Figure 1: Relationship between Most Significant Bits and Symbol Length.
If BIT 16 of an EPICU is 1, further units follow it in the same symbol. If BIT 16 is 0, there are no more units.
Locating the boundaries of a character is important in editor and viewer programs. In the multi-byte codes of ISO 2022, it may be impossible to tell from a byte alone whether it is the first or the last byte of a 2-byte code. In the worst case, one has to scan forward incrementally from a position that is known to be a boundary. In EPICS, a head unit, an intermediate unit and a tail unit can each be distinguished easily on the basis of the unit alone.
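The boundary rule can be illustrated with a minimal Python sketch. The function names (`unit_kind`, `symbol_start`, `split_symbols`) are hypothetical and are not part of EPICS; only the meaning of BIT 16 and BIT 15 is taken from the description above.

```python
from typing import List

BIT16 = 0x8000  # most significant bit: 1 means more units follow
BIT15 = 0x4000  # next bit: 0 means this unit starts a symbol


def unit_kind(unit: int) -> str:
    """Classify one 16-bit EPICU from its two most significant bits only."""
    more_follow = bool(unit & BIT16)
    starts_symbol = not (unit & BIT15)
    if starts_symbol:
        return "head" if more_follow else "single"
    return "intermediate" if more_follow else "tail"


def symbol_start(units: List[int], index: int) -> int:
    """Return the index of the first unit of the symbol containing units[index]."""
    while index > 0 and unit_kind(units[index]) in ("intermediate", "tail"):
        index -= 1
    return index


def split_symbols(units: List[int]) -> List[List[int]]:
    """Cut a unit sequence into symbols after every unit whose BIT 16 is 0."""
    symbols: List[List[int]] = []
    current: List[int] = []
    for unit in units:
        current.append(unit)
        if not unit & BIT16:
            symbols.append(current)
            current = []
    if current:
        raise ValueError("truncated symbol at end of input")
    return symbols
```

An editor that lands on an arbitrary unit can therefore step back with `symbol_start` by looking at each unit in isolation, without scanning from the beginning of the text.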
EPICS is also designed with string matching in mind. Existing string matching algorithms can be applied naturally to data encoded in EPICS when the 16-bit unit is used as the unit of comparison. Special handling that depends on the length of a code is not needed. Pattern matching using regular expressions can also be applied easily when 16-bit data is treated as one unit.
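As a small illustration, the following Python sketch searches a sequence of 16-bit units with an ordinary naive matcher. The function name `find_units` is hypothetical. The point is that no length-dependent logic is needed: a pattern made of whole symbols starts with a head (or single) unit and ends with a tail (or single) unit, so any match it finds necessarily starts and ends on symbol boundaries.

```python
from typing import List


def find_units(text: List[int], pattern: List[int]) -> int:
    """Return the unit index of the first occurrence of pattern in text, or -1.

    Both arguments are plain lists of 16-bit units; the matcher never looks
    at symbol lengths.
    """
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        if text[i:i + m] == pattern:
            return i
    return -1
```

Any faster unit-level algorithm (for example Knuth-Morris-Pratt or Boyer-Moore) could be substituted for the naive loop without changing this property.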
Some people who have made programs that handle ISO 2022 believe that the use of variable-length codes makes programming difficult. However, the main reason for the difficulty of handling ISO 2022 is not variable length but state management of ISO 2022 characters. Handling of ISO 2022 needs extra state management because a code point is multiplexed by different code sets. EPICS assigns different symbols to unique code points and thus does not require extra state management.
In the C++ programming language, a "smart pointer" [15] helps C/C++ programmers write programs that handle EPICS in the usual way. A smart pointer makes it possible to use EPICU in C++ much like the 'char' type in the C programming language. In our experience of using variable-length codes and smart pointers to build multilingual programming (script) language systems[12][13], handling EPICS through smart pointers is as easy as handling fixed-length codes. In languages that do not allow pointer arithmetic, such as the Java language[3], programmers do not need to be aware of the length of a character code.
Variable-length coding using 16-bit units makes a huge code space available. A huge code space makes overlapping of the code points of user-specific symbols less likely. Even if no central registry of symbols exists, the possibility of overlapping code points can be made sufficiently low by using a sufficiently long code value and an appropriate hashing function that determines the prefix part of a code value.
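One way such a hashed prefix could be derived is sketched below in Python. Everything specific here is an assumption made for illustration and is not defined by EPICS: the choice of SHA-256, the split of a 5-EPICU symbol into 42 hashed prefix bits and a 28-bit local number, the namespace string, and the function names. The sketch also ignores how head values are partitioned among subspaces.

```python
import hashlib
import math
from typing import List

PAYLOAD_BITS = 14                      # each EPICU carries 14 bits below BIT 16 and BIT 15
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def pack_symbol(payloads: List[int]) -> List[int]:
    """Wrap 14-bit payloads in head / intermediate / tail marker bits."""
    if len(payloads) < 2:
        raise ValueError("this sketch only builds multi-unit symbols")
    units = [0x8000 | payloads[0]]                  # head: BIT 16 = 1, BIT 15 = 0
    units += [0xC000 | p for p in payloads[1:-1]]   # intermediate: both bits 1
    units.append(0x4000 | payloads[-1])             # tail: BIT 16 = 0, BIT 15 = 1
    return units


def user_symbol(namespace: str, local_id: int) -> List[int]:
    """Build a 5-EPICU symbol: 42 hashed prefix bits plus a 28-bit local id (assumed layout)."""
    if not 0 <= local_id < 1 << (2 * PAYLOAD_BITS):
        raise ValueError("local_id must fit in 28 bits")
    digest = hashlib.sha256(namespace.encode("utf-8")).digest()
    prefix = int.from_bytes(digest[:6], "big") >> 6  # keep the top 42 of 48 bits
    value = (prefix << (2 * PAYLOAD_BITS)) | local_id
    payloads = [(value >> (PAYLOAD_BITS * i)) & PAYLOAD_MASK for i in range(4, -1, -1)]
    return pack_symbol(payloads)


def collision_probability(namespaces: int, prefix_bits: int = 42) -> float:
    """Birthday-bound estimate that at least two namespaces share a prefix."""
    pairs = namespaces * (namespaces - 1) / 2
    return -math.expm1(-pairs / 2 ** prefix_bits)
```

Under these assumptions, each user or organization hashes a name it already owns and numbers its own symbols locally, so no registry is consulted; `collision_probability` gives a rough idea of how the risk of two independent users sharing a prefix grows with their number.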
We do not think that surrogate characters in Unicode expand the code space sufficiently. The roughly one million code points provided by surrogate pairs are too few to keep user-specific symbols both non-overlapping and interchangeable without explicit coordination.
The symbol code space of EPICS can be divided into subspaces. There are standardized character set subspaces, EpicVM subspaces, user-specific subspaces and temporary-use subspaces. Symbol code values composed of one or two EPICUs are used for standardized characters and EpicVM instructions. 3-EPICU symbols are reserved for future standardized characters. Symbol code values composed of four or more EPICUs can be used for user-specific or temporary symbols. However, we recommend symbols of five or more EPICUs for user-specific symbols.
Following the Unicode standard, a Unicode character code value is written U+nnnn, where nnnn is a four-digit hexadecimal number. A symbol code value of EPICS is written as "P+" followed by four-digit hexadecimal numbers separated by dots. For example, an EPICS symbol composed of one EPICU is written P+nnnn, and a 2-EPICU symbol is written P+mmmm.nnnn.
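The notation is easy to produce and read mechanically. The following Python helpers are a minimal sketch; the names `format_epics` and `parse_epics` are hypothetical.

```python
from typing import List


def format_epics(units: List[int]) -> str:
    """Render a symbol (a list of 16-bit units) in P+ notation."""
    if not units:
        raise ValueError("a symbol has at least one unit")
    return "P+" + ".".join(f"{unit:04X}" for unit in units)


def parse_epics(text: str) -> List[int]:
    """Parse P+ notation back into a list of 16-bit units."""
    if not text.startswith("P+"):
        raise ValueError("EPICS code values start with 'P+'")
    groups = text[2:].split(".")
    if any(len(group) != 4 for group in groups):
        raise ValueError("each unit is written as four hexadecimal digits")
    return [int(group, 16) for group in groups]
```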
Some parts of EPICS are based on Unicode. The lower code values of Unicode are identical to the code values of EPICS, except for unified CJK (Chinese, Japanese and Korean) misc. characters. The relationship between Unicode and EPICS is shown in Table 3 and Figure 2. For example, the codes between U+0000 and U+2FFF correspond to the codes between P+0000 and P+2FFF respectively, and the code region between U+3000 and U+3FFF is mapped to the region between P+8000.7000 and P+8000.7FFF.
| Unicode range | EPICS range |
| U+0000 -> U+2FFF | P+0000 -> P+2FFF |
| U+3000 -> U+3FFF | P+8000.7000 -> P+8000.7FFF |
| U+4000 -> U+7FFF | P+8001.4000 -> P+8001.7FFF |
| U+8000 -> U+BFFF | P+8002.4000 -> P+8002.7FFF |
| U+C000 -> U+D7FF | P+8003.4000 -> P+8003.57FF |
| Surrogate Pairs | P+9800.4C00 -> P+9B00.4FFF |
| U+E000 -> U+FFFD | P+8003.6000 -> P+8003.7FFD |
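The two-unit rows of the table all fit one arithmetic pattern: the head unit is 0x8000 plus the top two bits of the Unicode value, and the tail unit is 0x4000 plus its low 14 bits. This pattern is an observation that reproduces the tabulated ranges, not a rule stated as such in the table. The Python sketch below implements it with hypothetical function names and deliberately leaves out the surrogate-pair row, whose rule cannot be read off the table.

```python
from typing import List


def unicode_to_epics(code_point: int) -> List[int]:
    """Map a 16-bit Unicode value to EPICS units following the tabulated ranges."""
    if not 0 <= code_point <= 0xFFFD:
        raise ValueError("outside the ranges covered by the table")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are mapped by a separate rule not sketched here")
    if code_point <= 0x2FFF:
        return [code_point]                      # identical 1-EPICU code
    head = 0x8000 | (code_point >> 14)           # P+8000 .. P+8003
    tail = 0x4000 | (code_point & 0x3FFF)        # low 14 bits in a tail unit
    return [head, tail]


def epics_to_unicode(units: List[int]) -> int:
    """Inverse of unicode_to_epics for the same ranges."""
    if len(units) == 1 and units[0] <= 0x2FFF:
        return units[0]
    if len(units) == 2 and 0x8000 <= units[0] <= 0x8003 and units[1] & 0xC000 == 0x4000:
        code_point = ((units[0] & 0x3FFF) << 14) | (units[1] & 0x3FFF)
        if 0x3000 <= code_point <= 0xFFFD and not 0xD800 <= code_point <= 0xDFFF:
            return code_point
    raise ValueError("not a Unicode-derived EPICS symbol covered by this sketch")
```

For example, HIRAGANA LETTER A (U+3042) becomes P+8000.7042 under this mapping.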
Character code sets registered at ECMA (European Computer Manufacturers'
Association) based on ISO 2022[5] are also mapped
into EPICS for compatibility. The value of a final character to designate
a coded character set is added to P+8100, and the result is used as the
prefix of a symbol. Examples are shown in Table 4.
Although ISO 2022 based characters can be included in EPICS strings, we
recommend the use of mapped versions of Unicode characters instead of mapped
versions of ISO 2022 based characters unless special intentions are involved.
| ISO 2022 | | EPICS |
| Character Set | Final Character | Prefix |
| JIS X 0208 | 4/2 | P+8142 |
| CNS 11643-1 | 4/7 | P+8147 |
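The prefix computation can be written out as a short Python sketch. The function name `iso2022_prefix` is hypothetical; the column/row pair is the usual ISO 2022 way of writing a byte, so 4/2 denotes the byte 0x42.

```python
def iso2022_prefix(column: int, row: int) -> int:
    """Return the EPICS prefix unit for an ISO 2022 final character given as column/row."""
    if not (0 <= column <= 7 and 0 <= row <= 15):
        raise ValueError("expected a 7-bit code position written as column/row")
    final_byte = (column << 4) | row   # 4/2 -> 0x42
    return 0x8100 + final_byte         # final character value added to P+8100
```

For the two rows of the table, `iso2022_prefix(4, 2)` yields 0x8142 and `iso2022_prefix(4, 7)` yields 0x8147.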
The code region between P+3000 and P+3FFF is used and reserved for EpicVM instructions and integer representation. EpicVM will be described in the next section.
The code region between P+3000 and P+3CFF is available for user-defined EpicVM instructions. A code point in any other unused region can also be used for a user-defined EpicVM instruction, but unassigned code points for 1-EPICU symbols exist only in this region. The code region between P+3D00 and P+3DFF is used for exception handlers. The code region between P+3E00 and P+3EFF is used for predefined EpicVM instructions.
The code region between P+3F00 and P+3FFF represents the range of integers
between -128 and 127. Integer representation can be extended to hold a
larger value based on Table 5 and Table
6.
| Integer | EPICS range |
| 8-bit signed integer (8 bits) | P+3F00 -> P+3FFF |
| 22-bit signed integer (8+14 bits) | P+BF00.4000 -> P+BFFF.7FFF |
| 36-bit signed integer (8+14+14 bits) | P+BF00.C000.4000 -> P+BFFF.FFFF.7FFF |
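The integer formats can be sketched in Python as follows. The table fixes the code ranges and the bit widths (8, 8+14 and 8+14+14 bits), and the text gives -128 to 127 for the region between P+3F00 and P+3FFF. The two's-complement representation and the placement of the most significant 8 bits in the first unit are assumptions of this sketch, as are the function names.

```python
from typing import List


def encode_int(value: int) -> List[int]:
    """Encode a signed integer in the shortest of the three tabulated formats."""
    if -(1 << 7) <= value < (1 << 7):
        return [0x3F00 | (value & 0xFF)]
    if -(1 << 21) <= value < (1 << 21):
        raw = value & ((1 << 22) - 1)              # assumed two's complement, 22 bits
        return [0xBF00 | (raw >> 14), 0x4000 | (raw & 0x3FFF)]
    if -(1 << 35) <= value < (1 << 35):
        raw = value & ((1 << 36) - 1)              # assumed two's complement, 36 bits
        return [0xBF00 | (raw >> 28),
                0xC000 | ((raw >> 14) & 0x3FFF),
                0x4000 | (raw & 0x3FFF)]
    raise ValueError("value does not fit in 36 bits")


def decode_int(units: List[int]) -> int:
    """Decode a symbol produced by encode_int."""
    expected_head = 0x3F00 if len(units) == 1 else 0xBF00
    if not 1 <= len(units) <= 3 or units[0] & 0xFF00 != expected_head:
        raise ValueError("not an integer symbol covered by this sketch")
    raw = units[0] & 0xFF
    for unit in units[1:]:
        raw = (raw << 14) | (unit & 0x3FFF)
    bits = 8 + 14 * (len(units) - 1)
    if raw >= 1 << (bits - 1):                     # sign bit set
        raw -= 1 << bits
    return raw
```

Note that the multi-unit head P+BFxx is the single-unit code P+3Fxx with BIT 16 set, which matches the extension rule that BIT 16 = 1 means further units follow.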
EpicVM is a smart virtual machine and is also a stack-based virtual machine. It is a new type of virtual machine. EpicVM decodes an input symbol as an instruction and executes it. EpicVM allows its instructions to be defined or modified at run time using instructions that have already been defined. In contrast, conventional virtual machines such as the Smalltalk bytecode machine[2] and the Java virtual machine[8] have a fixed instruction set and do not allow instructions to be changed dynamically.
The internal structure of EpicVM is shown in Figure 3. EpicVM has a small number of registers: an input code register, an output code register, a stack pointer, a frame pointer and a current offset pointer. EpicVM has a data stack that a program manipulates. A unit on the stack is a symbol, whose length is variable. This differs from conventional stack-based machines.
Each code point has a maximum of 128 attributes. Each attribute can contain a symbol or a code sequence (a routine). Attribute 0 of a symbol is usually used to store the code sequence to be invoked when the symbol is input.
EpicVM allows one to define a sequence of program codes at a code point. Jumps in the sequence are restricted to relative jumps. Absolute jumps cannot be made on EpicVM. The range of a relative jump must be within the defined sequence. If the target address of a jump is out of range, an exception is raised. An exception causes a corresponding exception handler to be invoked. An exception handler is defined at a fixed code point. A user can define the exception handler. Codes in a defined sequence may be instructions. In other words, instructions at a code point can call already defined instructions. This makes it possible to invoke instructions as functions or procedures without absolute jumps. When an instruction is invoked, registers are saved on a system stack. Saved values are restored to the registers when control returns from the instruction.
Most EpicVM instructions are common to stack-based virtual machines such as the Smalltalk-80 bytecode machine[2] or the Java virtual machine[8]. However, the instructions that define or modify an instruction or an attribute are specific to a smart virtual machine like EpicVM. The basic instructions include add, sub, compare, branch, push-in, push-sp, push-fp, put, get, define and so on. Add, sub and compare perform addition, subtraction and comparison of two values on the stack respectively. Branch is a relative-jump instruction. Push-in, push-sp and push-fp push the value of the input register, the stack pointer and the frame pointer onto the stack respectively. Put and get store and fetch an attribute at a code point respectively. Define defines a new instruction. The general format for defining a new symbol or instruction is as follows.
define <symbol-code-value> <length-in-byte> <code-string>
Let us define a string "EpicVM" at P+3120. The code sequence that defines the string is shown in Table 7. When P+3120 is input after this definition, the code P+3120 is expanded to "EpicVM".
When an input symbol is not defined as an instruction, a default handler is invoked conceptually. A default handler is defined at a fixed code point (P+3DFF). In plain EpicVM, code sequences are not defined at code points except for EpicVM instructions and integer representations. The default handler simply passes the input symbol to the output. Conceptually the default handler contains the following code sequence.
push-in pop-out
The sequence pushes the input symbol onto the stack and pops the stack top to the output. In an actual implementation, this code sequence does not need to be executed. If EpicVM knows that the default handler is unchanged and that no instruction sequence is defined at the code point of the input symbol, it may simply output the input symbol. In other words, the only overhead of default processing is checking whether the input symbol is defined. This overhead is very low because the check can be performed by hashing, that is, in O(1) expected time. EpicVM therefore does not slow down the processing of ordinary symbols at a client.
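The dispatch behaviour described here can be mimicked by a toy Python model. It is not EpicVM: it has no stack, registers, attributes or instruction set, and the class name `TinyExpander` and its methods are hypothetical. It only shows a definition table looked up by hashing, with undefined symbols passed straight to the output as the conceptual default handler would do.

```python
from typing import Dict, List, Tuple

Symbol = Tuple[int, ...]  # one symbol = a tuple of 16-bit units


class TinyExpander:
    """Toy stand-in for symbol dispatch: defined symbols expand, others pass through."""

    def __init__(self) -> None:
        self.definitions: Dict[Symbol, List[Symbol]] = {}
        self.output: List[Symbol] = []

    def define(self, code_point: Symbol, sequence: List[Symbol]) -> None:
        """Attach a sequence of symbols to a code point (a much reduced 'define')."""
        self.definitions[code_point] = list(sequence)

    def feed(self, symbol: Symbol, depth: int = 0) -> None:
        """Process one input symbol."""
        if depth > 64:
            raise RecursionError("definition nesting too deep")
        sequence = self.definitions.get(symbol)  # hash lookup, O(1) expected
        if sequence is None:
            self.output.append(symbol)           # default handler: push-in, pop-out
            return
        for inner in sequence:                   # a definition may use earlier definitions
            self.feed(inner, depth + 1)


vm = TinyExpander()
vm.define((0x3120,), [(ord(ch),) for ch in "EpicVM"])
for symbol in [(ord("<"),), (0x3120,), (ord(">"),)]:
    vm.feed(symbol)
expanded = "".join(chr(sym[0]) for sym in vm.output)  # "<EpicVM>"
```

In this model an ordinary symbol costs one dictionary lookup, which corresponds to the low overhead claimed for default processing.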
Use of variable-length codes may make the number of bytes per symbol longer. Under such conditions, data compression by defining codes in EPICS increases the density of data. A sender can choose an appropriate algorithm for data contents if the sender sends a decompression program with compressed data. For example, a sender can send a decompression program like LZ78[17] at the head of data and follow it with compressed data.
It is also possible for a sender to send program fragments and compressed data that uses the defined codes gradually, and for a receiver to expand the compressed data gradually. This method requires code definitions to be sent explicitly, so its compression ratio may be worse than that of LZ78 with a decompression program already installed on the receiver side. However, with this method one can choose an algorithm suited to the data. The sender does not need to send a whole program at the head of the transmission, but must send each code definition just before the code is first invoked. This method reduces the latency of recovering symbols from compressed data on the stream-type connections that WWW protocols usually use. Moreover, when a transmission is aborted, this method reduces the transfer of unused parts of a decompression program.
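The incremental scheme can be sketched in Python as follows. This is not the authors' prototype: it is a simple dictionary coder in the LZ78 family, and the record format (`"def"` and `"sym"` tuples), the use of codes starting at 0x3000, and the function names are all assumptions made for illustration. Each definition is emitted only immediately before its first use, and definitions that are never used are never sent.

```python
from typing import Dict, List, Set, Tuple, Union

Token = Union[str, int]        # str = literal character, int = defined code
Record = Tuple                 # ("def", code, body) or ("sym", token)

FIRST_CODE, LAST_CODE = 0x3000, 0x3CFF  # user-definable 1-EPICU region


def compress(text: str) -> List[Record]:
    """Emit literals, code uses, and just-in-time code definitions."""
    phrases: Dict[str, int] = {}          # known phrase -> code
    bodies: Dict[int, List[Token]] = {}   # code -> [prefix token, extending character]
    sent: Set[int] = set()
    stream: List[Record] = []
    next_code = FIRST_CODE

    def send_definition(code: int) -> None:
        if code in sent:
            return
        for token in bodies[code]:
            if isinstance(token, int):
                send_definition(token)    # a body may rely on earlier codes
        stream.append(("def", code, bodies[code]))
        sent.add(code)

    i = 0
    while i < len(text):
        end = i + 1
        while end < len(text) and text[i:end + 1] in phrases:
            end += 1                      # extend to the longest known phrase
        phrase = text[i:end]
        token: Token = phrases.get(phrase, phrase)
        if isinstance(token, int):
            send_definition(token)        # definition goes out just before use
        stream.append(("sym", token))
        if end < len(text) and next_code <= LAST_CODE:
            phrases[phrase + text[end]] = next_code
            bodies[next_code] = [token, text[end]]
            next_code += 1
        i = end
    return stream


def expand(stream: List[Record]) -> str:
    """Receiver side: store definitions as they arrive and expand codes on use."""
    bodies: Dict[int, List[Token]] = {}
    out: List[str] = []

    def emit(token: Token) -> None:
        if isinstance(token, str):
            out.append(token)
        else:
            for inner in bodies[token]:
                emit(inner)

    for record in stream:
        if record[0] == "def":
            bodies[record[1]] = record[2]
        else:
            emit(record[1])
    return "".join(out)
```

Because the receiver only needs the definitions that have already arrived, it can output text as soon as each record is received, and an aborted transfer wastes no bandwidth on definitions for codes that were never used.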
We have made a prototype program which compresses data written in EPICS,
and produces compressed data and incremental decompression routines in
EPICS. A draft (epics.txt) of this paper
written in English and HTML, and a manuscript about PivotVM[14]
(pivot-vm-j.txt) that includes Japanese
and English are used as sample texts. Table 8 shows
the length of compressed text in EPICS and the length of the text in other
formats. The length of epics.txt
encoded in EPICS is shorter than the length of the text encoded in UCS-2.
The length of pivot-vm-j.txt in
EPICS is shorter than the length in UTF-8. Although EPICS supports a huge
code space, EPICS is efficient. It is possible to exchange encoded data
efficiently without special decompression programs. Our compression program
is a prototype. We think the compression ratio could be improved if the
compression program is better tuned.
It is difficult to standardize ancient characters that are not used in daily life but are still being studied. Examples of ancient characters are the hieroglyphs of Egypt and the pictographs of China. If researchers have different opinions about the identities of symbols, standardization is impossible or at least difficult. If most researchers come to agree with each other in the future, the ancient symbols will be standardized. However, researchers cannot wait for full standardization. EPICS allows researchers who have different opinions about the identification of symbols to assign symbols to different code points and proceed with their studies. Once standardization has been completed, an EpicVM in EPICS can be customized to map the old code points to the standardized ones. Data encoded in EPICS needs neither special conversion software nor special searching software.
Unicode uses combining characters. This is partly because the code space of Unicode is too small. If every combination were defined as a separate character, they would not fit in the 16-bit code space. Therefore Unicode provides only an incomplete repertoire of composite characters. In principle the code space of EPICS is unbounded, although a practical limit should be imposed. Every combination of combining characters can be assigned to a different code point in EPICS. When composite characters are used, locating the boundaries of characters becomes simple. Moreover, rendering a composite character is more systematic than rendering combined characters.
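The incompleteness of the composite repertoire in Unicode can be observed with the normalization function of the Python standard library. The sketch below is an illustration only and does not involve EPICS; the helper name compose is an assumption. It composes two combining sequences: one has a precomposed form in Unicode, and the other does not.

```python
import unicodedata


def compose(text: str) -> str:
    """Return `text` in Unicode Normalization Form C (canonical composition)."""
    return unicodedata.normalize("NFC", text)


e_acute = "e\u0301"     # 'e' followed by COMBINING ACUTE ACCENT
q_dot = "q\u0323"       # 'q' followed by COMBINING DOT BELOW

composed_e = compose(e_acute)   # a single precomposed code point exists
composed_q = compose(q_dot)     # no precomposed form: stays as two code points

# Splitting between code points cuts the second character in half,
# which is the boundary problem that composite code points avoid.
broken = composed_q[:1]
```

The first sequence becomes one code point, so its boundary is trivial to locate. The second remains a base character plus a combining mark, and software must inspect the following code point to find where the character ends.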
Documents usually consist not only of characters but also of language tags and formatting tags. All of them should be internationalized. Standard Generalized Markup Language (SGML) [4] and Hypertext Markup Language (HTML) [1] are examples of the use of tags or markup in fancy text. However, tags in HTML are based on English words. We do not think that the internationalization of SGML and HTML is sufficient.
Because the code space of EPICS is huge, we can assign an EPICS symbol code value to every tag that is composed of characters in a markup language. Tags can then be internationalized using an EpicVM that maps the symbols to character strings in a user's native language. Tags in a binary representation together with a special transformation program could accomplish the same internationalization. However, EPICS accomplishes the internationalization of tags in the same framework as that of characters. Since the code space of Unicode is not huge, it is difficult to treat the many tags of various markup languages and software as characters in Unicode. String processing such as string matching therefore needs special handling of tags, which complicates a document processing system for internationalization. EPICS simplifies the processing system.
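The idea of treating tags as symbols and mapping them to strings in a user's language can be sketched in Python as follows. The symbol codes, the table TAG_NAMES and the function render are hypothetical and are not part of EPICS; in EPICS the mapping would be performed by EpicVM.

```python
from typing import Dict, List

# Hypothetical symbol codes for tags, placed above the Unicode range.
TAG_TITLE = 0x1_0000_0001
TAG_PARA = 0x1_0000_0002

# Hypothetical per-language names for each tag symbol.
TAG_NAMES: Dict[str, Dict[int, str]] = {
    "en": {TAG_TITLE: "title", TAG_PARA: "paragraph"},
    "ja": {TAG_TITLE: "題名", TAG_PARA: "段落"},
}


def render(symbols: List[int], lang: str) -> str:
    """Show a symbol sequence, spelling each tag in the reader's language."""
    names = TAG_NAMES[lang]
    out = []
    for s in symbols:
        if s in names:
            out.append("<" + names[s] + ">")   # a tag symbol
        else:
            out.append(chr(s))                 # an ordinary character
    return "".join(out)


doc = [TAG_TITLE] + [ord(c) for c in "EPICS"]
english_view = render(doc, "en")
japanese_view = render(doc, "ja")

# A tag is one symbol, so matching needs no special parsing of markup,
# and the word "title" written as plain characters is never mistaken for it.
plain_word = [ord(c) for c in "title"]
```

The stored data is identical for both readers; only the presentation of the tag differs.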
Unicode is important as a base for multilingual processing. However, the encoding of Unicode is static and inflexible. Character sets based on ISO 2022 are specified by each nation's government. Therefore, the separation between languages and characters is insufficient, and an identical character can have different code points in different code sets. ISO 2022 uses a small code space and switches the code sets mapped to that space. This complicates the state management of characters, so ISO 2022 is unsuitable for the internal processing of characters. ISO 2022 was standardized when computer resources were limited, and designation and invocation are its main controls. We do not need to restrict the control capabilities of a character code system, because computational power and memory capacity have increased. We think a simple and smart virtual machine should be included in a character code system standard.
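The state dependence of ISO 2022 mentioned above can be observed with the ISO-2022-JP codec in the Python standard library. The sketch below is an illustration only; the function name interpret is an assumption. The same two bytes denote two ASCII characters or one Japanese character, depending on which code set has been designated by a preceding escape sequence.

```python
TO_JIS_X_0208 = b"\x1b$B"   # escape sequence designating JIS X 0208
TO_ASCII = b"\x1b(B"        # escape sequence designating ASCII


def interpret(two_bytes: bytes, in_jis_x_0208: bool) -> str:
    """Decode the same bytes in the ASCII state or the JIS X 0208 state."""
    if in_jis_x_0208:
        return (TO_JIS_X_0208 + two_bytes + TO_ASCII).decode("iso2022_jp")
    return two_bytes.decode("iso2022_jp")


payload = b'$"'
as_ascii = interpret(payload, in_jis_x_0208=False)
as_kanji_set = interpret(payload, in_jis_x_0208=True)

# Encoding one hiragana character shows the designations around the payload.
encoded = "あ".encode("iso2022_jp")
```

A program that processes such data must carry the current designation along with every byte position, which is the state management burden referred to in the text.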
Arena i18n [9][10] uses fixed-length internal codes whose unit is 4 bytes. 4-byte fixed-length codes are easy to handle within a program. However, when codes are exported from the program, this encoding is inefficient and code conversion is probably needed.
The internal codes of Mule (MULtilingual Enhancement to GNU Emacs) [11] are mainly based on ISO 2022. Thus the separation between languages and characters is insufficient. The length of a code is variable and the unit is a byte. A character code in a character set is prefixed with information identifying the character set. With this type of encoding, existing matching algorithms designed for fixed-width encodings cannot simply be applied with a byte as the unit, because they do not recognize the boundaries of variable-length characters properly.
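The boundary problem of byte-unit matching can be demonstrated with a small Python sketch. EUC-JP is used here only as a familiar variable-length byte encoding that is available in the Python standard library; it is not the internal code of Mule, and the function names are assumptions. The pattern is a two-byte character whose bytes happen to equal the second byte of one character followed by the first byte of the next.

```python
def byte_search(haystack: bytes, needle: bytes) -> int:
    """Naive matching with a byte as the unit; returns a byte offset or -1."""
    return haystack.find(needle)


def char_search(haystack: bytes, needle: bytes, encoding: str = "euc_jp") -> int:
    """Matching on decoded characters; returns a character index or -1."""
    return haystack.decode(encoding).find(needle.decode(encoding))


text = "あい".encode("euc_jp")   # two characters, two bytes each
needle = b"\xa2\xa4"             # some other two-byte character

false_hit = byte_search(text, needle)   # matches across a character boundary
true_result = char_search(text, needle)  # the character does not occur
```

The byte-unit search reports a match that straddles the two characters, whereas the character-aware search correctly reports that the pattern is absent.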
In this paper, we have presented a new symbol code system called EPICS. We think that EPICS promotes efficient internationalization and multilingualism of the WWW without imposing fixed character sets on people. Moreover, EPICS makes compressed data transfer possible without installing special decompression programs on clients. EPICS is derived from a unique combination of a variable-length coding system and a smart virtual machine, EpicVM.
In EPICS, the variable-length coding system makes it possible to include the various characters needed for internationalization efficiently. The huge code space of EPICS allows one to use and exchange user-specific symbols with little possibility of overlapping code points, even without coordination. EpicVM allows one to send not only static characters but also dynamic programs. This programmability enables one to send compressed data together with a decompression program incrementally and efficiently. Compression reduces the amount of network traffic and the storage overhead on the WWW.