Mar 5, 2008

Baraha to UTF8 (Kannada) pre-release

There is an incomplete preview of my Baraha to UTF8 converter available at http://www.fontmatrix.net/kamathln/brh2utf8.tbz . I am grateful to Pierre Marchand of Fontmatrix for providing me the space. My current logic is somewhat too generic and I am running into a lot of trouble with it.
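
For illustration, a match-and-replace converter of the kind described here could look roughly like the sketch below. It is not the code in the tarball: the function name `convert` and the glyph table are made up, and the ASCII keys are placeholders rather than real Baraha glyph codes (only the Kannada code points on the right are real).

```python
# Hypothetical glyph table: the keys are invented ASCII codes, NOT the real
# Baraha glyph codes. The values are real Kannada code points.
GLYPH_TO_UNICODE = {
    "k": "\u0c95",         # KANNADA LETTER KA
    "kA": "\u0c95\u0cbe",  # KA + VOWEL SIGN AA
    "g": "\u0c97",         # KANNADA LETTER GA
    "A": "\u0cbe",         # VOWEL SIGN AA on its own
}


def convert(text, table=GLYPH_TO_UNICODE):
    """Greedy longest-match replacement; unknown characters pass through."""
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible key first, then shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No key matched at this position: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

A table-driven pass like this only maps glyph codes to code points in the order they appear; it does nothing about the order Unicode expects, which is one way such a converter can end up being "too generic".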

A session on #fonts opened my eyes to a good number of resources on the internet about Indic text processing, especially in Unicode. I think I will be spending a lot of time on them over the next few days.


I will paste interesting parts of the chat transcripts down here (resource links are available at the end if you are too busy to read the transcripts):


kamathln: Anyone from India ?
kamathln: I want to know a good resource for parsing indic UTF8
Popolon: kamathln, fontmatrix ?
kamathln Popolon: fontmatrix ?
Popolon can be used without gnome
kamathln what does that do ?
Popolon fontmatrix is a font manager
kamathln ok.. does it have a good parser ?
Popolon it works very well with every Indian font I have installed
Popolon It classifies fonts
Popolon write them well
Popolon which kind of parser ?
kamathln Popolon: your knowledge of indic scripts ?
kamathln Popolon: can you comprehend the words "half vowels" and "half characters" "ottaksharas" ?
kamathln that kind of parsing .. :-P
Popolon limited, I know that's alpha-syllabic
Popolon scriptures
pierremarc Depending of what you want to do, look at Harfbuzz, ICU or m17n
pierremarc All deal with shaping non-latin scripts
kamathln pierremarc: already looking at ICU .. but I guess m17n might have what i want .. dont know what harfbuzz ..will check .. thx so much for your input
pierremarc kamathln: Can we know what you are working on?
pierremarc TBH, I'm in the process of preparing some Indic support in Fontmatrix
kamathln there is an old kannada encoding format "baraha". I am trying to convert from it to unicode..
pierremarc http://www.infitt.org/ti2002/papers/64RAGHUV.PDF - I have not read it yet but it could be of some interest
kamathln i have written a simple match and replace based converter.. but when there are multiple half consonants or half vowels attached to a main full consonant, the utf8 comes out wrong .. and i am trying to fix it using a utf8 parser
kamathln http://www.baraha.com/web_docs/glyph_codes_kan.htm
pierremarc Is it the rendering of the resulting UTF string that is wrong or the string itself?
kamathln is the baraha glyph map
kamathln unicode.org/charts/PDF/U0C80.pdf
kamathln is the unicode..
kamathln they dont match one to one
* kamathln checks http://www.infitt.org/ti2002/papers/64RAGHUV.PDF
kamathln pierremarc: string itself comes out wrong
pierremarc And what results give regular tools such as iconv?
kamathln pierremarc: iconv is for transliteration right ?
pierremarc encoding conversions
kamathln !
kamathln must check..
kamathln never knew about it..
* kamathln bangs head on the wall
pierremarc But since "baraha" seems to be a private mapping over ASCII, it is possible that iconv can't do anything for you
kamathln pierremarc: guessed as much .. but still i should have known abt iconv, which i didnt
pierremarc Here "iconv -l | grep annada" gives nothing :(
kamathln pierremarc: okie then.. my hard work is not a waste ..
kamathln i will put up my converter up somewhere and get back here ..
kamathln it will be under GPL
kamathln pierremarc: do you have indic fonts installed ?
pierremarc Yes, I must have ones :)
kamathln ????????????
kamathln pierremarc: cool.. did you see a character with dotted circle ?
pierremarc nope
kamathln it must be the second character..
kamathln ?
kamathln this one ..
pierremarc But I can try with another font
kamathln okies.. no problem .. the point i am trying to make is my app is not generating the correct output
kamathln pierremarc: my main logic is quite straightforward ..you may find it useful
pierremarc Have you read the paper written by Y. Haralambous about his work on Indica?
pierremarc It's my starting point for my own work in FM
kamathln pierremarc: no.. iam kinda noobie in this ..
pierremarc Could you mail it to me and I push it on fontmatrix.net in a private area?
kamathln oh sure!
pierremarc pierremarc at oep-h.com
pierremarc kamathln: http://omega.enstb.org/yannis/pdf/indica-santa94.pdf
pierremarc http://www.w3.org/2002/Talks/09-ri-indic/indic-paper.html
pierremarc http://www.fontmatrix.net/kamathin/
pierremarc oops, let me change the "i" into an "l"
kamathln pierremarc: thx :-) .. I owe you
pierremarc http://www.fontmatrix.net/kamathln/
kamathln wow.. that indic paper was really useful.. I think I am gonna do a lot of research and coding in the next few days :-0
pierremarc kamathln: If at some point you want to participate in this part of FM, you'll be very welcome
kamathln pierremarc: thx .. i think i will.. though not immediately .. currently busy searching for a job
kamathln pierremarc: fontmatrix is something i have been longing for .. God bless you..
Popolon kamathln, scim too ?
Popolon it includes uim + m17n interfaces as sub methods
pierremarc I got to go now. We're all a gang hanging out on the #fontmatrix channel, don't hesitate to come
kamathln yeah.. but scim is a monster for a newbie like me ..
Popolon I don't know if it contains itself indian methods
Popolon ok
Popolon I use it for chinese typing
kamathln Popolon: it does .. thats how i wrote that kannada word if you scroll up a little :-)
Popolon as it's definitely the best in this domain
Popolon sorry I was at phone
kamathln Popolon: yes.. and yudit too!
kamathln Popolon: you dont need to..
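
The problem described in the chat (wrong output when ottaksharas are attached to a full consonant) is typical of glyph-ordered encodings: the vowel sign is stored right after the base consonant, while Unicode wants it after the whole conjunct. A minimal sketch of a fix-up pass follows, assuming the naive replace step has already emitted each ottakshara as virama + consonant. The function name is made up for illustration, and the real Baraha data may need more rules than this.

```python
import re

CONSONANT = "[\u0c95-\u0cb9]"   # KANNADA LETTER KA .. HA
VOWEL_SIGN = "[\u0cbe-\u0ccc]"  # VOWEL SIGN AA .. AU
VIRAMA = "\u0ccd"               # KANNADA SIGN VIRAMA (halant)

# A consonant, a vowel sign, then one or more (virama + consonant) pairs.
MISPLACED = re.compile(
    f"({CONSONANT})({VOWEL_SIGN})((?:{VIRAMA}{CONSONANT})+)"
)


def reorder_vowel_signs(text):
    """Move a vowel sign to the end of the conjunct it belongs to."""
    return MISPLACED.sub(r"\1\3\2", text)
```

For example, KA + vowel sign I + virama + KA would be rewritten as KA + virama + KA + vowel sign I, which is the order the Unicode chart expects for "kki".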



Then the discussion went a little offtopic and almost came back .. but it was fun :-)



Popolon pierremarc: doesn't render in my xchat display window, but in my xchat input text window
Popolon kamathln, that's used in srilanka tamoul ?
Popolon isn't it ?
Popolon I used yudit too before scim :)
kamathln srilana uses a bit of tamil
kamathln SriLanka*
Popolon it's fun for this handwritten hanzi recognition
kamathln Kannada is used in karnataka .. a south west part of India .. but not as south as Kerala
Popolon because at a shop, in my street, there is a tamoul and a srilanki they both speak tamoul
Popolon ah ok
Popolon I mixed the names because of the K...a.a :)
Popolon Don't know well the names of the states of india
Popolon The tamoul is from Pondicherry
kamathln wikipedia.org/wiki/kannada
kamathln it is Tamil
Popolon sorry Tamoul is in french
kamathln you mean you call Tamil "Tamoul" in french ?
kamathln cool :-)
Popolon yes
kamathln there is a state called "Tamil Nadu" in south east of India.. If you remember the map of India, the south of india is an upside down triangle surrounded by sea..
kamathln well.. i think it is better to look at a map than me explain ;-)
Popolon yes
Popolon I remember the shape of the india, look for maps in wikipedia
Popolon I know most of chinese province, next to learn india states :)
kamathln Popolon: cool.. u r a GK guy..
Popolon GK ?
kamathln General Knowledge
Popolon Puducherry => Tamil Nadu
Popolon ok
kamathln yes.. Pondicherry is in Tamil Nadu
Popolon Not really, In fact, I like, Indian & chinese culture :)
Popolon http://en.wikipedia.org/wiki/Geography_of_India#Political_geography
Popolon exactly what I need
kamathln http://upload.wikimedia.org/wikipedia/commons/b/bd/India-states-numbered.svg
kamathln is what I was abt to offer
kamathln which was wrong anyways.. :-P
Popolon http://en.wikipedia.org/wiki/Image:India-states-numbered.svg
Popolon hihi , that's the same file
Popolon wrong ? There are errors ?
kamathln 12 is karnataka
kamathln no.. i cant see the names of the states .. only numbers
Popolon http://commons.wikimedia.org/wiki/Category:Maps_of_India
kamathln 24 is Tamil Nadu
Popolon there are names in my link (en.wiki)
Popolon as this is SVG it's easy to add names in several languages
Popolon I worked at grouping svg files in SVG category tree on commons few month ago
Popolon will have to add
kamathln Popolon: cool
Popolon it's possible to add 'translation of this map' in the file page itself
Popolon this is the case in some biological svg
kamathln oh nice!
kamathln biological ? like "parts of the body " file ?
Popolon http://commons.wikimedia.org/wiki/Category:SVG
Popolon yes
Popolon parts of body
Popolon animals
Popolon etc...
kamathln oh nice ...
kamathln interesting ..
Popolon http://commons.wikimedia.org/wiki/Category:SVG_%E2%80%94_anatomy
Popolon there are numbered SVG
Popolon empty
Popolon and with translations
kamathln cool!
Popolon http://commons.wikimedia.org/wiki/Image:Baleen_Whale_Physical_Characteristics.svg
Popolon here an example, how to do other translations of this picture on the page
Popolon you can download SVG
kamathln nice concept .. might be useful somewhere else
Popolon edit it with inkscape (free software)
Popolon http://inkscape.org
Popolon and upload the translated one
kamathln i know of inkscape..
Popolon cool
kamathln used it to create a few icons for my prev company
kamathln Popolon: i was wondering if we can create a new xml based font standard ..
Popolon kamathln, there is already SVG fonts :)
kamathln !
kamathln cool.. checking out
Popolon http://www.w3.org/TR/SVG11/fonts.html
kamathln already on it
kamathln :-)
Popolon they can use several layers
kamathln WoW
Popolon allowing multi-colored fonts (as in type 3)
Popolon this could be interesting for indian languages where some shapes are reused ?
Popolon in chinese there are about 50 basic characters
* eimai [n=eimai@49.0-200-80.adsl-dyn.isp.belgacom.be] entered the room.
Popolon the (about 50000 ???) other are combination of these few
Popolon combining basic element with transformation matrix could save lot of memory :)
kamathln Popolon: abt Indian fonts : You have a point..
kamathln like ? and ?
kamathln and ?
kamathln all are similar
kamathln and even ? and ?
Popolon yes in french we have accents eéè
Popolon uüûù
Popolon about the same
kamathln ?????!
kamathln whups
kamathln yeah!
Popolon :)
Popolon could be 5000 not 50000 characters in chinese :)
Popolon 85 000 in some dictionaries :)
Popolon but I think it's not 85 000 characters, but 85 000 words
kamathln Popolon: was actually wondering if "soundfonts" too could be encoded in a similar way for scripts like Indian where there is a one to one match between syllables and chars
Popolon (sometimes combinations of 2 to 4 characters)
kamathln Popolon: Chinese seems to be a very complex language..
Popolon no, it's really easy language
kamathln Popolon: I wonder how they built computers at all !
kamathln Popolon: oh!
Popolon but writing is a little bit complex
kamathln Popolon: oh
kamathln lot of things happened today for me .. will be saving my chat log and blogging abt it..
Popolon there is no grammatical conjugation in chinese for example
kamathln Popolon: then ? how to perceive what is being spoken ?
Popolon the verbs are often composed of 2 simple verbs, often like in english
kamathln Popolon: ok
Popolon go in, go out, come in, come out, are the same in chinese :)
kamathln you mean one word for all of it ?
Popolon the guy that created esperanto used a lot of chinese grammar because of its simplicity
Popolon no like in english
kamathln ah! ok
Popolon 2 simple verbs to create a verb defining accent
Popolon action not accent :)
kamathln ah! simple once you catch the knack
* rahul_b left the room (quit: "Leaving(?????? ????)").
Popolon yes
kamathln hey! will svg fonts make OCRs simpler ?
kamathln can we add the "direction" characters / half characters are written in ?
Popolon ??= come in, ?? go in, ?? come out, ??, go out
kamathln cool .. those rendered nicely on my pc
Popolon :)
kamathln and what is this about "simplified chinese" and "traditional chinese" ?
Popolon ??man, ?= follow
kamathln wow! each char is a word ?
Popolon ?=arbre, ??wood, ??forest, often written ??
Popolon yes
Popolon often two chars are used together
kamathln a picture is worth a word :-))
Popolon as one char=1 syllable, and there are too few different syllables
Popolon this is to avoid confusions
Popolon ??arbre, ?=racine :)
Popolon sorry
Popolon ??tree, ?=root :)
Popolon the same before arbre in french = tree in english
kamathln ah!
Popolon ?=tree, ??wood, ??forest, often written ??
Popolon then
kamathln which means forest tree
Popolon ??little, ?=young
kamathln oh
Popolon in means forest, it's used if it's alone to avoid confusion
Popolon if you add some characteristic, you can use only the half
Popolon as in : ??
Popolon young wood
kamathln what if they want to write a word of another language ?
Popolon that's the name of the famous temple, ???shaolin :)
kamathln like they want to write "Laxminarayan"
kamathln which is incidentally my name
Popolon they translate it, or use pronunciation similarities
Popolon Sadly, I don't know how it is pronounced
Popolon ??
Popolon I suppose rayan = something like this
kamathln LOL !
kamathln I was wondering what that is
Popolon ???? (ximalaya) = Himalaya mountains
Popolon for example
Popolon that's pronunciation translation
kamathln Hey.. I am off to dinner .. Will be back in half an hour
kamathln mom yelling
Popolon ?=happiness, ?= horse, ?=pull, ?=elegant
Popolon this is clearly for pronunciation purpose
kamathln And I am off
kamathln pierremarc: Popolon: it was a very informative chat session .. thx very much

Resource links:
http://www.fontmatrix.net/kamathln/brh2utf8.tbz
http://www.infitt.org/ti2002/papers/64RAGHUV.PDF
http://www.baraha.com/web_docs/glyph_codes_kan.htm
http://unicode.org/charts/PDF/U0C80.pdf
http://omega.enstb.org/yannis/pdf/indica-santa94.pdf
http://www.w3.org/2002/Talks/09-ri-indic/indic-paper.html
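
The dotted circle that came up in the chat is what most renderers draw when a vowel sign or virama has no consonant to attach to, so it hints that the string itself is malformed. A rough sanity check, with code point ranges read off the U0C80 chart linked above, might look like the sketch below. The function name is made up for illustration and the rule is deliberately simplified (a sign must directly follow a consonant or a nukta).

```python
KANNADA_CONSONANTS = range(0x0C95, 0x0CBA)   # KA .. HA
KANNADA_VOWEL_SIGNS = range(0x0CBE, 0x0CCD)  # AA .. AU (includes unassigned gaps)
KANNADA_NUKTA = 0x0CBC
KANNADA_VIRAMA = 0x0CCD


def find_orphan_signs(text):
    """Return the indexes of vowel signs / viramas that lack a base consonant."""
    orphans = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp in KANNADA_VOWEL_SIGNS or cp == KANNADA_VIRAMA:
            prev = ord(text[i - 1]) if i > 0 else None
            attached = prev is not None and (
                prev in KANNADA_CONSONANTS or prev == KANNADA_NUKTA
            )
            if not attached:
                orphans.append(i)
    return orphans
```

Running the converter output through a check like this gives the positions where a dotted circle is likely to show up, which is easier than spotting them by eye in a font that may or may not render them.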

Slightly offtopic:
http://www.w3.org/TR/SVG11/fonts.html

3 comments:

Kiran krishnaiah said...

Hi Lanky

I am Kiran. First, thanks for uploading the Baraha to UTF8 code. I have worked with it. As you mentioned, the work on the database is yet to be completed. If you are interested, I would like to be part of the development. Please send me the documents for updating the database. I would be very happy to complete the database, as part of my work requires conversion from Baraha text to UTF8 text.

Thanks in Advance

Thanks & Regards
Kiran.K

lankythoughts said...

I recommend using http://fci.wikia.com/wiki/SMC/Payyans as the base rather than my code. It is tried and tested. Yes, it is for Malayalam, but I think it could be modified for Baraha to Unicode Kannada with a bit of work.

My code is much more of a weird search and replace function than Baraha 2

Kiran krishnaiah said...

Hi Lanky

Thanks for the immediate response, buddy. Payyans has everything I needed. I shall work on it and get back to you.

Once again thanks a lot.

Thanks & Regards
Kiran.k