Tuesday, 21 July 2015

A Universal Date type in Nepali

During my corpus collection work in Nepali, I wanted to understand all types/formats of dates available in the News sources. Many news sources keep their date information in  different formats. For example: eKantipur uses २०७२ श्रावण ५ ०८:३१  format where as Nagarik News uses मङ्गलबार ५ श्रावण, २०७२ format and so on.  The diagrammatic representation of the system in state machine is given below.

Fig. State machine for different Nepalese date formats.

My intuition is to make this corpus searchable too. So, I wrote a computer program that understand different formats of Nepalese date and index it into sort-able formats.  For more detail, click here.

Tuesday, 7 July 2015

Research – Morphological and Sentiment Analysis based on Nepali Corpus

My Work: A morphological analyzer for Nepali language and Sentiment analysis as a classifier is in online with demo is posted here.

Friday, 3 July 2015

Beginning Stuff: Lab problems on C - Programming for Undergraduate students

My under graduate students ask me to post some materials on C-Programming. This post helps you to begin in programming with some implementation of problems from undergraduate first year. 
   
This sheet introduces the IDE development environment, basics of program coding, compilation  and run, and also familiarize the printf and scanf functions. This is designed for two lab days.

This sheet helps students to convert the program specification into the C program i.e. flowchart to program. It especially designed to be familiar with appropriate data types and the data arithmetic. It requires one lab day.

This sheet is designed to learn the usage of decision statements in C programming and requires two lab days. 

This sheet introduces the loop statements in C programming, I mean simple is without nesting and requires three lab days.

This sheet introduces the loops and arrays in C programming. Students will realize the importance of loops nesting here and requires two lab days.

This sheet is designed to make students familiar with strings and arrays. It requires two lab days to complete.

This sheet introduces writing and calling functions in C programming and requires two lab days.

This sheet introduces the definition and usage of pointers type to solve the problems in C programming. It requires two lab days to complete.

This sheet is designed to make students familiar with file input/output and the usage of structure (user defined type) in C programming language. It requires four lab days.

I like to post some sample solutions here:

1. Solution of dictionary based string comparison.

#include<stdio.h>

int compareStrings(char *, char *);
int getStringLength(char *);
char* strToupper(char*);

int compareStrings(char *a, char *b){
    int result = 0;
    while(*a != '\0' || *b != '\0'){
    printf("Iteration : compare %c and %c.\n", *a, *b);
        if(*a > *b){
            return 1;
        }else if (*a < *b){
            return -1;
        }
        a++;b++;
    }
    if(*a != '\0'){
        return 1;
    }else if(*b != '\0'){
        return -1;
    }
    return result;
}

char *strToupper(char *str){
    int i = 0;
    char *rStr = (char*) malloc((strlen(str) + 1) * sizeof(char));
    for(; str[i]; ++i){
        if((str[i] >= 'a') && (str[i] <= 'z'))
            rStr[i] = str[i] + 'A' - 'a';
        else
            rStr[i] = str[i];
    }
    rStr[i] = '\0';
    return rStr;
}

int getStringLength(char *str){
    int result = 0;
    while(*str != '\0'){
        result ++;
        str++;
    }
    return result;
}

int main(){
    char *first = "appze";
    char *second = "apple";
    int result = compareStrings(first, second);
    first = strToupper(first);
    second = strToupper(second);
    printf("Comparison result : %d\n", result);
    printf("Length of %s = %d\n", first, getStringLength(first));
    printf("Length of %s = %d\n", second, getStringLength(second));
    return(0);
}

2. Solution to calculate angle between two vectors.

//Vector Demo in C
#include <stdio.h>
#include <math.h>
#define MAX 10 // denotes the maximum dimensions of a vector

typedef struct {
    double coef[MAX];
}Vector;

double Magnitude(Vector);
double DotProduct(Vector, Vector);
double CostTheta(Vector, Vector);
double Theta(Vector, Vector);

double Magnitude(Vector v){
    double sum = 0.0;
    int index;
    for(index = 0; index < MAX; index++){
        sum += v.coef[index] * v.coef[index];
    }
    return sqrt(sum);
}

double DotProduct(Vector v, Vector w){
    double sum = 0.0;
    int index;
    for(index = 0; index < MAX; index++){
        sum += v.coef[index] * w.coef[index];
    }
    return sum;
}

double CosTheta(Vector x, Vector y){
    return DotProduct(x, y)/(Magnitude(x) * Magnitude(y));
}

double Theta (Vector v, Vector w){
    return acos(CosTheta(v, w)) * 180.0 / M_PI;
}

int main(){
    Vector v1 = {{40, 0, 45, 30, 20, 89, 90, 100, 5, 0}};
    Vector v2 = {{20, 100, 6, 89, 999, 9, 900, 89, 50, 21}};
    printf("Angle : %.2f degree.\n", Theta(v1, v2));
    return 0;
}

I believe that these sheets of problems helps to improve the programming skill of especially undergraduate students and someone who begins to write computer program in C. Please feel free to post your comments, you are most welcome. Thanks to Shailesh and Nischal Dai.




Sunday, 21 June 2015

Searching Nepali Text

It was late 2007, I was trying to develop the system that could search queries and displays the result from huge Nepali archives. There were two problems:

1. How to get huge amount of Nepali text ?
2. How to make them searchable and search with high relevancy ?

 I developed some state machines (a simple, equivalent regular expression) that can able accumulate news archives from the web dumped of many news sources.  A sample machine  for Himal News Paper is here.


Fig. State machine that accumulates news of Himal Khabar.

The java source representation of the state machine is:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
/*
 * To change this template, choose Tools | Templates
 * and open the template in the editor.
 */

/**
 *
 * @author Santa
 */
public class Himal { 
    
    private boolean isNepali(char ch) {
        if (ch >= 0x900 && ch <= 0x97f) {
            return (true);
        } else if (ch == ',' || ch == '?' || ch=='!') {
            return (true);
        } else if (ch == ' ') {
            return (true);
        } else if (ch == '\n') {
            return (true);
        } else {
            return (false);
        }
    }
    
    private StringBuffer filterText(StringBuffer s){
        StringBuffer result = new StringBuffer();
        char ch;
        int state = 0;
        for(int i=0;i<s.length();i++){
            ch = s.charAt(i);
            if(ch==',' || ch=='?' || ch=='เคƒ'){
                if(state != 0){
                    result.append(ch);
                    result.append(" ");
                }
                state = 0;
            }
            else if(ch==' '){
                if(state != 0)
                    result.append(ch);
                state = 0;
            }             
            else{
                state = 1;
                result.append(ch);
            }
        }
        return(result);
    }
    
   private StringBuffer dfaHimal(StringBuffer buffer){
        StringBuffer result = new StringBuffer();       
        String tmp = "";
        int state = 1;
        char ch;
        for(int i=0;i<buffer.length();i++){
            ch = buffer.charAt(i);          
            switch(state){
                case 1:
                    if(ch=='=')
                        state = 2;
                    else
                        state = 1;                   
                    break;
                case 2:
                    if(ch=='"')
                        state = 3;                    
                    else
                        state = 1;
                    break;
                case 3:
                    if(ch=='"'){
                        if(tmp.equals("headline")||tmp.equals("articletext")||tmp.equals("intro"))                                                        
                            state = 4;      
                        else
                            state = 1;
                        tmp = "";
                    }                    
                    else{
                        tmp += ch;
                        state = 3;
                    }
                    break;  
                case 4: 
                    if(isNepali(ch))
                        result.append(ch);                    
                    if(ch=='<')
                        state = 5;
                    else
                        state = 4;
                    break;
                case 5:
                    if(ch=='>'){
                        if(tmp.equals("br")||tmp.equals("/p")||tmp.equals("/P")||tmp.equals("BR")||tmp.equals("/TABLE")){
                            result.append("\r\n44  ");
                            state = 4;
                        }
                        else if(tmp.equals("/div")){
                            result.append("\r\n");
                            state = 1;
                        }
                        else
                            state = 4;
                    }
                    else if(ch=='='){
                        state = 6;
                    }
                    else{
                        tmp += ch;
                        state = 5;
                    }
                    break;
                case 6:
                    if(ch=='"'){
                        tmp = "";
                        state = 7; 
                    }
                    else
                        state = 4;
                    break;
                case 7:
                    if(ch=='"'){
                        if(tmp.equals("headlines"))
                            return(result);
                        else
                            state = 5;                      
                    }
                    else{
                        tmp += ch;
                        state = 7;
                    }
            }
        }
        return(result);
    }
    
    public StringBuffer getNewsText(String file){
        StringBuffer result = new StringBuffer();
        StringBuffer buffer = new StringBuffer();
        FileReadWrite obj = new FileReadWrite();
        buffer = obj.readFile(file);     
        result = dfaHimal(buffer);
        return(result);
    }

}

It was the beginning of the searching system development.


Sunday, 12 September 2010

An English Nepali Toy Machine Translation System – Dobhase (दोभासे)

Dobhase, an English Nepali translation system was developed by joint effort of Kathmandu Universtiy and
MPP in 2006. It is a rule based(dictionary based) system and works in few simple sentenced examples.
For more information click here.

A leading organization in Nepali Language Computation

MPP(Madan Puraskar Pustaklaya) is a leading organization in Nepali Language computation. Lots of information regarding Nepali archived data, current projects and developments and materials are provided here.

Few Years back, TTF to Unicode converstion

Few years back, I was working in font conversion. This system does the converstion of PCS Nepali ttf font to Nepali Unicode. The main purpose of this project was to convert all the Nepali archived voter list. I hope this will help to all who want to explore in font processing and data conversion.

Conversion Source :

001 /*
002  * To change this template, choose Tools | Templates
003  * and open the template in the editor.
004  */
005 
006 import java.util.*;
007 import java.util.ArrayList;
008 
009 /**
010  *
011  @author Santa
012  */
013 public class unicodeConversion {
014 
015     List<String> halfLetter = new ArrayList<String>();
016     fileTokenizer ftok;
017 
018     public unicodeConversion() {
019         ftok = new fileTokenizer();
020         halfLetter = ftok.readFileText("halfLetter.txt");
021     }
022 
023     String getUnicode(char ch) {
024         switch (ch) {
025             case '0':
026                 return ("०");
027             case '1':
028                 return ("१");
029             case '2':
030                 return ("२");
031             case '3':
032                 return ("३");
033             case '4':
034                 return ("४");
035             case '5':
036                 return ("५");
037             case '6':
038                 return ("६");
039             case '7':
040                 return ("७");
041             case '8':
042                 return ("८");
043             case '9':
044                 return ("९");
045             case '~':
046                 return ("ङ");
047             case '`':
048                 return ("ञ्");
049             case '!':
050                 return ("ज्ञ");
051             case '@':
052                 return ("ई");
053             case '#':
054                 return ("घ");
055             case '$':
056                 return ("द्ध");
057             case '%':
058                 return ("छ");
059             case '^':
060                 return ("ट");
061             case '&':
062                 return ("ठ");
063             case '*':
064                 return ("ड");
065             case '(':
066                 return ("ढ");
067             case ')':
068                 return ("ण्");
069             case '-':
070                 return ("(");
071             case '_':
072                 return (")");
073             case '+':
074                 return ("ं");
075             case '=':
076                 return ("॰");
077             case 'q':
078                 return ("त्र");
079             case 'w':
080                 return ("ध");
081             case 'e':
082                 return ("भ");
083             case 'r':
084                 return ("च");
085             case 't':
086                 return ("त");
087             case 'Q':
088                 return ("त्त");
089             case 'W':
090                 return ("ध्");
091             case 'E':
092                 return ("भ्");
093             case 'R':
094                 return ("च्");
095             case 'T':
096                 return ("त्");
097             case 'p':
098                 return ("उ");
099             case 'o':
100                 return ("य");
101             case 'i':
102                 return ("ष्");
103             case 'u':
104                 return ("ग");
105             case 'y':
106                 return ("थ");
107             case 'P':
108                 return ("ए");
109             case 'O':
110                 return ("इ");
111             case 'I':
112                 return ("क्ष्");
113             case 'U':
114                 return ("ग्");
115             case 'Y':
116                 return ("थ्");
117             case '[':
118                 return ("ृ");
119             case '{':
120                 return ("र्");
121             case ']':
122                 return ("े");
123             case '}':
124                 return ("ै");
125             case 'a':
126                 return ("ब");
127             case 's':
128                 return ("क");
129             case 'd':
130                 return ("म");
131             case 'f':
132                 return ("ा");
133             case 'g':
134                 return ("न");
135             case 'A':
136                 return ("ब्");
137             case 'S':
138                 return ("क्");
139             case 'D':
140                 return ("म्");
141             case 'F':
142                 return ("ँ");
143             case 'G':
144                 return ("न्");
145             case ';':
146                 return ("स");
147             case 'l':
148                 return ("ि");
149             case 'k':
150                 return ("प");
151             case 'j':
152                 return ("व");
153             case 'h':
154                 return ("ज");
155             case ':':
156                 return ("स्");
157             case 'L':
158                 return ("ी");
159             case 'K':
160                 return ("प्");
161             case 'J':
162                 return ("व्");
163             case 'H':
164                 return ("ज्");
165             case 'z':
166                 return ("श");
167             case 'x':
168                 return ("ह");
169             case 'c':
170                 return ("अ");
171             case 'v':
172                 return ("ख");
173             case 'Z':
174                 return ("श्");
175             case 'X':
176                 return ("ह्");
177             //case 'C':
178             //return("");
179             case 'V':
180                 return ("ख्");
181             case '/':
182                 return ("र");
183             case '.':
184                 return ("।");
185             case ',':
186                 return ("‚");
187             //case 'm':
188             //return("");
189             case 'n':
190                 return ("ल");
191             case 'b':
192                 return ("द");
193             case '?':
194                 return ("रू");
195             case '>':
196                 return ("श्र");
197             //case '<':
198             //return("");    
199             case 'M':
200                 return ("ः");
201             case 'N':
202                 return ("ल्");
203             case 'B':
204                 return ("द्य");
205 
206             //Special Characters
207             //case '|':
208             //return ("र्");
209             case '¤':
210                 return ("ँ");
211             case 'ñ':
212                 return ("ङ");
213             case '\'':
214                 return ("ु");
215             case '\"':
216                 return ("ू");
217                 
218             case 161:
219                 return("ज्ञ्");
220             case 162:
221                 return("द्ध");                
222             case 163:
223                 return("घ्");
224             case 165:
225                 return("ऋ");
226             case 167:
227                 return("ट्ट");
228             case 169:
229                 return("?");
230             case 170:
231                 return("ञ");
232             case 174:
233                 return("‌+");
234             case 176:
235                 return("ङ्क");
236             case 180:
237                 return("झ");
238             case 183:
239                 return("ठ्ठ");
240             case 191:
241                 return("रु");
242             case 198:
243                 return("\"");
244             case 210:
245                 return("ू");
246             case 212:
247                 return("क्ष");
248             //case 216:
249                 //return("");
250             case 217:
251                 return("ह");
252             case 218:
253                 return("ु");
254             case 223:
255                 return("द्म");
256             case 229:
257                 return("द्व");
258             case 230:
259                 return("\"");
260             case 231:
261                 return("ॐ");
262             case 233:
263                 return("ङ्ग");
264             case 237:
265                 return("ष");
266             case 247:
267                 return("/");
268             case 248:
269                 return("य्");
270             case 250:
271                 return("ू");
272             
273             default:
274                 return (ch + "");
275         }
276     }
277 
278     private boolean isHalf(String s) {
279         if (halfLetter.contains(s)) {
280             return (true);
281         else {
282             return (false);
283         }
284     }
285 
286     private boolean isToLookNext(char ch) {
287         switch (ch) {
288             case 'e'://भ
289             case 'k'://प
290             case 'i'://ष्
291             case 'I'://क्ष्
292             case 'f'://ा
293             case 'Q'://त्त
294                 return (true);
295             default:
296                 return (false);
297         }
298     }
299 
300     private boolean isToLookPrevious(char ch) {
301         switch (ch) {
302             case '|':
303                 return (true);
304             default:
305                 return (false);
306         }
307     }
308 
309     private boolean isAakar(char ch) {
310         switch (ch) {
311             case 'ा':
312             case 'ु':
313             case 'ू':
314             case 'ो':
315             case 'ौ':
316             case 'ी':
317             case 'ि':
318             case 'े':
319             case 'ै':
320                 return (true);
321             default:
322                 return (false);
323         }
324     }
325 
326     private StringBuffer insertHalfRa(StringBuffer data) {
327         char ch = data.charAt(data.length() 1);
328         try {
329             if (isAakar(ch)) {
330                 data.insert(data.length() 2"र्");
331             else {
332                 data.insert(data.length() 1"र्");
333             }
334         catch (Exception ee) {
335             //data.insert(data.length()-1, "र्");
336         }
337         return (data);
338     }
339 
340     private String processHalfRa(StringBuffer data) {
341         StringBuffer result = new StringBuffer();
342         int i = 0;
343         char ch, ch1;
344         while (i < data.length()) {
345             ch = data.charAt(i);
346             if (ch == 'र') {
347                 i++;
348                 try {
349                     ch1 = data.charAt(i);
350                     if (ch1 == '्') {
351                         //Inserting it into Previous Character
352                         result = insertHalfRa(result);
353                     else {
354                         result.append(ch);
355                         i--;
356                     }
357                 catch (Exception ee) {
358                     result.append(ch);
359                     i--;
360                 }
361             else {
362                 result.append(ch);
363             }
364             i++;
365         }
366         return (result.toString());
367     }
368 
369     public String convertToUnicode(String s) {
370         String result = "";
371         int i;
372         char ch, ch1, ch2;
373         boolean iflag = false;
374         String tmp;
375         i = 0;
376         while (i < s.length()) {
377             try {
378                 ch = s.charAt(i);
379                 //Special अ process
380                 if (ch == 'c') {
381                     i++;
382                     ch1 = s.charAt(i);
383                     if (ch1 == 'f') {
384                         if (i + < s.length()) {
385                             i++;
386                             ch2 = s.charAt(i);
387                             if (ch2 == ']') {
388                                 result += "ओ";
389                             else if (ch2 == '}') {
390                                 result += "औ";
391                             else {
392                                 i--;
393                                 result += "आ";
394                             }
395                         }
396                     else {
397                         result += "अ";
398                         i--;
399                     }
400                 //Looking Next Char
401                 else if (isToLookNext(ch)) {
402                     i++;
403                     ch1 = s.charAt(i);
404                     if (ch == 'e' && ch1 == 'm') {
405                         result += "झ";
406                         if (iflag) {
407                             result += "ि";
408                             iflag = false;
409                         }
410                     else if (ch == 'k' && ch1 == 'm') {
411                         result += "फ";
412                         if (iflag) {
413                             result += "ि";
414                             iflag = false;
415                         }
416                     else if (ch == 'Q' && ch1 == 'm') {
417                         result += "क्त";
418                         if (iflag) {
419                             result += "ि";
420                             iflag = false;
421                         }
422                     else if (ch == 'i' && ch1 == 'f') {
423                         result += "ष";
424                         if (iflag) {
425                             result += "ि";
426                             iflag = false;
427                         }
428                     else if (ch == 'I' && ch1 == 'f') {
429                         result += "क्ष";
430                         if (iflag) {
431                             result += "ि";
432                             iflag = false;
433                         }
434                     else if (ch == 'f' && ch1 == ']') {
435                         result += "ो";
436                     else if (ch == 'f' && ch1 == '}') {
437                         result += "ौ";
438                     else {
439                         i--;
440                         tmp = getUnicode(ch);
441                         result += tmp;
442                         if (iflag) {
443                             if (!isHalf(tmp)) {
444                                 result += "ि";
445                                 iflag = false;
446                             }
447                         }
448                     }
449                 else if (isToLookPrevious(ch)) {
450                     result += "्";
451                     result += "र";
452                     if (iflag) {
453                         result += "ि";
454                         iflag = false;
455                     }
456                 else {
457                     if (ch == 'l') {
458                         iflag = true;
459                     else {
460                         tmp = getUnicode(ch);
461                         result += tmp;
462                         try {
463                             ch1 = s.charAt(i + 1);
464                             if (ch1 != '|') {
465                                 if (iflag) {
466                                     if (!isHalf(tmp)) {
467                                         result += "ि";
468                                         iflag = false;
469                                     }
470                                 }
471                             }
472                         catch (Exception ee) {
473                         }
474                     }
475                 }
476             catch (Exception ee) {
477                 i--;
478                 ch = s.charAt(i);
479                 tmp = getUnicode(ch);
480                 result += tmp;
481             }
482             i++;
483         }
484         result = processHalfRa(new StringBuffer(result));
485         return (result);
486     }
487 }

In the above source, "halfLetter.txt" contains all the half characters representation of Nepali text.

Usage :
String Output = new unicodeConversion().convertToUnicode(src);

Cheers !
Santa