Tuesday, 21 July 2015

A Universal Date type in Nepali

During my corpus collection work in Nepali, I wanted to understand all the date formats used by news sources. Many news sources keep their date information in different formats. For example, eKantipur uses the format २०७२ श्रावण ५ ०८:३१, whereas Nagarik News uses मङ्गलबार ५ श्रावण, २०७२, and so on. The state machine for these formats is shown in the diagram below.

Fig. State machine for different Nepalese date formats.

My intention is to make this corpus searchable too. So I wrote a computer program that understands the different Nepalese date formats and indexes them in a sortable format. For more detail, click here.
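
As a rough illustration of the idea (not the actual program), the sketch below turns the two example formats above into one sortable key of the form YYYY-MM-DD in the Bikram Sambat calendar. The class name NepaliDateKey, the method toSortKey, the month spellings, and the rule "a four-digit number is the year" are all assumptions made for this example; real sources use more spellings and formats than are handled here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: normalizes a Nepali date string into a sortable key.
class NepaliDateKey {

    // Bikram Sambat month names in calendar order (index 0 = first month).
    // Assumption: only one common spelling per month is listed.
    private static final List<String> MONTHS = Arrays.asList(
        "वैशाख", "जेठ", "असार", "श्रावण", "भदौ", "असोज",
        "कार्तिक", "मंसिर", "पुष", "माघ", "फागुन", "चैत");

    // Returns "YYYY-MM-DD" so that plain string sorting gives date order.
    static String toSortKey(String text) {
        int year = -1, month = -1, day = -1;
        for (String token : text.trim().split("[\\s,]+")) {
            int m = MONTHS.indexOf(token);
            if (m >= 0) {
                month = m + 1;
                continue;
            }
            int n = parseDigits(token);
            if (n < 0) {
                continue; // weekday name, time such as ०८:३१, or unknown word
            }
            if (token.length() == 4) {
                year = n;  // assumption: a four-digit number is the year
            } else if (day < 0) {
                day = n;   // first short number is taken as the day
            }
        }
        if (year < 0 || month < 0 || day < 0) {
            throw new IllegalArgumentException("Unrecognized date: " + text);
        }
        return String.format(Locale.ROOT, "%04d-%02d-%02d", year, month, day);
    }

    // Parses a token made only of decimal digits (Devanagari or ASCII);
    // returns -1 if any other character is present.
    static int parseDigits(String token) {
        if (token.isEmpty()) {
            return -1;
        }
        int value = 0;
        for (int i = 0; i < token.length(); i++) {
            int d = Character.digit(token.charAt(i), 10);
            if (d < 0) {
                return -1;
            }
            value = value * 10 + d;
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(toSortKey("२०७२ श्रावण ५ ०८:३१"));
        System.out.println(toSortKey("मङ्गलबार ५ श्रावण, २०७२"));
    }
}
```

Both example strings describe the same day, so they map to the same key, and keys from different sources can be sorted or indexed together as plain strings.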

Tuesday, 7 July 2015

Research – Morphological and Sentiment Analysis based on Nepali Corpus

My work: a morphological analyzer for the Nepali language and a sentiment analysis classifier are online, with a demo posted here.

Friday, 3 July 2015

Beginning Stuff: Lab problems on C - Programming for Undergraduate students

My undergraduate students asked me to post some materials on C programming. This post helps you get started in programming with implementations of some problems from the first undergraduate year.
   
This sheet introduces the IDE, the basics of writing, compiling, and running a program, and the printf and scanf functions. It is designed for two lab days.

This sheet helps students convert a program specification into a C program, i.e. flowchart to program. It is especially designed to build familiarity with appropriate data types and arithmetic on them. It requires one lab day.

This sheet is designed to learn the usage of decision statements in C programming and requires two lab days. 

This sheet introduces loop statements in C programming, that is, simple loops without nesting. It requires three lab days.

This sheet introduces loops and arrays in C programming. Students will realize the importance of nested loops here. It requires two lab days.

This sheet is designed to make students familiar with strings and arrays. It requires two lab days to complete.

This sheet introduces writing and calling functions in C programming and requires two lab days.

This sheet introduces the definition and usage of pointer types to solve problems in C programming. It requires two lab days to complete.

This sheet is designed to make students familiar with file input/output and the usage of structures (user-defined types) in the C programming language. It requires four lab days.

I would like to post some sample solutions here:

1. Solution of dictionary based string comparison.

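#include <stdio.h>
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* strlen */

int compareStrings(const char *, const char *);
int getStringLength(const char *);
char *strToupper(const char *);

/* Dictionary-order comparison: returns 1 if a > b, -1 if a < b, 0 if equal. */
int compareStrings(const char *a, const char *b){
    /* compare character by character while neither string has ended */
    while(*a != '\0' && *b != '\0'){
        printf("Iteration : compare %c and %c.\n", *a, *b);
        if(*a > *b){
            return 1;
        }else if (*a < *b){
            return -1;
        }
        a++;b++;
    }
    /* one string is a prefix of the other: the longer one comes later */
    if(*a != '\0'){
        return 1;
    }else if(*b != '\0'){
        return -1;
    }
    return 0;
}

/* Returns a newly allocated upper-case copy of str (NULL if allocation
   fails). The caller must free the returned string. */
char *strToupper(const char *str){
    int i = 0;
    char *rStr = malloc(strlen(str) + 1);
    if(rStr == NULL)
        return NULL;
    for(; str[i]; ++i){
        if((str[i] >= 'a') && (str[i] <= 'z'))
            rStr[i] = str[i] + 'A' - 'a';
        else
            rStr[i] = str[i];
    }
    rStr[i] = '\0';
    return rStr;
}

/* Counts the characters before the terminating '\0'. */
int getStringLength(const char *str){
    int result = 0;
    while(*str != '\0'){
        result ++;
        str++;
    }
    return result;
}

int main(){
    const char *first = "appze";
    const char *second = "apple";
    int result = compareStrings(first, second);
    char *upperFirst = strToupper(first);
    char *upperSecond = strToupper(second);
    if(upperFirst == NULL || upperSecond == NULL){
        free(upperFirst);
        free(upperSecond);
        return 1;
    }
    printf("Comparison result : %d\n", result);
    printf("Length of %s = %d\n", upperFirst, getStringLength(upperFirst));
    printf("Length of %s = %d\n", upperSecond, getStringLength(upperSecond));
    free(upperFirst);
    free(upperSecond);
    return 0;
}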

2. Solution to calculate angle between two vectors.

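//Vector Demo in C
#include <stdio.h>
#include <math.h>
#define MAX 10 // denotes the maximum dimensions of a vector
#ifndef M_PI   // M_PI is not part of standard C, so define it if math.h does not
#define M_PI 3.14159265358979323846
#endif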

typedef struct {
    double coef[MAX];
}Vector;

double Magnitude(Vector);
double DotProduct(Vector, Vector);
double CosTheta(Vector, Vector);
double Theta(Vector, Vector);

double Magnitude(Vector v){
    double sum = 0.0;
    int index;
    for(index = 0; index < MAX; index++){
        sum += v.coef[index] * v.coef[index];
    }
    return sqrt(sum);
}

double DotProduct(Vector v, Vector w){
    double sum = 0.0;
    int index;
    for(index = 0; index < MAX; index++){
        sum += v.coef[index] * w.coef[index];
    }
    return sum;
}

double CosTheta(Vector x, Vector y){
    return DotProduct(x, y)/(Magnitude(x) * Magnitude(y));
}

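double Theta (Vector v, Vector w){
    double c = CosTheta(v, w);
    // clamp against rounding error so acos() never gets a value outside [-1, 1]
    if(c > 1.0) c = 1.0;
    if(c < -1.0) c = -1.0;
    return acos(c) * 180.0 / M_PI;
}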

int main(){
    Vector v1 = {{40, 0, 45, 30, 20, 89, 90, 100, 5, 0}};
    Vector v2 = {{20, 100, 6, 89, 999, 9, 900, 89, 50, 21}};
    printf("Angle : %.2f degrees.\n", Theta(v1, v2));
    return 0;
}

I believe that these problem sheets help to improve the programming skills of undergraduate students in particular, and of anyone beginning to write computer programs in C. Please feel free to post your comments; you are most welcome. Thanks to Shailesh and Nischal Dai.




Sunday, 21 June 2015

Searching Nepali Text

It was late 2007. I was trying to develop a system that could take search queries and display results from a huge Nepali archive. There were two problems:

1. How to get a huge amount of Nepali text?
2. How to make it searchable, with highly relevant results?

I developed some state machines (each equivalent to a simple regular expression) that can accumulate news archives from web dumps of many news sources. A sample machine for the Himal newspaper is here.


Fig. State machine that accumulates news of Himal Khabar.

The Java source for the state machine is:

/**
 *
 * @author Santa
 */
public class Himal { 
    
    private boolean isNepali(char ch) {
        if (ch >= 0x900 && ch <= 0x97f) {
            return (true);
        } else if (ch == ',' || ch == '?' || ch=='!') {
            return (true);
        } else if (ch == ' ') {
            return (true);
        } else if (ch == '\n') {
            return (true);
        } else {
            return (false);
        }
    }
    
    private StringBuffer filterText(StringBuffer s){
        StringBuffer result = new StringBuffer();
        char ch;
        int state = 0;
        for(int i=0;i<s.length();i++){
            ch = s.charAt(i);
            if(ch==',' || ch=='?' || ch=='\u0903'){ // U+0903 is the Devanagari visarga
                if(state != 0){
                    result.append(ch);
                    result.append(" ");
                }
                state = 0;
            }
            else if(ch==' '){
                if(state != 0)
                    result.append(ch);
                state = 0;
            }             
            else{
                state = 1;
                result.append(ch);
            }
        }
        return(result);
    }
    
   private StringBuffer dfaHimal(StringBuffer buffer){
        StringBuffer result = new StringBuffer();       
        String tmp = "";
        int state = 1;
        char ch;
        for(int i=0;i<buffer.length();i++){
            ch = buffer.charAt(i);          
            switch(state){
                case 1:
                    if(ch=='=')
                        state = 2;
                    else
                        state = 1;                   
                    break;
                case 2:
                    if(ch=='"')
                        state = 3;                    
                    else
                        state = 1;
                    break;
                case 3:
                    if(ch=='"'){
                        if(tmp.equals("headline")||tmp.equals("articletext")||tmp.equals("intro"))                                                        
                            state = 4;      
                        else
                            state = 1;
                        tmp = "";
                    }                    
                    else{
                        tmp += ch;
                        state = 3;
                    }
                    break;  
                case 4: 
                    if(isNepali(ch))
                        result.append(ch);                    
                    if(ch=='<'){
                        tmp = "";   // start collecting a fresh tag name
                        state = 5;
                    }
                    else
                        state = 4;
                    break;
                case 5:
                    if(ch=='>'){
                        if(tmp.equals("br")||tmp.equals("/p")||tmp.equals("/P")||tmp.equals("BR")||tmp.equals("/TABLE")){
                            result.append("\r\n44  ");
                            state = 4;
                        }
                        else if(tmp.equals("/div")){
                            result.append("\r\n");
                            state = 1;
                        }
                        else
                            state = 4;
                    }
                    else if(ch=='='){
                        state = 6;
                    }
                    else{
                        tmp += ch;
                        state = 5;
                    }
                    break;
                case 6:
                    if(ch=='"'){
                        tmp = "";
                        state = 7; 
                    }
                    else
                        state = 4;
                    break;
                case 7:
                    if(ch=='"'){
                        if(tmp.equals("headlines"))
                            return(result);
                        else
                            state = 5;                      
                    }
                    else{
                        tmp += ch;
                        state = 7;
                    }
            }
        }
        return(result);
    }
    
    public StringBuffer getNewsText(String file){
        StringBuffer result = new StringBuffer();
        StringBuffer buffer = new StringBuffer();
        FileReadWrite obj = new FileReadWrite();
        buffer = obj.readFile(file);     
        result = dfaHimal(buffer);
        return(result);
    }

}
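
The Himal class calls a FileReadWrite helper that is not shown in this post. The following is only a minimal sketch of what such a helper could look like, assuming the web dumps are UTF-8 text files; the writeFile method and the choice to rethrow I/O errors as unchecked exceptions are assumptions made for this example, not the original implementation.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical stand-in for the FileReadWrite helper used by getNewsText.
class FileReadWrite {

    // Reads the whole file as UTF-8 text (assumed encoding of the dumps).
    public StringBuffer readFile(String file) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(file));
            return new StringBuffer(new String(bytes, StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes text to the file as UTF-8, replacing any existing content.
    public void writeFile(String file, CharSequence text) {
        try {
            Files.write(Paths.get(file),
                        text.toString().getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Reading with an explicit charset matters here: if the platform default encoding were used instead, the Devanagari range check in isNepali could silently fail on machines whose default is not UTF-8.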

This was the beginning of the search system's development.