Compiler Design

Design a Compiler - Building Nepal's First Nepali Programming Language

The journey of creating Sakshyar, Nepal's first programming language in Nepali, from tokenization to code generation. Because programming shouldn't be limited by language barriers.

5 min read
compiler-design programming-language nepali tokenization parser code-generation

During my third year at IOE Pashchimanchal Campus, I had what some might call a crazy idea: What if we could program in Nepali? Not just comments or variable names, but actual syntax, keywords, and control structures. This led to the creation of Sakshyar - Nepal’s first programming language where you could write code like “यदि (x > 5) भने” instead of “if (x > 5) then”.

The “Why” Behind the Madness

The inspiration came from watching my juniors struggle with English syntax while learning programming. They understood the logic perfectly but stumbled over keywords like “while”, “function”, and “return”. I thought: why should language be a barrier to logical thinking?

The Vision

Create a programming language that uses Nepali syntax and keywords, making programming more accessible to Nepali speakers while maintaining the power and flexibility of modern languages.

Compiler Architecture: The Big Picture

Before diving into code, let’s understand what we’re building. A compiler is essentially a translator that converts high-level code into machine-executable instructions. Here’s how Sakshyar’s compiler works:

Source Code (Nepali) → Lexer → Parser → AST → Code Generator → Target Code

Input Example (Sakshyar)

फंक्शन जोड(क, ख) {
    फर्काउ क + ख;
}

मुख्य() {
    परिणाम = जोड(५, ३);
    प्रिन्ट(परिणाम);
}

Generated C Code

int jod(int k, int kh) {
    return k + kh;
}

int main() {
    int parinam = jod(5, 3);
    printf("%d", parinam);
    return 0;
}

Phase 1: Lexical Analysis (Tokenization)

The lexer’s job is to break the source code into tokens. This was particularly challenging with Nepali text because of Unicode complexities and different ways to write the same character.

// Token types for Sakshyar
typedef enum {
    TOKEN_FUNCTION,     // फंक्शन
    TOKEN_IF,          // यदि
    TOKEN_ELSE,        // अन्यथा
    TOKEN_WHILE,       // जबसम्म
    TOKEN_RETURN,      // फर्काउ
    TOKEN_PRINT,       // प्रिन्ट
    TOKEN_NUMBER,      // ५, ३, etc.
    TOKEN_IDENTIFIER,  // variable names
    TOKEN_PLUS,        // +
    TOKEN_MINUS,       // -
    TOKEN_MULTIPLY,    // *
    TOKEN_DIVIDE,      // /
    TOKEN_LPAREN,      // (
    TOKEN_RPAREN,      // )
    TOKEN_LBRACE,      // {
    TOKEN_RBRACE,      // }
    TOKEN_SEMICOLON,   // ;
    TOKEN_EOF
} TokenType;

typedef struct {
    TokenType type;
    char* value;
    int line;
    int column;
} Token;

Unicode Challenge

Nepali Unicode characters can be represented in multiple ways. For example, “फ” can be written as a single character or as “फ” + “्” + combining characters. Our lexer needed to normalize these representations.

// Simplified lexer implementation
Token* nextToken(Lexer* lexer) {
    skipWhitespace(lexer);
    
    if (isAtEnd(lexer)) {
        return createToken(TOKEN_EOF, "");
    }
    
    char c = advance(lexer);
    
    // Handle Nepali keywords
    if (isNepaliLetter(c)) {
        return nepaliIdentifier(lexer);
    }
    
    // Handle numbers (both English and Nepali digits)
    if (isDigit(c) || isNepaliDigit(c)) {
        return number(lexer);
    }
    
    // Handle operators and punctuation
    switch (c) {
        case '+': return createToken(TOKEN_PLUS, "+");
        case '-': return createToken(TOKEN_MINUS, "-");
        case '*': return createToken(TOKEN_MULTIPLY, "*");
        case '/': return createToken(TOKEN_DIVIDE, "/");
        case '(': return createToken(TOKEN_LPAREN, "(");
        case ')': return createToken(TOKEN_RPAREN, ")");
        case '{': return createToken(TOKEN_LBRACE, "{");
        case '}': return createToken(TOKEN_RBRACE, "}");
        case ';': return createToken(TOKEN_SEMICOLON, ";");
    }
    
    // Error handling
    reportError(lexer, "Unexpected character");
    return NULL;
}

Phase 2: Syntax Analysis (Parsing)

The parser takes tokens and builds an Abstract Syntax Tree (AST). I used a recursive descent parser because it’s easier to understand and debug - crucial when you’re building something from scratch.

// AST Node types
typedef enum {
    NODE_PROGRAM,
    NODE_FUNCTION,
    NODE_IF_STATEMENT,
    NODE_WHILE_LOOP,
    NODE_RETURN_STATEMENT,
    NODE_PRINT_STATEMENT,
    NODE_BINARY_OP,
    NODE_UNARY_OP,
    NODE_FUNCTION_CALL,
    NODE_IDENTIFIER,
    NODE_NUMBER
} NodeType;

typedef struct ASTNode {
    NodeType type;
    union {
        struct {
            char* name;
            struct ASTNode** parameters;
            int paramCount;
            struct ASTNode* body;
        } function;
        
        struct {
            struct ASTNode* condition;
            struct ASTNode* thenBranch;
            struct ASTNode* elseBranch;
        } ifStmt;
        
        struct {
            char* operator;
            struct ASTNode* left;
            struct ASTNode* right;
        } binaryOp;
        
        struct {
            char* name;
            struct ASTNode** arguments;
            int argCount;
        } functionCall;
        
        char* identifier;
        int number;
    };
} ASTNode;

Pro Tip: Grammar First

Before writing parser code, define your language grammar in BNF (Backus-Naur Form). It saves hours of debugging later and helps you think through edge cases.

// Sakshyar Grammar (simplified)
program        → function* EOF
function       → "फंक्शन" IDENTIFIER "(" parameters? ")" "{" statement* "}"
parameters     → IDENTIFIER ("," IDENTIFIER)*
statement      → ifStatement | whileStatement | returnStatement | 
                 printStatement | expressionStatement
ifStatement    → "यदि" "(" expression ")" "{" statement* "}" 
                 ("अन्यथा" "{" statement* "}")?
whileStatement → "जबसम्म" "(" expression ")" "{" statement* "}"
returnStatement→ "फर्काउ" expression ";"
printStatement → "प्रिन्ट" "(" expression ")" ";"
expression     → equality
equality       → comparison (("==" | "!=") comparison)*
comparison     → term ((">" | ">=" | "<" | "<=") term)*
term           → factor (("+" | "-") factor)*
factor         → unary (("*" | "/") unary)*
unary          → ("!" | "-") unary | primary
primary        → NUMBER | IDENTIFIER | "(" expression ")" | functionCall

Recursive Descent Parser Implementation

// Parse function declaration
ASTNode* parseFunction(Parser* parser) {
    consume(parser, TOKEN_FUNCTION, "Expected 'फंक्शन'");
    
    Token* name = consume(parser, TOKEN_IDENTIFIER, "Expected function name");
    
    consume(parser, TOKEN_LPAREN, "Expected '(' after function name");
    
    // Parse parameters
    ASTNode** parameters = NULL;
    int paramCount = 0;
    
    if (!check(parser, TOKEN_RPAREN)) {
        do {
            Token* param = consume(parser, TOKEN_IDENTIFIER, "Expected parameter name");
            // Add parameter to list
            parameters = realloc(parameters, (paramCount + 1) * sizeof(ASTNode*));
            parameters[paramCount] = createIdentifierNode(param->value);
            paramCount++;
        } while (match(parser, TOKEN_COMMA));
    }
    
    consume(parser, TOKEN_RPAREN, "Expected ')' after parameters");
    consume(parser, TOKEN_LBRACE, "Expected '{' before function body");
    
    // Parse function body
    ASTNode* body = parseBlock(parser);
    
    consume(parser, TOKEN_RBRACE, "Expected '}' after function body");
    
    return createFunctionNode(name->value, parameters, paramCount, body);
}

// Parse if statement
ASTNode* parseIfStatement(Parser* parser) {
    consume(parser, TOKEN_IF, "Expected 'यदि'");
    consume(parser, TOKEN_LPAREN, "Expected '(' after 'यदि'");
    
    ASTNode* condition = parseExpression(parser);
    
    consume(parser, TOKEN_RPAREN, "Expected ')' after if condition");
    consume(parser, TOKEN_LBRACE, "Expected '{' before if body");
    
    ASTNode* thenBranch = parseBlock(parser);
    
    consume(parser, TOKEN_RBRACE, "Expected '}' after if body");
    
    ASTNode* elseBranch = NULL;
    if (match(parser, TOKEN_ELSE)) {
        consume(parser, TOKEN_LBRACE, "Expected '{' after 'अन्यथा'");
        elseBranch = parseBlock(parser);
        consume(parser, TOKEN_RBRACE, "Expected '}' after else body");
    }
    
    return createIfNode(condition, thenBranch, elseBranch);
}

Phase 3: Semantic Analysis

This phase checks if the program makes sense. We verify that variables are declared before use, function calls match function definitions, and types are compatible.

// Symbol table for tracking variables and functions
typedef struct Symbol {
    char* name;
    SymbolType type;  // VARIABLE, FUNCTION, PARAMETER
    DataType dataType; // INT, STRING, etc.
    int scope;
    struct Symbol* next;
} Symbol;

typedef struct SymbolTable {
    Symbol* symbols;
    int currentScope;
} SymbolTable;

// Semantic analysis
void analyzeNode(ASTNode* node, SymbolTable* symTable) {
    switch (node->type) {
        case NODE_FUNCTION:
            // Add function to symbol table
            addSymbol(symTable, node->function.name, SYMBOL_FUNCTION, TYPE_FUNCTION);
            
            // Enter new scope for function body
            enterScope(symTable);
            
            // Add parameters to symbol table
            for (int i = 0; i < node->function.paramCount; i++) {
                addSymbol(symTable, node->function.parameters[i]->identifier, 
                         SYMBOL_PARAMETER, TYPE_INT);
            }
            
            // Analyze function body
            analyzeNode(node->function.body, symTable);
            
            // Exit function scope
            exitScope(symTable);
            break;
            
        case NODE_IDENTIFIER:
            // Check if variable is declared
            Symbol* symbol = lookupSymbol(symTable, node->identifier);
            if (!symbol) {
                reportError("Undefined variable: %s", node->identifier);
            }
            break;
            
        case NODE_FUNCTION_CALL:
            // Check if function exists
            Symbol* func = lookupSymbol(symTable, node->functionCall.name);
            if (!func || func->type != SYMBOL_FUNCTION) {
                reportError("Undefined function: %s", node->functionCall.name);
            }
            
            // Analyze arguments
            for (int i = 0; i < node->functionCall.argCount; i++) {
                analyzeNode(node->functionCall.arguments[i], symTable);
            }
            break;
    }
}

Phase 4: Code Generation

The final phase converts our AST into target code. For Sakshyar, I chose to generate C code because it’s portable and efficient. Plus, we could leverage existing C compilers for optimization.

Why C as Target?

Generating C code instead of assembly gave us portability across platforms and access to mature C compiler optimizations. It’s like standing on the shoulders of giants.

// Code generator
void generateCode(ASTNode* node, FILE* output) {
    switch (node->type) {
        case NODE_PROGRAM:
            fprintf(output, "#include <stdio.h>\n");
            fprintf(output, "#include <stdlib.h>\n\n");
            
            // Generate all functions
            for (int i = 0; i < node->program.functionCount; i++) {
                generateCode(node->program.functions[i], output);
            }
            break;
            
        case NODE_FUNCTION:
            // Generate function signature
            fprintf(output, "int %s(", translateIdentifier(node->function.name));
            
            // Generate parameters
            for (int i = 0; i < node->function.paramCount; i++) {
                if (i > 0) fprintf(output, ", ");
                fprintf(output, "int %s", 
                       translateIdentifier(node->function.parameters[i]->identifier));
            }
            
            fprintf(output, ") {\n");
            
            // Generate function body
            generateCode(node->function.body, output);
            
            fprintf(output, "}\n\n");
            break;
            
        case NODE_IF_STATEMENT:
            fprintf(output, "if (");
            generateCode(node->ifStmt.condition, output);
            fprintf(output, ") {\n");
            generateCode(node->ifStmt.thenBranch, output);
            fprintf(output, "}\n");
            
            if (node->ifStmt.elseBranch) {
                fprintf(output, "else {\n");
                generateCode(node->ifStmt.elseBranch, output);
                fprintf(output, "}\n");
            }
            break;
            
        case NODE_PRINT_STATEMENT:
            fprintf(output, "printf(\"%%d\\n\", ");
            generateCode(node->printStmt.expression, output);
            fprintf(output, ");\n");
            break;
            
        case NODE_BINARY_OP:
            fprintf(output, "(");
            generateCode(node->binaryOp.left, output);
            fprintf(output, " %s ", node->binaryOp.operator);
            generateCode(node->binaryOp.right, output);
            fprintf(output, ")");
            break;
            
        case NODE_IDENTIFIER:
            fprintf(output, "%s", translateIdentifier(node->identifier));
            break;
            
        case NODE_NUMBER:
            fprintf(output, "%d", convertNepaliNumber(node->number));
            break;
    }
}

The Challenges and Solutions

Unicode Normalization

Nepali text can be written in multiple Unicode representations. “फ” could be one character or multiple combining characters.

Solution: Implemented Unicode normalization in the lexer to ensure consistent token recognition.

Number System Conversion

Supporting both Nepali numerals (१, २, ३) and English numerals (1, 2, 3) in the same program.

Solution: Created conversion functions that handle both number systems seamlessly.

Identifier Translation

Variable names in Nepali needed to be converted to valid C identifiers.

Solution: Implemented a transliteration system that converts Nepali identifiers to ASCII equivalents.

Building and Testing

The build process was straightforward: compile the Sakshyar compiler, use it to translate Nepali source to C, then compile the C code with GCC.

# Build the Sakshyar compiler
gcc -o sakshyar compiler/*.c -lunistring

# Compile a Nepali program
./sakshyar program.sak > program.c
gcc -o program program.c

# Run the program
./program

Example Program

Sakshyar Code:

फंक्शन फ्याक्टोरियल(न) {
    यदि (न <= १) भने {
        फर्काउ १;
    } अन्यथा {
        फर्काउ न * फ्याक्टोरियल(न - १);
    }
}

मुख्य() {
    संख्या = ५;
    परिणाम = फ्याक्टोरियल(संख्या);
    प्रिन्ट(परिणाम);
}

Generated C Code:

#include <stdio.h>
#include <stdlib.h>

int factorial(int n) {
    if (n <= 1) {
        return 1;
    } else {
        return n * factorial(n - 1);
    }
}

int main() {
    int sankhya = 5;
    int parinam = factorial(sankhya);
    printf("%d\n", parinam);
    return 0;
}

Performance and Optimization

Since we generate C code, we get all the optimizations that modern C compilers provide. The performance overhead is minimal - just the translation time from Nepali to C.

Performance Metrics

  • Compilation Speed: ~50ms for 100 lines of Sakshyar code
  • Runtime Performance: Identical to equivalent C code
  • Memory Usage: No runtime overhead
  • Binary Size: Same as compiled C program

Impact and Lessons Learned

The project was showcased at mnsa.cc and received positive feedback from the Nepali developer community. More importantly, it proved that programming languages don’t need to be in English to be powerful.

Key Learnings

  • Compiler design is 70% careful planning, 30% implementation
  • Unicode handling is harder than you think
  • Good error messages are as important as correct compilation
  • Testing with edge cases saves hours of debugging
  • Documentation matters - future you will thank present you

What’s Next?

While Sakshyar was a proof of concept, it opened possibilities for more sophisticated features:

Potential Enhancements

Modern Approaches

Building Your Own Compiler

Want to build your own programming language? Here’s my roadmap based on the Sakshyar experience:

The Step-by-Step Guide

  1. Define your language: What problems will it solve?
  2. Design the grammar: Write it in BNF notation
  3. Build a lexer: Start with simple tokens
  4. Implement the parser: Recursive descent is beginner-friendly
  5. Add semantic analysis: Catch errors early
  6. Choose your target: C, assembly, or bytecode
  7. Generate code: Start simple, optimize later
  8. Test extensively: Edge cases will break your compiler
  9. Add tooling: Debugger, formatter, etc.
  10. Build a community: Languages need users to thrive

Conclusion

Building Sakshyar taught me that compiler design is part computer science, part linguistics, and part art. It’s about understanding how humans think and translating that into something machines can execute.

The project proved that programming languages can and should be accessible to everyone, regardless of their native language. While English dominates the programming world, there’s no technical reason why it has to be that way.

Most importantly, building a compiler from scratch demystified how programming languages work. Every time I use Python, JavaScript, or C++ now, I understand what’s happening under the hood. That understanding has made me a better programmer and system designer.

References