Design a Compiler - Building Nepal's First Nepali Programming Language
The journey of creating Sakshyar, Nepal's first programming language in Nepali, from tokenization to code generation. Because programming shouldn't be limited by language barriers.
During my third year at IOE Pashchimanchal Campus, I had what some might call a crazy idea: What if we could program in Nepali? Not just comments or variable names, but actual syntax, keywords, and control structures. This led to the creation of Sakshyar - Nepal’s first programming language where you could write code like “यदि (x > 5) भने” instead of “if (x > 5) then”.
The “Why” Behind the Madness
The inspiration came from watching my juniors struggle with English syntax while learning programming. They understood the logic perfectly but stumbled over keywords like “while”, “function”, and “return”. I thought: why should language be a barrier to logical thinking?
The Vision
Create a programming language that uses Nepali syntax and keywords, making programming more accessible to Nepali speakers while maintaining the power and flexibility of modern languages.
Compiler Architecture: The Big Picture
Before diving into code, let’s understand what we’re building. A compiler is essentially a translator that converts high-level code into machine-executable instructions. Here’s how Sakshyar’s compiler works:
Source Code (Nepali) → Lexer → Parser → AST → Code Generator → Target Code
Input Example (Sakshyar)
फंक्शन जोड(क, ख) {
फर्काउ क + ख;
}
मुख्य() {
परिणाम = जोड(५, ३);
प्रिन्ट(परिणाम);
}
Generated C Code
int jod(int k, int kh) {
return k + kh;
}
int main() {
int parinam = jod(5, 3);
printf("%d", parinam);
return 0;
}
Phase 1: Lexical Analysis (Tokenization)
The lexer’s job is to break the source code into tokens. This was particularly challenging with Nepali text because of Unicode complexities and different ways to write the same character.
// Token types for Sakshyar
typedef enum {
TOKEN_FUNCTION, // फंक्शन
TOKEN_IF, // यदि
TOKEN_ELSE, // अन्यथा
TOKEN_WHILE, // जबसम्म
TOKEN_RETURN, // फर्काउ
TOKEN_PRINT, // प्रिन्ट
TOKEN_NUMBER, // ५, ३, etc.
TOKEN_IDENTIFIER, // variable names
TOKEN_PLUS, // +
TOKEN_MINUS, // -
TOKEN_MULTIPLY, // *
TOKEN_DIVIDE, // /
TOKEN_LPAREN, // (
TOKEN_RPAREN, // )
TOKEN_LBRACE, // {
TOKEN_RBRACE, // }
TOKEN_SEMICOLON, // ;
TOKEN_EOF
} TokenType;
typedef struct {
TokenType type;
char* value;
int line;
int column;
} Token;
Unicode Challenge
Nepali Unicode characters can be represented in multiple ways. For example, “फ” can be written as a single character or as “फ” + “्” + combining characters. Our lexer needed to normalize these representations.
// Simplified lexer implementation
Token* nextToken(Lexer* lexer) {
skipWhitespace(lexer);
if (isAtEnd(lexer)) {
return createToken(TOKEN_EOF, "");
}
char c = advance(lexer);
// Handle Nepali keywords
if (isNepaliLetter(c)) {
return nepaliIdentifier(lexer);
}
// Handle numbers (both English and Nepali digits)
if (isDigit(c) || isNepaliDigit(c)) {
return number(lexer);
}
// Handle operators and punctuation
switch (c) {
case '+': return createToken(TOKEN_PLUS, "+");
case '-': return createToken(TOKEN_MINUS, "-");
case '*': return createToken(TOKEN_MULTIPLY, "*");
case '/': return createToken(TOKEN_DIVIDE, "/");
case '(': return createToken(TOKEN_LPAREN, "(");
case ')': return createToken(TOKEN_RPAREN, ")");
case '{': return createToken(TOKEN_LBRACE, "{");
case '}': return createToken(TOKEN_RBRACE, "}");
case ';': return createToken(TOKEN_SEMICOLON, ";");
}
// Error handling
reportError(lexer, "Unexpected character");
return NULL;
}
Phase 2: Syntax Analysis (Parsing)
The parser takes tokens and builds an Abstract Syntax Tree (AST). I used a recursive descent parser because it’s easier to understand and debug - crucial when you’re building something from scratch.
// AST Node types
typedef enum {
NODE_PROGRAM,
NODE_FUNCTION,
NODE_IF_STATEMENT,
NODE_WHILE_LOOP,
NODE_RETURN_STATEMENT,
NODE_PRINT_STATEMENT,
NODE_BINARY_OP,
NODE_UNARY_OP,
NODE_FUNCTION_CALL,
NODE_IDENTIFIER,
NODE_NUMBER
} NodeType;
typedef struct ASTNode {
NodeType type;
union {
struct {
char* name;
struct ASTNode** parameters;
int paramCount;
struct ASTNode* body;
} function;
struct {
struct ASTNode* condition;
struct ASTNode* thenBranch;
struct ASTNode* elseBranch;
} ifStmt;
struct {
char* operator;
struct ASTNode* left;
struct ASTNode* right;
} binaryOp;
struct {
char* name;
struct ASTNode** arguments;
int argCount;
} functionCall;
char* identifier;
int number;
};
} ASTNode;
Pro Tip: Grammar First
Before writing parser code, define your language grammar in BNF (Backus-Naur Form). It saves hours of debugging later and helps you think through edge cases.
// Sakshyar Grammar (simplified)
program → function* EOF
function → "फंक्शन" IDENTIFIER "(" parameters? ")" "{" statement* "}"
parameters → IDENTIFIER ("," IDENTIFIER)*
statement → ifStatement | whileStatement | returnStatement |
printStatement | expressionStatement
ifStatement → "यदि" "(" expression ")" "{" statement* "}"
("अन्यथा" "{" statement* "}")?
whileStatement → "जबसम्म" "(" expression ")" "{" statement* "}"
returnStatement→ "फर्काउ" expression ";"
printStatement → "प्रिन्ट" "(" expression ")" ";"
expression → equality
equality → comparison (("==" | "!=") comparison)*
comparison → term ((">" | ">=" | "<" | "<=") term)*
term → factor (("+" | "-") factor)*
factor → unary (("*" | "/") unary)*
unary → ("!" | "-") unary | primary
primary → NUMBER | IDENTIFIER | "(" expression ")" | functionCall
Recursive Descent Parser Implementation
// Parse function declaration
ASTNode* parseFunction(Parser* parser) {
consume(parser, TOKEN_FUNCTION, "Expected 'फंक्शन'");
Token* name = consume(parser, TOKEN_IDENTIFIER, "Expected function name");
consume(parser, TOKEN_LPAREN, "Expected '(' after function name");
// Parse parameters
ASTNode** parameters = NULL;
int paramCount = 0;
if (!check(parser, TOKEN_RPAREN)) {
do {
Token* param = consume(parser, TOKEN_IDENTIFIER, "Expected parameter name");
// Add parameter to list
parameters = realloc(parameters, (paramCount + 1) * sizeof(ASTNode*));
parameters[paramCount] = createIdentifierNode(param->value);
paramCount++;
} while (match(parser, TOKEN_COMMA));
}
consume(parser, TOKEN_RPAREN, "Expected ')' after parameters");
consume(parser, TOKEN_LBRACE, "Expected '{' before function body");
// Parse function body
ASTNode* body = parseBlock(parser);
consume(parser, TOKEN_RBRACE, "Expected '}' after function body");
return createFunctionNode(name->value, parameters, paramCount, body);
}
// Parse if statement
ASTNode* parseIfStatement(Parser* parser) {
consume(parser, TOKEN_IF, "Expected 'यदि'");
consume(parser, TOKEN_LPAREN, "Expected '(' after 'यदि'");
ASTNode* condition = parseExpression(parser);
consume(parser, TOKEN_RPAREN, "Expected ')' after if condition");
consume(parser, TOKEN_LBRACE, "Expected '{' before if body");
ASTNode* thenBranch = parseBlock(parser);
consume(parser, TOKEN_RBRACE, "Expected '}' after if body");
ASTNode* elseBranch = NULL;
if (match(parser, TOKEN_ELSE)) {
consume(parser, TOKEN_LBRACE, "Expected '{' after 'अन्यथा'");
elseBranch = parseBlock(parser);
consume(parser, TOKEN_RBRACE, "Expected '}' after else body");
}
return createIfNode(condition, thenBranch, elseBranch);
}
Phase 3: Semantic Analysis
This phase checks if the program makes sense. We verify that variables are declared before use, function calls match function definitions, and types are compatible.
// Symbol table for tracking variables and functions
typedef struct Symbol {
char* name;
SymbolType type; // VARIABLE, FUNCTION, PARAMETER
DataType dataType; // INT, STRING, etc.
int scope;
struct Symbol* next;
} Symbol;
typedef struct SymbolTable {
Symbol* symbols;
int currentScope;
} SymbolTable;
// Semantic analysis
void analyzeNode(ASTNode* node, SymbolTable* symTable) {
switch (node->type) {
case NODE_FUNCTION:
// Add function to symbol table
addSymbol(symTable, node->function.name, SYMBOL_FUNCTION, TYPE_FUNCTION);
// Enter new scope for function body
enterScope(symTable);
// Add parameters to symbol table
for (int i = 0; i < node->function.paramCount; i++) {
addSymbol(symTable, node->function.parameters[i]->identifier,
SYMBOL_PARAMETER, TYPE_INT);
}
// Analyze function body
analyzeNode(node->function.body, symTable);
// Exit function scope
exitScope(symTable);
break;
case NODE_IDENTIFIER:
// Check if variable is declared
Symbol* symbol = lookupSymbol(symTable, node->identifier);
if (!symbol) {
reportError("Undefined variable: %s", node->identifier);
}
break;
case NODE_FUNCTION_CALL:
// Check if function exists
Symbol* func = lookupSymbol(symTable, node->functionCall.name);
if (!func || func->type != SYMBOL_FUNCTION) {
reportError("Undefined function: %s", node->functionCall.name);
}
// Analyze arguments
for (int i = 0; i < node->functionCall.argCount; i++) {
analyzeNode(node->functionCall.arguments[i], symTable);
}
break;
}
}
Phase 4: Code Generation
The final phase converts our AST into target code. For Sakshyar, I chose to generate C code because it’s portable and efficient. Plus, we could leverage existing C compilers for optimization.
Why C as Target?
Generating C code instead of assembly gave us portability across platforms and access to mature C compiler optimizations. It’s like standing on the shoulders of giants.
// Code generator
void generateCode(ASTNode* node, FILE* output) {
switch (node->type) {
case NODE_PROGRAM:
fprintf(output, "#include <stdio.h>\n");
fprintf(output, "#include <stdlib.h>\n\n");
// Generate all functions
for (int i = 0; i < node->program.functionCount; i++) {
generateCode(node->program.functions[i], output);
}
break;
case NODE_FUNCTION:
// Generate function signature
fprintf(output, "int %s(", translateIdentifier(node->function.name));
// Generate parameters
for (int i = 0; i < node->function.paramCount; i++) {
if (i > 0) fprintf(output, ", ");
fprintf(output, "int %s",
translateIdentifier(node->function.parameters[i]->identifier));
}
fprintf(output, ") {\n");
// Generate function body
generateCode(node->function.body, output);
fprintf(output, "}\n\n");
break;
case NODE_IF_STATEMENT:
fprintf(output, "if (");
generateCode(node->ifStmt.condition, output);
fprintf(output, ") {\n");
generateCode(node->ifStmt.thenBranch, output);
fprintf(output, "}\n");
if (node->ifStmt.elseBranch) {
fprintf(output, "else {\n");
generateCode(node->ifStmt.elseBranch, output);
fprintf(output, "}\n");
}
break;
case NODE_PRINT_STATEMENT:
fprintf(output, "printf(\"%%d\\n\", ");
generateCode(node->printStmt.expression, output);
fprintf(output, ");\n");
break;
case NODE_BINARY_OP:
fprintf(output, "(");
generateCode(node->binaryOp.left, output);
fprintf(output, " %s ", node->binaryOp.operator);
generateCode(node->binaryOp.right, output);
fprintf(output, ")");
break;
case NODE_IDENTIFIER:
fprintf(output, "%s", translateIdentifier(node->identifier));
break;
case NODE_NUMBER:
fprintf(output, "%d", convertNepaliNumber(node->number));
break;
}
}
The Challenges and Solutions
Unicode Normalization
Nepali text can be written in multiple Unicode representations. “फ” could be one character or multiple combining characters.
Solution: Implemented Unicode normalization in the lexer to ensure consistent token recognition.
Number System Conversion
Supporting both Nepali numerals (१, २, ३) and English numerals (1, 2, 3) in the same program.
Solution: Created conversion functions that handle both number systems seamlessly.
Identifier Translation
Variable names in Nepali needed to be converted to valid C identifiers.
Solution: Implemented a transliteration system that converts Nepali identifiers to ASCII equivalents.
Building and Testing
The build process was straightforward: compile the Sakshyar compiler, use it to translate Nepali source to C, then compile the C code with GCC.
# Build the Sakshyar compiler
gcc -o sakshyar compiler/*.c -lunistring
# Compile a Nepali program
./sakshyar program.sak > program.c
gcc -o program program.c
# Run the program
./program
Example Program
Sakshyar Code:
फंक्शन फ्याक्टोरियल(न) {
यदि (न <= १) भने {
फर्काउ १;
} अन्यथा {
फर्काउ न * फ्याक्टोरियल(न - १);
}
}
मुख्य() {
संख्या = ५;
परिणाम = फ्याक्टोरियल(संख्या);
प्रिन्ट(परिणाम);
}
Generated C Code:
#include <stdio.h>
#include <stdlib.h>
int factorial(int n) {
if (n <= 1) {
return 1;
} else {
return n * factorial(n - 1);
}
}
int main() {
int sankhya = 5;
int parinam = factorial(sankhya);
printf("%d\n", parinam);
return 0;
}
Performance and Optimization
Since we generate C code, we get all the optimizations that modern C compilers provide. The performance overhead is minimal - just the translation time from Nepali to C.
Performance Metrics
- Compilation Speed: ~50ms for 100 lines of Sakshyar code
- Runtime Performance: Identical to equivalent C code
- Memory Usage: No runtime overhead
- Binary Size: Same as compiled C program
Impact and Lessons Learned
The project was showcased at mnsa.cc and received positive feedback from the Nepali developer community. More importantly, it proved that programming languages don’t need to be in English to be powerful.
Key Learnings
- Compiler design is 70% careful planning, 30% implementation
- Unicode handling is harder than you think
- Good error messages are as important as correct compilation
- Testing with edge cases saves hours of debugging
- Documentation matters - future you will thank present you
What’s Next?
While Sakshyar was a proof of concept, it opened possibilities for more sophisticated features:
Potential Enhancements
- Object-oriented features
- Standard library in Nepali
- Better IDE support
- Debugging tools
- Package management
Modern Approaches
- LLVM backend for better optimization
- Language server protocol support
- WebAssembly compilation target
- JIT compilation
- Gradual typing system
Building Your Own Compiler
Want to build your own programming language? Here’s my roadmap based on the Sakshyar experience:
The Step-by-Step Guide
- Define your language: What problems will it solve?
- Design the grammar: Write it in BNF notation
- Build a lexer: Start with simple tokens
- Implement the parser: Recursive descent is beginner-friendly
- Add semantic analysis: Catch errors early
- Choose your target: C, assembly, or bytecode
- Generate code: Start simple, optimize later
- Test extensively: Edge cases will break your compiler
- Add tooling: Debugger, formatter, etc.
- Build a community: Languages need users to thrive
Conclusion
Building Sakshyar taught me that compiler design is part computer science, part linguistics, and part art. It’s about understanding how humans think and translating that into something machines can execute.
The project proved that programming languages can and should be accessible to everyone, regardless of their native language. While English dominates the programming world, there’s no technical reason why it has to be that way.
Most importantly, building a compiler from scratch demystified how programming languages work. Every time I use Python, JavaScript, or C++ now, I understand what’s happening under the hood. That understanding has made me a better programmer and system designer.
References
- Crafting Interpreters by Bob Nystrom - Excellent compiler design book
- LLVM Tutorial - Modern compiler backend
- Sakshyar Source Code - Full implementation
- Project Website - Demo and documentation