Merge pull request 'document-packages' (#30) from document-packages into main

Reviewed-on: sashakoshka/fspl#30
Sasha Koshka 2024-02-10 23:52:17 +00:00
commit 1593ecef7b
8 changed files with 370 additions and 85 deletions

analyzer/README.md Normal file

@@ -0,0 +1,89 @@
# analyzer
## Responsibilities
- Define semantic tree type that contains entities
- Check the semantic correctness of an AST and arrange its entities into a
semantic tree
## Organization
The entry point for all logic defined in this package is the Tree type. On this
type, the Analyze() method is defined. This method checks the semantic
correctness of an AST, fills in semantic fields within its data structures, and
arranges them into the Tree.
Tree contains a scopeContextManager. The job of scopeContextManager is to manage
a stack of scopeContexts, which are each tied to a function or method that is
currently being analyzed. In turn, each scopeContext manages stacks of
entity.Scopes and entity.Loops. This allows for greedy/recursive analysis of
functions and methods.
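As a rough illustration of the stacking described above, the following Go
sketch models a manager that holds one context per function being analyzed,
where each context owns its own stack of scopes. Only the names
scopeContextManager and scopeContext come from this package. Every method name
is an assumption made for the example, and scopes are simplified to maps from
variable name to type name instead of entity.Scopes and entity.Loops.

```go
package main

import "fmt"

// scopeContext is a simplified stand-in for the analyzer's per-function
// context. Here a scope is just a map from variable name to type name.
type scopeContext struct {
	function string
	scopes   []map[string]string
}

func (c *scopeContext) pushScope() {
	c.scopes = append(c.scopes, map[string]string{})
}

func (c *scopeContext) popScope() {
	c.scopes = c.scopes[:len(c.scopes)-1]
}

func (c *scopeContext) declare(name, ty string) {
	c.scopes[len(c.scopes)-1][name] = ty
}

// lookup searches from the innermost scope outward.
func (c *scopeContext) lookup(name string) (string, bool) {
	for i := len(c.scopes) - 1; i >= 0; i-- {
		if ty, ok := c.scopes[i][name]; ok {
			return ty, true
		}
	}
	return "", false
}

// scopeContextManager keeps one context per function currently being analyzed.
type scopeContextManager struct {
	stack []*scopeContext
}

func (m *scopeContextManager) push(function string) *scopeContext {
	ctx := &scopeContext{function: function}
	m.stack = append(m.stack, ctx)
	return ctx
}

func (m *scopeContextManager) pop() {
	m.stack = m.stack[:len(m.stack)-1]
}

func (m *scopeContextManager) current() *scopeContext {
	return m.stack[len(m.stack)-1]
}

func main() {
	var manager scopeContextManager

	// Start analyzing "main" and declare a local variable.
	outer := manager.push("main")
	outer.pushScope()
	outer.declare("x", "Int")

	// A call inside "main" triggers analysis of "helper" midway through.
	inner := manager.push("helper")
	inner.pushScope()
	_, visible := manager.current().lookup("x")
	fmt.Println("x visible in helper:", visible)
	manager.pop()

	// Back in "main", its scopes are intact.
	ty, _ := manager.current().lookup("x")
	fmt.Println("x in main:", ty)
}
```

Because each function gets a fresh context, variables of the function that was
interrupted cannot leak into the function that is analyzed in the middle of it.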
## Operation
When the Analyze() method is called, several hidden fields in the Tree are
filled out. Tree.ensure() sets up data that can persist between analyses: it
initializes maps and merges the data in the builtinTypes map into Tree.Types.
After Tree.ensure() completes, Tree.assembleRawMaps() takes top-level entities
from the AST and organizes them into rawTypes, rawFunctions, and rawMethods. It
does this so that top-level entities can be indexed by name. While doing this, it
ensures that function and type names are unique, and method names are unique
within the type they are defined on.
Next, Tree.analyzeDeclarations() is called. This is the entry point for the
actual analysis logic. For each item in the raw top-level entity maps, it calls
a specific analysis routine, which is one of:
- Tree.analyzeTypedef()
- Tree.analyzeFunction()
- Tree.analyzeMethod()
These routines all have two crucial properties that make them very useful:
- They refer to top-level entities by name instead of by memory location
- If the entity has already been analyzed, they return that entity instead of
analyzing it again
Because of this, they are also used as accessors for top-level entities within
more specific analysis routines. For example, the routine Tree.analyzeCall()
will call Tree.analyzeFunction() in order to get information about the function
that is being called. If the function has not yet been analyzed, it is analyzed
(making use of scopeContextManager to push a new scopeContext), and other
routines (including Tree.analyzeDeclarations()) will not have to analyze it all
over again. After a top-level entity has been analyzed, these routines will
always return the same pointer to the one instance of the analyzed entity.
## Expression Analysis and Assignment
Since expressions make up the bulk of FSPL, expression analysis makes up the
bulk of the semantic analyzer. Whenever an expression needs to be analyzed,
Tree.analyzeExpression() is called. This activates a switch to call one of many
specialized analysis routines based on the expression entity's concrete type.
Much of expression analysis consists of the analyzer checking to see if the
result of one expression can be assigned to the input of another. To this end,
assignment rules are used. There are five different assignment modes:
- Strict: Structural equivalence, but named types are treated as opaque and are
not tested. This applies to the root of the type, and to types enclosed as
members, elements, etc. This is the assignment mode most often used.
- Weak: Like strict, but the root types specifically are compared as if they
were not named. analyzer.ReduceToBase() is used to accomplish this.
- Structural: Full structural equivalence, and named types are always reduced.
- Coerce: Data of the source type must be convert-able to the destination type.
This is used in value casts.
- Force: All assignment rules are ignored. This is only used in bit casts.
All expression analysis routines take as parameters the type that the result of
the expression is being assigned to, and the assignment mode. To determine
whether the assignment is allowed, they in turn (usually) call Tree.canAssign().
Tree.canAssign() is used to determine whether data of a source type can be
assigned to a destination type, given an assignment mode. However, it is not
called automatically by Tree.analyzeExpression() because:
- Determining the source type is sometimes non-trivial (see
Tree.analyzeOperation())
- Literals have their own very weak assignment rules, and are designed to be
assignable to a wide range of data types
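To make the difference between the modes concrete, here is a small Go sketch of
the rules listed above. It is a toy model and not the analyzer's
implementation: the typ, named, integer, and pointer types are hypothetical
stand-ins for the real type entities, canAssign is written as a free function
returning a bool, and the coerce rule is reduced to "both sides are integers"
purely for illustration.

```go
package main

import "fmt"

type strictness int

const (
	strict strictness = iota
	weak
	structural
	coerce
	force
)

// typ is a tiny stand-in for the analyzer's type entities.
type typ interface{}

type named struct {
	name string
	base typ
}
type integer struct{ width int }
type pointer struct{ elem typ }

// reduceToBase strips names from the root of a type only.
func reduceToBase(t typ) typ {
	for {
		n, ok := t.(named)
		if !ok {
			return t
		}
		t = n.base
	}
}

// equal compares two types structurally. With reduce set, named types are
// reduced at every level; otherwise they are opaque and match by name only.
func equal(a, b typ, reduce bool) bool {
	if reduce {
		a, b = reduceToBase(a), reduceToBase(b)
	}
	switch left := a.(type) {
	case named:
		right, ok := b.(named)
		return ok && left.name == right.name
	case integer:
		right, ok := b.(integer)
		return ok && left.width == right.width
	case pointer:
		right, ok := b.(pointer)
		return ok && equal(left.elem, right.elem, reduce)
	}
	return false
}

func canAssign(destination, source typ, mode strictness) bool {
	switch mode {
	case force:
		return true
	case coerce:
		// Simplification: only integer-to-integer conversion is modeled.
		_, dstOk := reduceToBase(destination).(integer)
		_, srcOk := reduceToBase(source).(integer)
		return dstOk && srcOk
	case structural:
		return equal(destination, source, true)
	case weak:
		// Only the roots are reduced; enclosed types stay opaque.
		return equal(reduceToBase(destination), reduceToBase(source), false)
	default:
		return equal(destination, source, false)
	}
}

func main() {
	intType := named{"Int", integer{64}}
	indexType := named{"Index", integer{64}}
	fmt.Println("strict:    ", canAssign(intType, indexType, strict))
	fmt.Println("weak:      ", canAssign(intType, indexType, weak))
	fmt.Println("weak ptr:  ", canAssign(pointer{intType}, pointer{indexType}, weak))
	fmt.Println("structural:", canAssign(pointer{intType}, pointer{indexType}, structural))
}
```

The pointer case shows why weak and structural are separate modes: weak only
looks through the name at the root, so differently named element types still
fail to match.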

@@ -7,16 +7,20 @@ import "git.tebibyte.media/sashakoshka/fspl/entity"
 import "git.tebibyte.media/sashakoshka/fspl/integer"
 type strictness int; const (
-	// name equivalence
+	// Structural equivalence, but named types are treated as opaque and are
+	// not tested. This applies to the root of the type, and to types
+	// enclosed as members, elements, etc. This is the assignment mode most
+	// often used.
 	strict strictness = iota
-	// structural equivalence up until the first base type, then name
-	// equivalence applies to the parts of the type
+	// Like strict, but the root types specifically are compared as if they
+	// were not named. analyzer.ReduceToBase() is used to accomplish this.
 	weak
-	// structural equivalence
+	// Full structural equivalence, and named types are always reduced.
 	structural
-	// allow if values can be converted
+	// Data of the source type must be convert-able to the destination type.
+	// This is used in value casts.
 	coerce
-	// assignment rules are completely ignored and everything is accepted
+	// All assignment rules are ignored. This is only used in bit casts.
 	force
 )

@@ -145,7 +145,7 @@ referring to usually the name of a function. The result of a call may be assigne
 any type matching the function's return type. Since it contains inherent type
 information, it may be directly assigned to an interface.
 ### Method call
-Method call calls upon the method of the variable before the dot that is
+Method call calls upon the method (of the expression before the dot) that is
 specified by the first argument, passing the rest of the arguments to the
 method. The first argument must be a method name. The result of a call may be
 assigned to any type matching the method's return type. Since it contains
@@ -258,12 +258,15 @@ does not return anything, the return statement does not accept a value. In all
 cases, return statements have no value and may not be assigned to anything.
 ### Assignment
 Assignment allows assigning the result of one expression to one or more location
-expressions. The assignment statement itself has no value and may not be
+expressions. The assignment expression itself has no value and may not be
 assigned to anything.
 # Syntax entities
-Below is a rough syntax description of the language.
+Below is a rough syntax description of the language. Note that `<assignment>`
+is right-associative, and `<memberAccess>` and `<methodCall>` are
+left-associative. I invite you to torture yourself by attempting to implement
+this without hand-writing a parser.
 ```
 <file> -> (<typedef> | <function> | <method>)*
@@ -281,8 +284,8 @@ Below is a rough syntax description of the language.
 <pointerType> -> "*" <type>
 <sliceType> -> "*" ":" <type>
 <arrayType> -> <intLiteral> ":" <type>
-<structType> -> "(" <declaration>* ")"
-<interfaceType> -> "(" <signature> ")"
+<structType> -> "(" "." <declaration>* ")"
+<interfaceType> -> "(" "~" <signature>* ")"
 <expression> -> <intLiteral>
 | <floatLiteral>
@@ -302,25 +305,26 @@ Below is a rough syntax description of the language.
 | <operation>
 | <block>
 | <memberAccess>
+| <methodCall>
 | <ifelse>
 | <loop>
 | <break>
 | <return>
-<statement> -> <expression> | <assignment>
+| <assignment>
 <variable> -> <identifier>
 <declaration> -> <identifier> ":" <type>
 <call> -> "[" <expression>+ "]"
 <subscript> -> "[" "." <expression> <expression> "]"
-<slice> -> "[" "\" <expression> <expression>? ":" <expression>? "]"
+<slice> -> "[" "\" <expression> <expression>? "/" <expression>? "]"
 <length> -> "[" "#" <expression> "]"
 <dereference> -> "[" "." <expression> "]"
 <reference> -> "[" "@" <expression> "]"
 <valueCast> -> "[" "~" <type> <expression> "]"
 <bitCast> -> "[" "~~" <type> <expression> "]"
 <operation> -> "[" <operator> <expression>* "]"
-<block> -> "{" <statement>* "}"
-<memberAccess> -> <variable> "." <identifier>
-<methodAccess> -> <variable> "." <call>
+<block> -> "{" <expression>* "}"
+<memberAccess> -> <expression> "." <identifier>
+<methodCall> -> <expression> "." <call>
 <ifelse> -> "if" <expression>
 "then" <expression>
 ["else" <expression>]
@@ -336,7 +340,7 @@ Below is a rough syntax description of the language.
 <floatLiteral> -> /-?[0-9]*\.[0-9]+/
 <stringLiteral> -> /'.*'/
 <arrayLiteral> -> "(*" <expression>* ")"
-<structLiteral> -> "(" <member>* ")"
+<structLiteral> -> "(." <member>* ")"
 <booleanLiteral> -> "true" | "false"
 <member> -> <identifier> ":" <expression>

generator/README.md Normal file

@@ -0,0 +1,86 @@
# generator
## Responsibilities
Given a compilation target, turn a well-formed FSPL semantic tree into an LLVM
IR module tree.
## Organization
Generator defines the Target type, which contains information about the system
that the program is being compiled for. The native sub-package uses Go's
conditional compilation directives to provide a default Target that matches the
system the compiler has been natively built for.
The entry point for all logic defined in this package is Target.Generate(). This
method creates a new generator, and uses it to recursively generate and return an
LLVM module. The details of the generator are hidden from other packages, and
instances of it only last for the duration of Target.Generate().
The generator contains a stack of blockManagers, which plays a similar role to
analyzer.scopeContextManager, except that the stack of blockManagers is managed
directly by the generator, which contains appropriate methods for
pushing/popping them.
Like the analyzer, the generator greedily generates code, and one function may
be generated in the middle of the generation process of another function. Thus,
each blockManager is tied to a specific LLVM function, and is in charge of
variables/stack allocations and, to a degree, control flow flattening
(specifically loops). It also embeds the currently active block, allowing
generator routines to call its methods to add new instructions to the current
block, and to switch between different blocks when necessary.
## Operation
When Target.Generate() is called, a new generator is created. It is given the
semantic tree to generate, as well as a copy of the Target. All data structure
initialization within the generator happens at this point.
Then, the generate() method on the newly created generator is called. This is
the entry point for the actual generation logic. This routine consists of two
phases:
- Function generation
- Method generation
You'll notice that there is no step for type generation. This is because types
are generated on-demand in order to reduce IR clutter.
## Expression Generation
Since expressions make up the bulk of FSPL, expression generation makes up the
bulk of the code generator. The generator is able to produce expressions in one
of three modes:
- Location: The generator will return an IR register that contains a pointer to
the result of the expression.
- Value: The generator will return an IR register that directly contains the
result of the expression.
- Any: The generator will decide which of these two options is best for the
specific expression, and will let the caller know which was chosen, in case it
cares. Some expressions are better suited to returning a pointer, such as
array subscripting or member access. Other expressions are better suited to
returning a value, such as arithmetic operators and function calls.
It is important to note that generating a Value expression may return a pointer,
because *FSPL pointers are first-class values*. The distinction between location
and value generation modes is purely to do with LLVM. It is similar to the
concept of location expressions within the analyzer, but not 100% identical all
of the time.
Whenever an expression needs to be generated, one of the following routines is
called:
- generator.generateExpression()
- generator.generateAny()
- generator.generateVal()
- generator.generateLoc()
The generator.generateExpression() routine takes in a mode value and depending
on it, calls one of the other more specific routines. Each of these routines, in
turn, calls a more specialized generation routine depending on the specific
expression.
If a value is requested for an expression that only defines a location (or vice
versa), generator.generateVal() or generator.generateLoc() will automatically
perform the conversion.
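The relationship between the three modes and the automatic conversion can be
sketched in Go like this. The sketch is an assumption-laden toy: it emits
LLVM-flavored text lines instead of using a real LLVM API, the expression type
only covers variables and integer literals, and the signatures of
generateExpression(), generateAny(), generateVal(), and generateLoc() are
guesses made for illustration.

```go
package main

import "fmt"

type resultMode int

const (
	resultModeAny resultMode = iota
	resultModeVal
	resultModeLoc
)

// expression is a toy expression: a variable reference or an integer literal.
type expression struct {
	variable string // non-empty for a variable reference
	literal  int
}

// generator collects pseudo-IR lines; it does not use a real LLVM API.
type generator struct {
	lines []string
	next  int
}

// emit appends an instruction that produces a result and returns the
// register holding that result.
func (g *generator) emit(instruction string) string {
	register := fmt.Sprintf("%%%d", g.next)
	g.next++
	g.lines = append(g.lines, register+" = "+instruction)
	return register
}

// generateAny picks whichever form is natural for the expression and reports
// which one it chose.
func (g *generator) generateAny(e expression) (register string, isLocation bool) {
	if e.variable != "" {
		// A variable lives in a stack slot, so a pointer is the natural result.
		return "%" + e.variable, true
	}
	return fmt.Sprint(e.literal), false
}

// generateVal converts a location into a value by loading from it.
func (g *generator) generateVal(e expression) string {
	register, isLocation := g.generateAny(e)
	if isLocation {
		return g.emit("load i64, ptr " + register)
	}
	return register
}

// generateLoc converts a value into a location by spilling it to the stack.
func (g *generator) generateLoc(e expression) string {
	register, isLocation := g.generateAny(e)
	if isLocation {
		return register
	}
	slot := g.emit("alloca i64")
	g.lines = append(g.lines, "store i64 "+register+", ptr "+slot)
	return slot
}

func (g *generator) generateExpression(e expression, mode resultMode) (string, bool) {
	switch mode {
	case resultModeVal:
		return g.generateVal(e), false
	case resultModeLoc:
		return g.generateLoc(e), true
	default:
		return g.generateAny(e)
	}
}

func main() {
	g := &generator{}
	g.generateExpression(expression{variable: "x"}, resultModeVal)
	g.generateExpression(expression{literal: 5}, resultModeLoc)
	for _, line := range g.lines {
		fmt.Println(line)
	}
}
```

In this model the Any mode never emits extra instructions, which is the reason
a caller that does not care about the form would prefer it.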

lexer/README.md Normal file

@@ -0,0 +1,42 @@
# lexer
## Responsibilities
- Define the token type and token kinds
- Turn streams of data into streams of tokens
## Organization
The lexer is split into its interface and implementation:
- Lexer: public facing lexer interface
- fsplLexer: private implementation of Lexer, with public constructors
The lexer is bound to a data stream at the time of creation, and its Next()
method may be called to read and return the next token from the stream.
## Operation
fsplLexer carries state information about what rune from the data stream is
currently being processed. This must always be filled out as long as there is
still data in the stream to read from. All lexer routines start off by using
this rune, and end by advancing to the next rune for the next routine to use.
The lexer follows this general flow:
1. Upon creation, grab the first rune to initialize the lexer state
2. When Next() is called:
   1. Create a new token
   2. Set the token's position
   3. Switch off of the current rune to set the token's kind and invoke
      specific lexing behavior
   4. Expand the token's position to cover the full range
When an EOF is detected, the lexer is marked as spent (eof: true) and will only
return EOF tokens. The lexer will only return an error alongside an EOF token if
the EOF was unexpected.
The lexer also keeps track of its current position in order to embed it into
tokens, and to print errors. It is important that the lowest level operation
used to advance the lexer's position is fsplLexer.nextRune(), as it contains
logic for keeping the position correct and maintaining the current lexer state.

parser/README.md Normal file

@@ -0,0 +1,128 @@
# parser
## Responsibilities
- Define syntax tree type that contains entities
- Turn streams of tokens into abstract syntax tree entities
## Organization
The entry point for all logic defined in this package is the Tree type. On this
type, the Parse() method is defined. This method creates a new parser, and uses
it to parse a stream of tokens into the tree. The details of the parser are
hidden from other packages, and instances of it only last for the duration of
Tree.Parse().
## Operation
The parser holds a pointer to the Tree that created it, as well as the lexer
that was passed to it. Its parse() method attempts to consume all tokens
produced by the lexer, parsing them into syntax entities which it places into
the tree.
parser.parse() parses top level entities, which include functions, methods, and
typedefs. For each top-level entity, the parser will call a specialized parsing
routine to parse that entity depending on the current token's kind and value.
These routines in turn call other routines, which call other routines, etc.
All parsing routines follow this general flow:
- Start with the token already present in Parser.token. Do not get the
token after it.
- Use Parser.expect(), Parser.expectValue(), etc. to test whether the token
is a valid start for the entity
- If starting by calling another parsing method, trust that method to do
this instead.
- When getting new tokens, use Parser.expectNext(),
Parser.expectNextDesc(), etc. Only use Parser.next() when getting a token
*right before* calling another parsing method, or at the *very end* of
the current method.
- To terminate the method, get the next token and do nothing with it.
- If terminating by calling another parsing method, trust that method to do
this instead.
Remember that parsing routines always start with the current token, and end by
getting a trailing token for the next method to start with. This makes it
possible to reliably switch between parsing methods depending on the type or
value of a token.
The parser must never backtrack or look ahead, but it may revise previous
data it has output upon receiving a new token that comes directly after the
last token of said previous data. For example:
- X in XYZ may not be converted to A once the parser has seen Z, but
- X in XYZ may be converted to A once the parser has seen Y.
This disallows complex and ambiguous syntax, but should allow things such as
the very occasional infix operator (like . and =).
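The conventions above can be illustrated with a small Go sketch. It is a
hypothetical miniature and not the package's parser: the token kinds, the
string results, and the routine names parseType and parseVariableOrDeclaration
are assumptions of the example, and tokens come from a slice instead of a
lexer. It shows a routine starting on the current token, ending on a trailing
token, and revising a variable into a declaration once the token directly
after it turns out to be a colon.

```go
package main

import "fmt"

type kind int

const (
	EOF kind = iota
	Ident
	Colon
	Star
)

type token struct {
	kind  kind
	value string
}

type parser struct {
	tokens []token
	pos    int
	token  token // the current token
}

func newParser(tokens ...token) *parser {
	p := &parser{tokens: tokens, pos: -1}
	p.next()
	return p
}

// next advances to the following token, or to EOF when none are left.
func (p *parser) next() {
	p.pos++
	if p.pos < len(p.tokens) {
		p.token = p.tokens[p.pos]
	} else {
		p.token = token{kind: EOF}
	}
}

// expect tests the current token without consuming it.
func (p *parser) expect(kinds ...kind) error {
	for _, k := range kinds {
		if p.token.kind == k {
			return nil
		}
	}
	return fmt.Errorf("unexpected token %q", p.token.value)
}

// parseType starts on the first token of a type and ends on the token
// after it.
func (p *parser) parseType() (string, error) {
	if err := p.expect(Ident, Star); err != nil {
		return "", err
	}
	if p.token.kind == Star {
		p.next() // right before calling another parsing routine
		element, err := p.parseType()
		// Terminating via another routine: trust it to have fetched the
		// trailing token.
		return "*" + element, err
	}
	name := p.token.value
	p.next() // very end of the routine: fetch the trailing token
	return name, nil
}

// parseVariableOrDeclaration begins as a variable and is revised into a
// declaration if the token directly after the identifier is a colon.
func (p *parser) parseVariableOrDeclaration() (string, error) {
	if err := p.expect(Ident); err != nil {
		return "", err
	}
	name := p.token.value
	p.next()
	if p.token.kind != Colon {
		// The trailing token is already in place for the next routine.
		return "variable " + name, nil
	}
	p.next()
	ty, err := p.parseType()
	if err != nil {
		return "", err
	}
	return "declaration " + name + ":" + ty, nil
}

func main() {
	p := newParser(
		token{Ident, "x"}, token{Colon, ":"}, token{Star, "*"},
		token{Ident, "Int"}, token{Ident, "y"})
	for p.token.kind != EOF {
		entity, err := p.parseVariableOrDeclaration()
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Println(entity)
	}
}
```

Since every routine leaves the trailing token in place, the loop in main can
decide what to parse next by looking only at the current token.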
### Expression Parsing
Expression notation is the subset of FSPL that is used to describe
computations and data/control flow. The expression parsing routine is the most
important part of the FSPL parser, and also the most complex. For each
expression, the parser follows this decision tree to determine what to parse:
```
| +Ident =Variable
| | 'true' =LiteralBoolean
| | 'false' =LiteralBoolean
| | 'nil' =LiteralNil
| | 'if' =IfElse
| | 'loop' =Loop
| | +Colon =Declaration
| | +DoubleColon =Call
|
| +LParen X
| | +Star =LiteralArray
| | +Dot =LiteralStruct
|
| +LBracket | +Ident =Call
| | | 'break' =Break
| | | 'return' =Return
| |
| | +Dot +Expression =Dereference
| | | +Expression =Subscript
| | +Star =Operation
| |
| | +Symbol X
| | '\' =Slice
| | '#' =Length
| | '@' =Reference
| | '~' =ValueCast
| | '~~' =BitCast
| | OPERATOR =Operation
|
| +LBrace =Block
| +Int =LiteralInt
| +Float =LiteralFloat
| +String =LiteralString
```
Each branch of the decision tree is implemented as a routine with one or more
switch statements which call other routines, and each leaf is implemented as a
normal entity parsing routine.
Expressions that are only detected after more than one token has been
consumed have parsing routines with "-Core" appended to their names. This means
that they do not begin at the first token in the expression, but instead at the
point where there is no longer any ambiguity as to what they are.
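As a sketch of how a branch and a "-Core" routine might fit together, the Go
example below handles only the `+LBracket +Dot` path of the decision tree,
with identifiers as the only other expression. All routine names here
(parseExpressionRootLBracket, parseDereferenceOrSubscriptCore) and the string
results are hypothetical, chosen for the example instead of being taken from
the parser's source.

```go
package main

import "fmt"

type kind int

const (
	EOF kind = iota
	Ident
	LBracket
	RBracket
	Dot
)

type token struct {
	kind  kind
	value string
}

type parser struct {
	tokens []token
	pos    int
	token  token // the current token
}

func newParser(tokens ...token) *parser {
	p := &parser{tokens: tokens, pos: -1}
	p.next()
	return p
}

func (p *parser) next() {
	p.pos++
	if p.pos < len(p.tokens) {
		p.token = p.tokens[p.pos]
	} else {
		p.token = token{kind: EOF}
	}
}

func (p *parser) unexpected() error {
	return fmt.Errorf("unexpected token %q", p.token.value)
}

// parseExpression is the root of the decision tree.
func (p *parser) parseExpression() (string, error) {
	switch p.token.kind {
	case Ident:
		name := p.token.value
		p.next()
		return name, nil
	case LBracket:
		p.next()
		return p.parseExpressionRootLBracket()
	}
	return "", p.unexpected()
}

// parseExpressionRootLBracket is a branch: it starts after "[" and only
// decides which routine to call next.
func (p *parser) parseExpressionRootLBracket() (string, error) {
	switch p.token.kind {
	case Dot:
		p.next()
		return p.parseDereferenceOrSubscriptCore()
	}
	return "", p.unexpected()
}

// parseDereferenceOrSubscriptCore starts after "[" "." has already been
// consumed. One operand means dereference, two mean subscript.
func (p *parser) parseDereferenceOrSubscriptCore() (string, error) {
	first, err := p.parseExpression()
	if err != nil {
		return "", err
	}
	if p.token.kind == RBracket {
		p.next()
		return "dereference(" + first + ")", nil
	}
	second, err := p.parseExpression()
	if err != nil {
		return "", err
	}
	if p.token.kind != RBracket {
		return "", p.unexpected()
	}
	p.next()
	return "subscript(" + first + ", " + second + ")", nil
}

func main() {
	p := newParser(
		token{LBracket, "["}, token{Dot, "."}, token{Ident, "items"},
		token{Ident, "index"}, token{RBracket, "]"})
	fmt.Println(p.parseExpression())
}
```

The core routine never needs to look ahead: it distinguishes the two entities
purely from the trailing token left behind by the first operand.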
### Type Parsing
Type notation is the subset of FSPL that is used to describe data types. Though
it is not as complex as expression notation, it still needs a decision tree to
determine what type to parse:
```
| +Ident =TypeNamed
| +TypeIdent =TypeNamed
| +Int =TypeArray
|
| +LParen X
| | +Dot =TypeStruct
| | +Symbol '~' =TypeInterface
|
| +Star =TypePointer
| +Colon =TypeSlice
```

@@ -4,44 +4,6 @@ import "git.tebibyte.media/sashakoshka/fspl/lexer"
 import "git.tebibyte.media/sashakoshka/fspl/errors"
 import "git.tebibyte.media/sashakoshka/fspl/entity"
-// Expression decision tree flow:
-//
-// | +Ident =Variable
-// | | 'true' =LiteralBoolean
-// | | 'false' =LiteralBoolean
-// | | 'nil' =LiteralNil
-// | | 'if' =IfElse
-// | | 'loop' =Loop
-// | | +Colon =Declaration
-// | | +DoubleColon =Call
-// |
-// | +LParen X
-// | | +Star =LiteralArray
-// | | +Dot =LiteralStruct
-// |
-// | +LBracket | +Ident =Call
-// | | | 'break' =Break
-// | | | 'return' =Return
-// | |
-// | | +Dot +Expression =Dereference
-// | | | +Expression =Subscript
-// | | +Star =Operation
-// | |
-// | | +Symbol X
-// | | '\' =Slice
-// | | '#' =Length
-// | | '@' =Reference
-// | | '~' =ValueCast
-// | | '~~' =BitCast
-// | | OPERATOR =Operation
-// |
-// | +LBrace =Block
-// | +Int =LiteralInt
-// | +Float =LiteralFloat
-// | +String =LiteralString
-//
-// Entities with a star have yet to be implemented
 var descriptionExpression = "expression"
 var startTokensExpression = []lexer.TokenKind {
 lexer.Ident,

@@ -5,36 +5,6 @@ import "fmt"
 import "git.tebibyte.media/sashakoshka/fspl/lexer"
 import "git.tebibyte.media/sashakoshka/fspl/errors"
-// When writing a parsing method on Parser, follow this flow:
-// - Start with the token already present in Parser.token. Do not get the
-// token after it.
-// - Use Parser.expect(), Parser.expectValue(), etc. to test whether the token
-// is a valid start for the entity
-// - If starting by calling another parsing method, trust that method to do
-// this instead.
-// - When getting new tokens, use Parser.expectNext(),
-// Parser.expectNextDesc(), etc. Only use Parser.next() when getting a token
-// *right before* calling another parsing method, or at the *very end* of
-// the current method.
-// - To terminate the method, get the next token and do nothing with it.
-// - If terminating by calling another parsing method, trust that method to do
-// this instead.
-//
-// Remember that parsing methods always start with the current token, and end by
-// getting a trailing token for the next method to start with. This makes it
-// possible to reliably switch between parsing methods depending on the type or
-// value of a token.
-//
-// The parser must never backtrack or look ahead, but it may revise previous
-// data it has output upon receiving a new token that comes directly after the
-// last token of said previous data. For example:
-//
-// X in XYZ may not be converted to A once the parser has seen Z, but
-// X in XYZ may be converted to A once the parser has seen Y.
-//
-// This disallows complex and ambiguous syntax, but should allow things such as
-// the very occasional infix operator (like . and =)
 // Parser parses tokens from a lexer into syntax entities, which it places into
 // a tree.
 type Parser struct {