Merge pull request 'document-packages' (#30) from document-packages into main

Reviewed-on: sashakoshka/fspl#30
2024-02-10 23:52:17 +00:00 · 2024-02-10 23:52:17 +00:00 · 1593ecef7b
commit 1593ecef7b
parent 52da21b60d b0c5130858
8 changed files with 370 additions and 85 deletions
--- a/analyzer/README.md
+++ b/analyzer/README.md
@ -0,0 +1,89 @@
+# analyzer
+
+## Responsibilities
+
+- Define the semantic tree type that contains analyzed entities
+- Check the semantic correctness of abstract syntax trees and fill in their
+  semantic fields
+
+## Organization
+
+The entry point for all logic defined in this package is the Tree type. On this
+type, the Analyze() method is defined. This method checks the semantic
+correctness of an AST, fills in semantic fields within its data structures, and
+arranges them into the Tree.
+
+Tree contains a scopeContextManager. The job of scopeContextManager is to manage
+a stack of scopeContexts, which are each tied to a function or method that is
+currently being analyzed. In turn, each scopeContext manages stacks of
+entity.Scopes and entity.Loops. This allows for greedy/recursive analysis of
+functions and methods.
+
+## Operation
+
+When the Analyze() method is called, several hidden fields in the Tree are filled
+out. Tree.ensure() instantiates data that can persist between analyses, which
+consists of map initialization and merging the data in the builtinTypes map into
+Tree.Types.
+
+After Tree.ensure completes, Tree.assembleRawMaps() takes top-level entities
+from the AST and organizes them into rawTypes, rawFunctions, and rawMethods. It
+does this so that top-level entities can be indexed by name. While doing this, it
+ensures that function and type names are unique, and method names are unique
+within the type they are defined on.
+
+Next, Tree.analyzeDeclarations() is called. This is the entry point for the
+actual analysis logic. For each item in the raw top-level entity maps, it calls
+a specific analysis routine, which is one of:
+
+- Tree.analyzeTypedef()
+- Tree.analyzeFunction()
+- Tree.analyzeMethod()
+
+These routines all have two crucial properties that make them very useful:
+
+- They refer to top-level entities by name instead of by memory location
+- If the entity has already been analyzed, they return that entity instead of
+  analyzing it again
+
+Because of this, they are also used as accessors for top-level entities within
+more specific analysis routines. For example, the routine Tree.analyzeCall()
+will call Tree.analyzeFunction() in order to get information about the function
+that is being called. If the function has not yet been analyzed, it is analyzed
+(making use of scopeContextManager to push a new scopeContext), and other
+routines (including Tree.analyzeDeclarations()) will not have to analyze it all
+over again. After a top-level entity has been analyzed, these routines will
+always return the same pointer to the one instance of the analyzed entity.
+
+## Expression Analysis and Assignment
+
+Since expressions make up the bulk of FSPL, expression analysis makes up the
+bulk of the semantic analyzer. Whenever an expression needs to be analyzed,
+Tree.analyzeExpression() is called. This activates a switch to call one of many
+specialized analysis routines based on the expression entity's concrete type.
+
+Much of expression analysis consists of the analyzer checking to see if the
+result of one expression can be assigned to the input of another. To this end,
+assignment rules are used. There are five different assignment modes:
+
+- Strict: Structural equivalence, but named types are treated as opaque and are
+  not tested. This applies to the root of the type, and to types enclosed as
+  members, elements, etc. This is the assignment mode most often used.
+- Weak: Like strict, but the root types specifically are compared as if they
+  were not named. analyzer.ReduceToBase() is used to accomplish this.
+- Structural: Full structural equivalence, and named types are always reduced.
+- Coerce: Data of the source type must be convertible to the destination type.
+  This is used in value casts.
+- Force: All assignment rules are ignored. This is only used in bit casts.
+
+
+All expression analysis routines take in as a parameter the type that the result
+expression is being assigned to, and the assignment mode. To figure out whether
+or not they can be assigned, they in turn (usually) call Tree.canAssign().
+Tree.canAssign() is used to determine whether data of a source type can be
+assigned to a destination type, given an assignment mode. However, it is not
+called automatically by Tree.analyzeExpression() because:
+
+- Determining the source type is sometimes non-trivial (see
+  Tree.analyzeOperation())
+- Literals have their own very weak assignment rules, and are designed to be
+  assignable to a wide range of data types
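
To make the difference between the strict, weak, and structural modes more concrete, here is a rough sketch over a toy type model. Everything in it is an assumption made for illustration: the `typ`, `named`, `integer`, and `pointer` types, the lowercase `reduceToBase` helper, and the boolean return value are not taken from the analyzer, and the coerce mode is left out because its conversion rules are not described here.

```go
package main

import "fmt"

type strictness int

const (
	strict strictness = iota
	weak
	structural
	force
)

// Toy type model (hypothetical): a type is named, an integer, or a pointer.
type typ any

type named struct {
	name string
	base typ
}
type integer struct{ width int }
type pointer struct{ elem typ }

// reduceToBase strips every layer of naming from the root of a type.
func reduceToBase(t typ) typ {
	for {
		n, ok := t.(named)
		if !ok {
			return t
		}
		t = n.base
	}
}

// canAssign reports whether src may be assigned to dst under the given mode.
func canAssign(mode strictness, dst, src typ) bool {
	switch mode {
	case force:
		return true
	case weak:
		// Only the roots are reduced; enclosed types are compared strictly.
		dst, src = reduceToBase(dst), reduceToBase(src)
		mode = strict
	case structural:
		// Named types are reduced at every level, since mode stays structural.
		dst, src = reduceToBase(dst), reduceToBase(src)
	}
	switch d := dst.(type) {
	case named:
		// Only reachable in strict mode: named types are opaque.
		s, ok := src.(named)
		return ok && s.name == d.name
	case integer:
		s, ok := src.(integer)
		return ok && s.width == d.width
	case pointer:
		s, ok := src.(pointer)
		return ok && canAssign(mode, d.elem, s.elem)
	}
	return false
}

func main() {
	i32 := integer{width: 32}
	celsius := named{name: "Celsius", base: i32}
	kelvin := named{name: "Kelvin", base: i32}

	fmt.Println(canAssign(strict, celsius, kelvin)) // false
	fmt.Println(canAssign(weak, celsius, kelvin))   // true
	fmt.Println(canAssign(weak,
		pointer{elem: celsius}, pointer{elem: kelvin})) // false
	fmt.Println(canAssign(structural,
		pointer{elem: celsius}, pointer{elem: kelvin})) // true
}
```

The third and fourth calls show the distinction between weak and structural: weak only looks through names at the root, so differently named element types behind a pointer still do not match.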
--- a/analyzer/assignment.go
+++ b/analyzer/assignment.go
@ -7,16 +7,20 @@ import "git.tebibyte.media/sashakoshka/fspl/entity"
 import "git.tebibyte.media/sashakoshka/fspl/integer"

 type strictness int; const (
-	// name equivalence
+	// Structural equivalence, but named types are treated as opaque and are
+	// not tested. This applies to the root of the type, and to types
+	// enclosed as members, elements, etc. This is the assignment mode most
+	// often used.
 	strict strictness = iota
-	// structural equivalence up until the first base type, then name
-	// equivalence applies to the parts of the type
+	// Like strict, but the root types specifically are compared as if they
+	// were not named. analyzer.ReduceToBase() is used to accomplish this.
 	weak
-	// structural equivalence
+	// Full structural equivalence, and named types are always reduced.
 	structural
-	// allow if values can be converted
+	// Data of the source type must be convertible to the destination type.
+	// This is used in value casts.
 	coerce
-	// assignment rules are completely ignored and everything is accepted
+	// All assignment rules are ignored. This is only used in bit casts.
 	force
 )

--- a/design/spec.md
+++ b/design/spec.md
@ -145,7 +145,7 @@ referring to usually the name of a function. The result of a call may be assigne
 any type matching the function's return type. Since it contains inherent type
 information, it may be directly assigned to an interface.
 ### Method call
-Method call calls upon the method of the variable before the dot that is
+Method call calls upon the method (of the expression before the dot) that is
 specified by the first argument, passing the rest of the arguments to the
 method. The first argument must be a method name. The result of a call may be
 assigned to any type matching the method's return type. Since it contains
@ -258,12 +258,15 @@ does not return anything, the return statement does not accept a value. In all
 cases, return statements have no value and may not be assigned to anything.
 ### Assignment
 Assignment allows assigning the result of one expression to one or more location
-expressions. The assignment statement itself has no value and may not be
+expressions. The assignment expression itself has no value and may not be
 assigned to anything.

 # Syntax entities

-Below is a rough syntax description of the language.
+Below is a rough syntax description of the language. Note that `<assignment>`
+is right-associative, and `<memberAccess>` and `<methodCall>` are
+left-associative. I invite you to torture yourself by attempting to implement
+this without hand-writing a parser.

 ```
 <file>     -> (<typedef> | <function> | <method>)*
@ -281,8 +284,8 @@ Below is a rough syntax description of the language.
 <pointerType>   -> "*" <type>
 <sliceType>     -> "*" ":" <type>
 <arrayType>     -> <intLiteral> ":" <type>
-<structType>    -> "(" <declaration>* ")"
-<interfaceType> -> "(" <signature> ")"
+<structType>    -> "(" "." <declaration>* ")"
+<interfaceType> -> "(" "~" <signature>* ")"

 <expression> -> <intLiteral>
              | <floatLiteral>
@ -302,25 +305,26 @@ Below is a rough syntax description of the language.
              | <operation>
              | <block>
              | <memberAccess>
+              | <methodCall>
              | <ifelse>
              | <loop>
              | <break>
              | <return>
-<statement>    -> <expression> | <assignment>
+              | <assignment>
 <variable>     -> <identifier>
 <declaration>  -> <identifier> ":" <type>
 <call>         -> "[" <expression>+ "]"
 <subscript>    -> "[" "." <expression> <expression> "]"
-<slice>        -> "[" "\" <expression> <expression>? ":" <expression>? "]"
+<slice>        -> "[" "\" <expression> <expression>? "/" <expression>? "]"
 <length>       -> "[" "#" <expression> "]"
 <dereference>  -> "[" "." <expression> "]"
 <reference>    -> "[" "@" <expression> "]"
 <valueCast>    -> "[" "~"    <type> <expression> "]"
 <bitCast>      -> "[" "~~" <type> <expression> "]"
 <operation>    -> "[" <operator> <expression>* "]"
-<block>        -> "{" <statement>* "}"
-<memberAccess> -> <variable> "." <identifier>
-<methodAccess> -> <variable> "." <call>
+<block>        -> "{" <expression>* "}"
+<memberAccess> -> <expression> "." <identifier>
+<methodCall>   -> <expression> "." <call>
 <ifelse>       -> "if"   <expression>
                  "then" <expression>
                 ["else" <expression>]
@ -336,7 +340,7 @@ Below is a rough syntax description of the language.
 <floatLiteral>   -> /-?[0-9]*\.[0-9]+/
 <stringLiteral>  -> /'.*'/
 <arrayLiteral>   -> "(*" <expression>* ")"
-<structLiteral>  -> "(" <member>* ")"
+<structLiteral>  -> "(." <member>* ")"
 <booleanLiteral> -> "true" | "false"

 <member>         -> <identifier> ":" <expression>
--- a/generator/README.md
+++ b/generator/README.md
@ -0,0 +1,86 @@
+# generator
+
+## Responsibilities
+
+Given a compilation target, turn a well-formed FSPL semantic tree into an LLVM
+IR module tree.
+
+## Organization
+
+Generator defines the Target type, which contains information about the system
+that the program is being compiled for. The native sub-package uses Go's
+conditional compilation directives to provide a default Target that matches the
+system the compiler has been natively built for.
+
+The entry point for all logic defined in this package is Target.Generate(). This
+method creates a new generator, and uses it to recursively generate and return an
+LLVM module. The details of the generator are hidden from other packages, and
+instances of it only last for the duration of Target.Generate().
+
+The generator contains a stack of blockManagers, which plays a similar role to
+analyzer.scopeContextManager, except that the stack of blockManagers is managed
+directly by the generator, which contains appropriate methods for
+pushing/popping them.
+
+Like the analyzer, the generator greedily generates code, and one function may
+be generated in the middle of the generation process of another function. Thus,
+each blockManager is tied to a specific LLVM function, and is in charge of
+variables/stack allocations and, to a degree, control flow flattening
+(specifically loops). It also embeds the current active block, allowing for
+generator routines to call its methods to add new instructions to the current
+block, and switch between different blocks when necessary.
+
+## Operation
+
+When Target.Generate() is called, a new generator is created. It is given the
+semantic tree to generate, as well as a copy of the Target. All data structure
+initialization within the generator happens at this point.
+
+Then, the generate() method on the newly created generator is called. This is
+the entry point for the actual generation logic. This routine consists of
+two phases:
+
+- Function generation
+- Method generation
+
+You'll notice that there is no step for type generation. This is because types
+are generated on-demand in order to reduce IR clutter.
+
+## Expression Generation
+
+Since expressions make up the bulk of FSPL, expression generation makes up the
+bulk of the code generator. The generator is able to produce expressions in one
+of three modes:
+
+- Location: The generator will return an IR register that contains a pointer to
+  the result of the expression.
+- Value: The generator will return an IR register that directly contains the
+  result of the expression.
+- Any: The generator will decide which of these two options is best for the
+  specific expression, and will let the caller know which was chosen, in case it
+  cares. Some expressions are better suited to returning a pointer, such as
+  array subscripting or member access. Other expressions are better suited to
+  returning a value, such as arithmetic operators and function calls.
+
+It is important to note that generating a Value expression may return a pointer,
+because *FSPL pointers are first-class values*. The distinction between location
+and value generation modes is purely to do with LLVM. It is similar to the
+concept of location expressions within the analyzer, but not 100% identical all
+of the time.
+
+Whenever an expression needs to be generated, one of the following routines is
+called:
+
+- generator.generateExpression()
+- generator.generateAny()
+- generator.generateVal()
+- generator.generateLoc()
+
+The generator.generateExpression() routine takes in a mode value and depending
+on it, calls one of the other more specific routines. Each of these routines, in
+turn, calls a more specialized generation routine depending on the specific
+expression.
+
+If it is specifically requested to generate a value for an expression with only
+its location component defined or vice versa, generator.generateVal/Loc() will
+automatically perform the conversion.
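
As a rough illustration of the conversion between the two modes, the sketch below emits textual LLVM IR. It assumes that a location is turned into a value with a `load`, and that a value is turned into a location by spilling it into a fresh `alloca`; the README does not spell this out, so treat it as a guess. The `emitter`, `result`, `toVal`, and `toLoc` names are hypothetical and do not correspond to the generator's real API.

```go
package main

import "fmt"

type resultMode int

const (
	modeLoc resultMode = iota // register holds a pointer to the result
	modeVal                   // register holds the result itself
)

// result is what a hypothetical specialized routine hands back: a register
// and the mode it happened to be produced in.
type result struct {
	register string
	mode     resultMode
}

// emitter is a hypothetical stand-in for a blockManager. It just collects
// textual instructions for the current block.
type emitter struct {
	lines []string
	next  int
}

// fresh returns a new, unused register name.
func (e *emitter) fresh() string {
	e.next++
	return fmt.Sprintf("%%%d", e.next)
}

func (e *emitter) emit(format string, args ...any) {
	e.lines = append(e.lines, fmt.Sprintf(format, args...))
}

// toVal returns a register that directly contains the result, loading from
// the location if only a location is available.
func (e *emitter) toVal(r result, ty string) string {
	if r.mode == modeVal {
		return r.register
	}
	reg := e.fresh()
	e.emit("%s = load %s, ptr %s", reg, ty, r.register)
	return reg
}

// toLoc returns a register that points to the result, spilling the value to
// a new stack slot if only a value is available.
func (e *emitter) toLoc(r result, ty string) string {
	if r.mode == modeLoc {
		return r.register
	}
	reg := e.fresh()
	e.emit("%s = alloca %s", reg, ty)
	e.emit("store %s %s, ptr %s", ty, r.register, reg)
	return reg
}

func main() {
	e := &emitter{}
	// An arithmetic result is naturally a value; ask for its location.
	e.toLoc(result{register: "%sum", mode: modeVal}, "i32")
	// A member access is naturally a location; ask for its value.
	e.toVal(result{register: "%field", mode: modeLoc}, "i32")
	for _, line := range e.lines {
		fmt.Println(line)
	}
}
```

When the requested mode already matches the one the expression was produced in, no instructions are emitted at all, which is why callers that do not care can ask for Any.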
--- a/lexer/README.md
+++ b/lexer/README.md
@ -0,0 +1,42 @@
+# lexer
+
+## Responsibilities
+
+- Define the token type and token kinds
+- Turn streams of data into streams of tokens
+
+## Organization
+
+The lexer is split into its interface and implementation:
+
+- Lexer: public facing lexer interface
+- fsplLexer: private implementation of Lexer, with public constructors
+
+The lexer is bound to a data stream at the time of creation, and its Next()
+method may be called to read and return the next token from the stream.
+
+## Operation
+
+fsplLexer carries state information about what rune from the data stream is
+currently being processed. This must always be filled out as long as there is
+still data in the stream to read from. All lexer routines start off by using
+this rune, and end by advancing to the next rune for the next routine to use.
+
+The lexer follows this general flow:
+
+1. Upon creation, grab the first rune to initialize the lexer state
+2. When Next() is called:
+   1. Create a new token
+   2. Set the token's position
+   3. Switch off of the current rune to set the token's kind and invoke
+      specific lexing behavior
+   4. Expand the token's position to cover the full range
+
+When an EOF is detected, the lexer is marked as spent (eof: true) and will only
+return EOF tokens. The lexer will only return an error alongside an EOF token if
+the EOF was unexpected.
+
+The lexer also keeps track of its current position in order to embed it into
+tokens, and to print errors. It is important that the lowest level operation
+used to advance the lexer's position is fsplLexer.nextRune(), as it contains
+logic for keeping the position correct and maintaining the current lexer state.
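
The flow above can be sketched as a very small lexer. This is only an illustration under several assumptions: the `sketchLexer` type, its token kinds, the rune-offset positions, and the whitespace skipping are invented for the example and are far simpler than fsplLexer. The point is the shape: the constructor primes the current rune, every routine starts from that rune, and all advancement goes through `nextRune()`.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"unicode"
)

type tokenKind int

const (
	kindEOF tokenKind = iota
	kindIdent
	kindInt
	kindSymbol
)

type token struct {
	kind       tokenKind
	value      string
	start, end int // rune offsets; end is exclusive
}

type sketchLexer struct {
	reader  *bufio.Reader
	current rune // the rune currently being processed
	offset  int  // position of current within the stream
	eof     bool // set once the stream is spent
}

func newSketchLexer(r io.Reader) (*sketchLexer, error) {
	l := &sketchLexer{reader: bufio.NewReader(r), offset: -1}
	// Grab the first rune so that every routine can start from l.current.
	return l, l.nextRune()
}

// nextRune is the only place that advances the position.
func (l *sketchLexer) nextRune() error {
	if l.eof {
		return nil
	}
	char, _, err := l.reader.ReadRune()
	l.offset++
	if err == io.EOF {
		l.eof = true
		return nil
	}
	l.current = char
	return err
}

// consumeWhile appends accepted runes to the token, leaving the first
// rejected rune as the current one for the next routine.
func (l *sketchLexer) consumeWhile(tok *token, accept func(rune) bool) error {
	for !l.eof && accept(l.current) {
		tok.value += string(l.current)
		if err := l.nextRune(); err != nil {
			return err
		}
	}
	return nil
}

func (l *sketchLexer) Next() (token, error) {
	for !l.eof && unicode.IsSpace(l.current) {
		if err := l.nextRune(); err != nil {
			return token{}, err
		}
	}
	tok := token{start: l.offset} // new token, positioned at the current rune
	var err error
	switch {
	case l.eof:
		tok.kind = kindEOF // a spent lexer only returns EOF tokens
	case unicode.IsLetter(l.current):
		tok.kind = kindIdent
		err = l.consumeWhile(&tok, func(r rune) bool {
			return unicode.IsLetter(r) || unicode.IsDigit(r)
		})
	case unicode.IsDigit(l.current):
		tok.kind = kindInt
		err = l.consumeWhile(&tok, unicode.IsDigit)
	default:
		tok.kind = kindSymbol
		tok.value = string(l.current)
		err = l.nextRune()
	}
	tok.end = l.offset // expand the position to cover the full range
	return tok, err
}

// lexAll is a convenience helper that reads tokens up to and including EOF.
func lexAll(input string) ([]token, error) {
	l, err := newSketchLexer(strings.NewReader(input))
	if err != nil {
		return nil, err
	}
	var tokens []token
	for {
		tok, err := l.Next()
		if err != nil {
			return nil, err
		}
		tokens = append(tokens, tok)
		if tok.kind == kindEOF {
			return tokens, nil
		}
	}
}

func main() {
	tokens, err := lexAll("foo 42]")
	if err != nil {
		panic(err)
	}
	for _, tok := range tokens {
		fmt.Printf("kind=%d value=%q range=%d-%d\n", tok.kind, tok.value, tok.start, tok.end)
	}
}
```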
--- a/parser/README.md
+++ b/parser/README.md
@ -0,0 +1,128 @@
+# parser
+
+## Responsibilities
+
+- Define syntax tree type that contains entities
+- Turn streams of tokens into abstract syntax tree entities
+
+## Organization
+
+The entry point for all logic defined in this package is the Tree type. On this
+type, the Parse() method is defined. This method creates a new parser, and uses
+it to parse a stream of tokens into the tree. The details of the parser are
+hidden from other packages, and instances of it only last for the duration of
+Tree.Parse().
+
+## Operation
+
+The parser holds a pointer to the Tree that created it, as well as the lexer
+that was passed to it. Its parse() method attempts to consume all tokens
+produced by the lexer, parsing them into syntax entities which it places into
+the tree.
+
+parser.parse() parses top level entities, which include functions, methods, and
+typedefs. For each top-level entity, the parser will call a specialized parsing
+routine to parse that entity depending on the current token's kind and value.
+These routines in turn call other routines, which call other routines, etc.
+
+All parsing routines follow this general flow:
+
+- Start with the token already present in Parser.token. Do not get the
+  token after it.
+- Use Parser.expect(), Parser.expectValue(), etc. to test whether the token
+  is a valid start for the entity
+- If starting by calling another parsing method, trust that method to do
+  this instead.
+- When getting new tokens, use Parser.expectNext(),
+  Parser.expectNextDesc(), etc. Only use Parser.next() when getting a token
+  *right before* calling another parsing method, or at the *very end* of
+  the current method.
+- To terminate the method, get the next token and do nothing with it.
+- If terminating by calling another parsing method, trust that method to do
+  this instead.
+
+Remember that parsing routines always start with the current token, and end by
+getting a trailing token for the next method to start with. This makes it
+possible to reliably switch between parsing methods depending on the type or
+value of a token.
+
+The parser must never backtrack or look ahead, but it may revise previous
+data it has output upon receiving a new token that comes directly after the
+last token of said previous data. For example:
+
+- X in XYZ may not be converted to A once the parser has seen Z, but
+- X in XYZ may be converted to A once the parser has seen Y.
+
+This disallows complex and ambiguous syntax, but should allow things such as
+the very occasional infix operator (like . and =).
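
The calling convention described above (start on the current token, finish by fetching a trailing token) can be sketched with a toy parser. This is a simplified illustration, not the real Parser: it reads from a token slice instead of a lexer, it only understands a reduced call syntax of the form `[name argument*]`, and the `sketchParser`, `node`, and `render` names are made up for the example.

```go
package main

import "fmt"

type tokenKind int

const (
	tokEOF tokenKind = iota
	tokIdent
	tokInt
	tokLBracket
	tokRBracket
)

type token struct {
	kind  tokenKind
	value string
}

type node struct {
	value string
	args  []*node
}

type sketchParser struct {
	tokens []token
	pos    int
	token  token // the current token; routines always start here
}

func newSketchParser(tokens []token) *sketchParser {
	p := &sketchParser{tokens: tokens, pos: -1}
	p.next()
	return p
}

func (p *sketchParser) next() {
	p.pos++
	if p.pos < len(p.tokens) {
		p.token = p.tokens[p.pos]
	} else {
		p.token = token{kind: tokEOF}
	}
}

func (p *sketchParser) expect(allowed ...tokenKind) error {
	for _, kind := range allowed {
		if p.token.kind == kind {
			return nil
		}
	}
	return fmt.Errorf("unexpected token %q at index %d", p.token.value, p.pos)
}

func (p *sketchParser) expectNext(allowed ...tokenKind) error {
	p.next()
	return p.expect(allowed...)
}

func (p *sketchParser) parseExpression() (*node, error) {
	// Test whether the current token is a valid start for an expression.
	if err := p.expect(tokIdent, tokInt, tokLBracket); err != nil {
		return nil, err
	}
	if p.token.kind == tokLBracket {
		// Terminating by calling another routine: trust it to get the
		// trailing token.
		return p.parseCall()
	}
	n := &node{value: p.token.value}
	p.next() // terminate: get the trailing token and do nothing with it
	return n, nil
}

func (p *sketchParser) parseCall() (*node, error) {
	if err := p.expect(tokLBracket); err != nil {
		return nil, err
	}
	if err := p.expectNext(tokIdent); err != nil {
		return nil, err
	}
	call := &node{value: p.token.value}
	p.next() // plain next(): right before calling another parsing routine
	for p.token.kind != tokRBracket {
		arg, err := p.parseExpression()
		if err != nil {
			return nil, err
		}
		call.args = append(call.args, arg)
	}
	p.next() // terminate: step past "]" onto the trailing token
	return call, nil
}

// render prints a node as name(arg arg ...), or just the name for leaves.
func render(n *node) string {
	if len(n.args) == 0 {
		return n.value
	}
	out := n.value + "("
	for i, arg := range n.args {
		if i > 0 {
			out += " "
		}
		out += render(arg)
	}
	return out + ")"
}

func main() {
	p := newSketchParser([]token{
		{tokLBracket, "["}, {tokIdent, "add"}, {tokInt, "1"},
		{tokLBracket, "["}, {tokIdent, "neg"}, {tokInt, "2"},
		{tokRBracket, "]"}, {tokRBracket, "]"},
	})
	tree, err := p.parseExpression()
	if err != nil {
		panic(err)
	}
	fmt.Println(render(tree)) // prints "add(1 neg(2))"
}
```

Because every routine leaves the parser sitting on the token after its entity, the loop in `parseCall` can decide what to do next by inspecting `p.token` alone, with no lookahead or backtracking.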
+
+### Expression Parsing
+
+Expression notation is the subset of FSPL that is used to describe
+computations and data/control flow. The expression parsing routine is the most
+important part of the FSPL parser, and also the most complex. For each
+expression, the parser follows this decision tree to determine what to parse:
+
+```
+| +Ident =Variable
+|        | 'true'       =LiteralBoolean
+|        | 'false'      =LiteralBoolean
+|        | 'nil'        =LiteralNil
+|        | 'if'         =IfElse
+|        | 'loop'       =Loop
+|        | +Colon       =Declaration
+|        | +DoubleColon =Call
+|
+| +LParen X
+|         | +Star =LiteralArray
+|         | +Dot  =LiteralStruct
+|
+| +LBracket | +Ident =Call
+|           |        | 'break'  =Break
+|           |        | 'return' =Return
+|           | 
+|           | +Dot +Expression =Dereference
+|           |                  | +Expression =Subscript
+|           | +Star =Operation
+|           |
+|           | +Symbol X
+|                     | '\'      =Slice
+|                     | '#'      =Length
+|                     | '@'      =Reference
+|                     | '~'      =ValueCast
+|                     | '~~'     =BitCast
+|                     | OPERATOR =Operation
+|
+| +LBrace =Block
+| +Int    =LiteralInt
+| +Float  =LiteralFloat
+| +String =LiteralString
+```
+
+Each branch of the decision tree is implemented as a routine with one or more
+switch statements which call other routines, and each leaf is implemented as a
+normal entity parsing routine.
+
+Expressions that are only detected after more than one token has been
+consumed have parsing routines with "-Core" appended to their names. This means that
+they do not begin at the first token in the expression, but instead at the point
+where there is no longer any ambiguity as to what they are.
+
+### Type Parsing
+
+Type notation is the subset of FSPL that is used to describe data types. Though
+it is not as complex as expression notation, it still needs a decision tree to
+determine what type to parse:
+
+```
+| +Ident     =TypeNamed
+| +TypeIdent =TypeNamed
+| +Int       =TypeArray
+|
+| +LParen X
+|         | +Dot        =TypeStruct
+|         | +Symbol '~' =TypeInterface
+|
+| +Star =TypePointer
+        | +Colon     =TypeSlice
+```
--- a/parser/expression.go
+++ b/parser/expression.go
@ -4,44 +4,6 @@ import "git.tebibyte.media/sashakoshka/fspl/lexer"
 import "git.tebibyte.media/sashakoshka/fspl/errors"
 import "git.tebibyte.media/sashakoshka/fspl/entity"

-// Expression decision tree flow:
-//
-//   | +Ident =Variable
-//   |        | 'true'       =LiteralBoolean
-//   |        | 'false'      =LiteralBoolean
-//   |        | 'nil'        =LiteralNil
-//   |        | 'if'         =IfElse
-//   |        | 'loop'       =Loop
-//   |        | +Colon       =Declaration
-//   |        | +DoubleColon =Call
-//   |
-//   | +LParen X
-//   |         | +Star =LiteralArray
-//   |         | +Dot  =LiteralStruct
-//   |
-//   | +LBracket | +Ident =Call
-//   |           |        | 'break'  =Break
-//   |           |        | 'return' =Return
-//   |           | 
-//   |           | +Dot +Expression =Dereference
-//   |           |                  | +Expression =Subscript
-//   |           | +Star =Operation
-//   |           |
-//   |           | +Symbol X
-//   |                     | '\'      =Slice
-//   |                     | '#'      =Length
-//   |                     | '@'      =Reference
-//   |                     | '~'      =ValueCast
-//   |                     | '~~'     =BitCast
-//   |                     | OPERATOR =Operation
-//   |
-//   | +LBrace =Block
-//   | +Int    =LiteralInt
-//   | +Float  =LiteralFloat
-//   | +String =LiteralString
-//
-// Entities with a star have yet to be implemented
-
 var descriptionExpression = "expression"
 var startTokensExpression = []lexer.TokenKind {
 	lexer.Ident,
--- a/parser/parser.go
+++ b/parser/parser.go
@ -5,36 +5,6 @@ import "fmt"
 import "git.tebibyte.media/sashakoshka/fspl/lexer"
 import "git.tebibyte.media/sashakoshka/fspl/errors"

-// When writing a parsing method on Parser, follow this flow:
-//   - Start with the token already present in Parser.token. Do not get the
-//     token after it.
-//   - Use Parser.expect(), Parser.expectValue(), etc. to test whether the token
-//     is a valid start for the entity
-//   - If starting by calling another parsing method, trust that method to do
-//     this instead.
-//   - When getting new tokens, use Parser.expectNext(),
-//     Parser.expectNextDesc(), etc. Only use Parser.next() when getting a token
-//     *right before* calling another parsing method, or at the *very end* of
-//     the current method.
-//   - To terminate the method, get the next token and do nothing with it.
-//   - If terminating by calling another parsing method, trust that method to do
-//     this instead.
-//
-// Remember that parsing methods always start with the current token, and end by
-// getting a trailing token for the next method to start with. This makes it
-// possible to reliably switch between parsing methods depending on the type or
-// value of a token.
-//
-// The parser must never backtrack or look ahead, but it may revise previous
-// data it has output upon receiving a new token that comes directly after the
-// last token of said previous data. For example:
-//
-//   X in XYZ may not be converted to A once the parser has seen Z, but
-//   X in XYZ may be converted to A once the parser has seen Y.
-//
-// This disallows complex and ambiguous syntax, but should allow things such as
-// the very occasional infix operator (like . and =)
-
 // Parser parses tokens from a lexer into syntax entities, which it places into
 // a tree.
 type Parser struct {