Using and Porting GNU Fortran


lex.c

To help make the lexer simple, fast, and easy to maintain, while also having g77 generally encourage Fortran programmers to write simple, maintainable, portable code by maximizing the performance of compiling that kind of code:

There'll be just one lexer, for both fixed-form and free-form source.
It'll care about the form only when handling the first 7 columns of text, stuff like spaces between strings of alphanumerics, and how lines are continued.
Some other distinctions will be handled by subsequent phases, so at least one of them will have to know which form is involved.
For example, I = 2 . 4 is acceptable in fixed form, and works in free form as well given the implementation g77 presently uses. But the standard requires a diagnostic for it in free form, so the parser has to be able to recognize that the lexemes aren't contiguous (information the lexer does have to provide) and that free-form source is being parsed, so it can provide the diagnostic.
The g77 lexer doesn't try to gather 2 . 4 into a single lexeme. Otherwise, it'd have to know a whole lot more about how to parse Fortran, or subsequent phases (mainly parsing) would have two paths through lots of critical code--one to handle the lexemes 2, ., and 4 in sequence, another to handle the lexeme 2.4.
It won't worry about line lengths (beyond the first 7 columns for fixed-form source).
That is, once it starts parsing the "statement" part of a line (column 7 for fixed-form, column 1 for free-form), it'll keep going until it finds a newline, rather than ignoring everything past a particular column (72 or 132).
The implication here is that there shouldn't be anything past that last column, other than whitespace or commentary, because users using typical editors (or viewing output as typically printed) won't necessarily know just where the last column is.
Code that has "garbage" beyond the last column (almost certainly only fixed-form code with a punched-card legacy, such as code using columns 73-80 for "sequence numbers") will have to be run through g77stripcard first.
Also, keeping track of the maximum column position while also watching out for the end of a line and while reading from a file just makes things slower. Since a file must be read, and watching for the end of the line is necessary (unless the typical input file was preprocessed to include the necessary number of trailing spaces), dropping the tracking of the maximum column position is the only way to reduce the complexity of the pertinent code while maintaining high performance.
ASCII encoding is assumed for the input file.
Code written in other character sets will have to be converted first.
Tabs (ASCII code 9) will be converted to spaces via the straightforward approach.
Specifically, a tab is converted to between one and eight spaces as necessary to reach column n, where dividing (n - 1) by eight results in a remainder of zero.
That saves having to pass most source files through expand.
Linefeeds (ASCII code 10) mark the ends of lines.
A carriage return (ASCII code 13) is accepted if it immediately precedes a linefeed, in which case it is ignored.
Otherwise, it is rejected (with a diagnostic).
Any characters other than the above that are not part of the GNU Fortran Character Set (see Character Set) are rejected with a diagnostic.
This includes backspaces, form feeds, and the like.
(It might make sense to allow a form feed in column 1 as long as that's the only character on a line. It certainly wouldn't seem to cost much in terms of performance.)
The end of the input stream (EOF) ends the current line.
The distinction between uppercase and lowercase letters will be preserved.
It will be up to subsequent phases to decide to fold case.
Current plans are to permit any casing for Fortran (reserved) keywords while preserving casing for user-defined names. (This might not be made the default for .f files, though.)
Preserving case seems necessary to provide more direct access to facilities outside of g77, such as to C or Pascal code.
Names of intrinsics will probably be matchable in any case.
(How external SiN; r = sin(x) would be handled is TBD. I think old g77 might already handle that pretty elegantly, but whether we can cope with allowing the same fragment to reference a different procedure, even with the same interface, via s = SiN(r), needs to be determined. If it can't, we need to make sure that when code introduces a user-defined name, any intrinsic matching that name using a case-insensitive comparison is "turned off".)
Backslashes in CHARACTER and Hollerith constants are not allowed.
This avoids the confusion introduced by some Fortran compiler vendors providing C-like interpretation of backslashes, while others provide straight-through interpretation.
Some kind of lexical construct (TBD) will be provided to allow flagging of a CHARACTER (but probably not a Hollerith) constant that permits backslashes. It'll necessarily be a prefix, such as:
```
          PRINT *, C'This line has a backspace \b here.'
          PRINT *, F'This line has a straight backslash \ here.'
          
```
Further, command-line options might be provided to specify that one prefix or the other is to be assumed as the default for CHARACTER constants.
However, it seems more helpful for g77 to provide a program that prefixes all constants (or just those containing backslashes) with the desired designation, so printouts of code can be read without knowing the compile-time options used when compiling it.
If such a program is provided (let's name it g77slash for now), then a command-line option to g77 should not be provided. (Though, given that it'll be easy to implement, it might be hard to resist user requests for it "to compile faster than if we have to invoke another filter".)
This program would take a command-line option to specify the default interpretation of backslashes, affecting which prefix it uses for constants.
g77slash probably should automatically convert Hollerith constants that contain backslashes to the appropriate CHARACTER constants. Then g77 wouldn't have to define a prefix syntax for Hollerith constants specifying whether they want C-style or straight-through backslashes.
To allow for form-neutral INCLUDE files without requiring them to be preprocessed, the fixed-form lexer should offer an extension (if possible) allowing a trailing & to be ignored, especially if after column 72, as it would be using the traditional Unix Fortran source model (which ignores everything after column 72).
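
As a rough illustration of the earlier point about I = 2 . 4, the following C sketch shows one way a lexer could hand over 2, ., and 4 as separate lexemes while recording whether blanks came before each one, so that a later phase can issue the free-form diagnostic. All of the names (struct lexeme, space_before, lex_statement, split_real_needs_diagnostic) are hypothetical and are not taken from lex.c. The sketch also treats blanks purely as separators, which glosses over the fixed-form handling of spaces between strings of alphanumerics.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* All names here are hypothetical; none are taken from lex.c. */
enum lex_kind { LEX_NAME, LEX_NUMBER, LEX_OTHER };

struct lexeme {
    enum lex_kind kind;
    const char *text;   /* start of the lexeme within the source line */
    size_t len;
    int space_before;   /* nonzero if blanks precede it on the line */
};

/* Split one statement into lexemes without gathering "2 . 4" into a
   single real constant.  Returns the number of lexemes stored. */
static size_t lex_statement(const char *s, struct lexeme *out, size_t max)
{
    size_t n = 0;
    int space = 0;

    assert(s != NULL && out != NULL);
    while (*s != '\0' && n < max) {
        const char *start = s;
        enum lex_kind kind = LEX_OTHER;

        if (*s == ' ') {            /* remember the gap, emit nothing */
            space = 1;
            s++;
            continue;
        }
        if (isdigit((unsigned char)*s)) {
            kind = LEX_NUMBER;
            while (isdigit((unsigned char)*s))
                s++;
        } else if (isalpha((unsigned char)*s)) {
            kind = LEX_NAME;
            while (isalnum((unsigned char)*s))
                s++;
        } else {
            s++;                    /* single-character lexeme such as '.' */
        }
        out[n].kind = kind;
        out[n].text = start;
        out[n].len = (size_t)(s - start);
        out[n].space_before = space;
        space = 0;
        n++;
    }
    return n;
}

/* Parser-side check: do lexemes i, i+1, i+2 spell digits '.' digits
   with blanks in between while free-form source is being parsed? */
static int split_real_needs_diagnostic(const struct lexeme *lx, size_t n,
                                       size_t i, int free_form)
{
    if (i + 2 >= n)
        return 0;
    if (lx[i].kind != LEX_NUMBER || lx[i + 2].kind != LEX_NUMBER)
        return 0;
    if (lx[i + 1].kind != LEX_OTHER || lx[i + 1].text[0] != '.')
        return 0;
    return free_form && (lx[i + 1].space_before || lx[i + 2].space_before);
}
```

With this shape, the parser has a single path for real constants (always three lexemes) and consults the form only when deciding whether the gap deserves a diagnostic.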

The above implements nearly exactly what is specified by Character Set and Lines, except that it also provides automatic conversion of tabs and ignoring of newline-related carriage returns, as well as accommodating form-neutral INCLUDE files.
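
The character-level rules listed above (tab conversion, carriage returns, linefeeds, EOF, and rejection of other characters) might be sketched in C roughly as follows. This is only an illustration under stated assumptions: the names tab_width, normalize_line, and enum line_status are hypothetical, printable ASCII (codes 32 through 126) stands in for the GNU Fortran Character Set, and the fixed-size output buffer with its LINE_TOO_LONG result is a simplification of the sketch, not a line-length limit imposed by the design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes; not taken from lex.c. */
enum line_status { LINE_OK, LINE_BAD_CHAR, LINE_TOO_LONG };

/* Spaces needed to advance from 1-based column col to the next column
   n where (n - 1) % 8 == 0.  The result is always between 1 and 8. */
static int tab_width(int col)
{
    assert(col >= 1);
    return 8 - (col - 1) % 8;
}

/* Copy one source line from in (len bytes available) into out as a
   NUL-terminated string, expanding tabs and ignoring a CR that
   immediately precedes the LF.  The line ends at LF or at the end of
   the input (EOF).  *consumed receives the number of bytes used. */
static enum line_status normalize_line(const char *in, size_t len,
                                       char *out, size_t cap,
                                       size_t *consumed)
{
    size_t i = 0, o = 0;
    enum line_status st = LINE_OK;

    assert(cap > 0);
    while (i < len && in[i] != '\n') {
        unsigned char c = (unsigned char)in[i];

        if (c == '\r' && i + 1 < len && in[i + 1] == '\n') {
            /* CR directly before LF is ignored; the loop stops at the LF. */
        } else if (c == '\t') {
            size_t n = (size_t)tab_width((int)o + 1);
            if (o + n >= cap) { st = LINE_TOO_LONG; break; }
            memset(out + o, ' ', n);
            o += n;
        } else if (c >= 32 && c <= 126) {   /* assumed character set */
            if (o + 1 >= cap) { st = LINE_TOO_LONG; break; }
            out[o++] = (char)c;
        } else {
            st = LINE_BAD_CHAR;             /* stray CR, form feed, ... */
            break;
        }
        i++;
    }
    if (st == LINE_OK && i < len)
        i++;                                /* step over the LF */
    out[o] = '\0';
    *consumed = i;
    return st;
}
```

Note that the loop never compares the column against 72 or 132; the only end-of-line conditions are the linefeed and the end of the input.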

It also implements the "pure visual" model, by which is meant that a user viewing his code in a typical text editor (assuming it's not preprocessed via g77stripcard or similar) doesn't need any special knowledge of whether spaces on the screen are really tabs, whether lines end immediately after the last visible non-space character or after a number of spaces and tabs that follow it, or whether the last line in the file is ended by a newline.

Most editors don't make these distinctions, the ANSI FORTRAN 77 standard doesn't require them to, and it permits a standard-conforming compiler to define a method for transforming source code to "standard form" however it wants.

So, GNU Fortran defines it such that users have the best chance of having the code be interpreted the way it looks on the screen of the typical editor.
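
One way to make the "pure visual" model concrete is a small C check that two byte-wise different lines look the same on screen. The helpers visible_form and same_on_screen are hypothetical names introduced for this sketch, which assumes the every-eighth-column tab convention and a 256-byte scratch buffer chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Produce what a typical editor would show for one line: tabs expanded
   to every-eighth-column stops, trailing blanks dropped, and the line
   terminator (if any) removed.  Hypothetical helper, not from g77. */
static size_t visible_form(const char *line, char *out, size_t cap)
{
    size_t o = 0;

    assert(cap > 0);
    for (; *line != '\0' && *line != '\n' && *line != '\r'; line++) {
        int is_tab = (*line == '\t');
        size_t n = is_tab ? 8 - o % 8 : 1;

        if (o + n >= cap)
            break;                  /* sketch only: truncate long lines */
        memset(out + o, is_tab ? ' ' : *line, n);
        o += n;
    }
    while (o > 0 && out[o - 1] == ' ')
        o--;                        /* trailing blanks are invisible */
    out[o] = '\0';
    return o;
}

/* Nonzero if the two lines are indistinguishable on screen. */
static int same_on_screen(const char *a, const char *b)
{
    char va[256], vb[256];

    visible_form(a, va, sizeof va);
    visible_form(b, vb, sizeof vb);
    return strcmp(va, vb) == 0;
}
```

Under the model described here, any two lines for which same_on_screen returns nonzero are meant to be interpreted identically by the compiler.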

(Fancy editors should never be required to correctly read code written in classic two-dimensional-plaintext form. By correct reading I mean ability to read it, book-like, without mistaking text ignored by the compiler for program code and vice versa, and without having to count beyond the first several columns. The vague meaning of ASCII TAB, among other things, complicates this somewhat, but as long as "everyone", including the editor, other tools, and printer, agrees about the every-eighth-column convention, the GNU Fortran "pure visual" model meets these requirements. Any language or user-visible source form requiring special tagging of tabs, the ends of lines after spaces/tabs, and so on, fails to meet this fairly straightforward specification. Fortunately, Fortran itself does not mandate such a failure, though most vendor-supplied defaults for their Fortran compilers do fail to meet this specification for readability.)

Further, this model provides a clean interface to whatever preprocessors or code-generators are used to produce input to this phase of g77. Mainly, they need not worry about long lines.