To help make the lexer simple, fast, and easy to maintain, while also having g77 generally encourage Fortran programmers to write simple, maintainable, portable code by maximizing the performance of compiling that kind of code:
Some other distinctions will be handled by subsequent phases, so at least one of them will have to know which form is involved.
For example, I = 2 . 4 is acceptable in fixed form, and works in free form as well given the implementation g77 presently uses. But the standard requires a diagnostic for it in free form, so the parser has to be able to recognize that the lexemes aren't contiguous (information the lexer does have to provide) and that free-form source is being parsed, so it can provide the diagnostic.
The g77 lexer doesn't try to gather 2 . 4 into a single lexeme. Otherwise, it'd have to know a whole lot more about how to parse Fortran, or subsequent phases (mainly parsing) would have two paths through lots of critical code: one to handle the lexemes 2, ., and 4 in sequence, another to handle the single lexeme 2.4.
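The division of labor described above can be sketched as follows. This is only an illustration, not the g77 implementation: the names Lexeme, space_before, lex, and split_constant_diagnostics are hypothetical, the lexer handles only a toy subset of Fortran, and the diagnostic wording is invented. The lexer always emits 2, ., and 4 separately and merely records whether a blank preceded each lexeme; the parser-side check combines that flag with knowledge of the source form.

```python
from dataclasses import dataclass


@dataclass
class Lexeme:
    text: str
    space_before: bool  # True if a blank separates this lexeme from the previous one


def lex(statement):
    """Split a statement into lexemes; never gathers 2 . 4 into one lexeme."""
    lexemes = []
    i = 0
    space_before = False
    while i < len(statement):
        ch = statement[i]
        if ch == " ":
            space_before = True
            i += 1
            continue
        j = i + 1
        if ch.isdigit():
            # A run of digits forms one lexeme.
            while j < len(statement) and statement[j].isdigit():
                j += 1
        elif ch.isalpha():
            # A letter followed by letters/digits forms one lexeme.
            while j < len(statement) and statement[j].isalnum():
                j += 1
        # Any other character is a one-character lexeme.
        lexemes.append(Lexeme(statement[i:j], space_before))
        space_before = False
        i = j
    return lexemes


def split_constant_diagnostics(lexemes, free_form):
    """Parser-side check: flag digits . digits split by blanks in free form."""
    diagnostics = []
    for a, dot, b in zip(lexemes, lexemes[1:], lexemes[2:]):
        if a.text.isdigit() and dot.text == "." and b.text.isdigit():
            if free_form and (dot.space_before or b.space_before):
                diagnostics.append(
                    "blanks inside constant " + a.text + "." + b.text)
    return diagnostics
```

Both I = 2 . 4 and I = 2.4 yield the same lexeme texts, so the parser has a single path through the code; only the space_before flags and the source form decide whether a diagnostic is issued.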
That is, once it starts parsing the "statement" part of a line (column 7 for fixed-form, column 1 for free-form), it'll keep going until it finds a newline, rather than ignoring everything past a particular column (72 or 132).
The implication here is that there shouldn't be anything past that last column, other than whitespace or commentary, because users using typical editors (or viewing output as typically printed) won't necessarily know just where the last column is.
Code that has "garbage" beyond the last column (almost certainly only fixed-form code with a punched-card legacy, such as code using columns 73-80 for "sequence numbers") will have to be run through g77stripcard first.
Also, keeping track of the maximum column position while also watching out for the end of a line and while reading from a file just makes things slower. Since a file must be read, and watching for the end of the line is necessary (unless the typical input file was preprocessed to include the necessary number of trailing spaces), dropping the tracking of the maximum column position is the only way to reduce the complexity of the pertinent code while maintaining high performance.
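The behavior of g77stripcard is not specified here beyond its name. As a rough sketch of what such a filter might do, assuming it simply expands tabs on the every-eighth-column convention and then discards everything past column 72 (the function name strip_card and the last_column parameter are hypothetical):

```python
def strip_card(line, last_column=72):
    """Drop everything past last_column from one line (no trailing newline).

    Tabs are expanded first so that the cut happens at the visual column.
    """
    return line.expandtabs(8)[:last_column]
```

A line such as a statement padded to column 72 and followed by the sequence number 00000010 in columns 73-80 would come out with the sequence number removed, which is the form the lexer described here expects.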
Code written in other character sets will have to be converted first.
Specifically, a tab is converted to between one and eight spaces as necessary to reach column n, where dividing (n - 1) by eight results in a remainder of zero. That saves having to pass most source files through expand.
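The tab rule can be written out directly. The sketch below is an illustration in Python rather than the lexer's actual code; the function name expand_tabs and the tab_width parameter are assumptions made for the example.

```python
def expand_tabs(line, tab_width=8):
    """Replace each tab with 1..tab_width spaces so that the next character
    lands in a column n satisfying (n - 1) % tab_width == 0."""
    out = []
    for ch in line:
        if ch == "\t":
            # len(out) is the number of columns already filled,
            # that is, (current column - 1).
            out.extend(" " * (tab_width - len(out) % tab_width))
        else:
            out.append(ch)
    return "".join(out)
```

For input without embedded carriage returns or newlines this gives the same result as Python's built-in str.expandtabs(8), and it corresponds to the default behavior of the expand utility.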
Otherwise, it is rejected (with a diagnostic).
This includes backspaces, form feeds, and the like.
(It might make sense to allow a form feed in column 1 as long as that's the only character on a line. It certainly wouldn't seem to cost much in terms of performance.)
It will be up to subsequent phases to decide to fold case. Current plans are to permit any casing for Fortran (reserved) keywords while preserving casing for user-defined names. (This might not be made the default for .f files, though.) Preserving case seems necessary to provide more direct access to facilities outside of g77, such as to C or Pascal code.
Names of intrinsics will probably be matchable in any case. (How external SiN; r = sin(x) would be handled is TBD. I think old g77 might already handle that pretty elegantly, but whether we can cope with allowing the same fragment to reference a different procedure, even with the same interface, via s = SiN(r), needs to be determined. If we can't, we need to make sure that when code introduces a user-defined name, any intrinsic matching that name using a case-insensitive comparison is "turned off".)
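One way to picture the "turned off" fallback is the sketch below. It is purely illustrative, since the actual handling is TBD: the NameTable class, its methods, the result tags, and the three-entry intrinsic list are all hypothetical. User-defined names are stored with their exact spelling, intrinsics are matched case-insensitively, and declaring a user name disables any intrinsic that matches it when case is ignored.

```python
INTRINSICS = {"sin", "cos", "sqrt"}  # illustrative subset only


class NameTable:
    def __init__(self):
        self.user_names = set()  # exact spellings, case preserved
        self.disabled = set()    # lowercase intrinsic names turned off

    def declare(self, name):
        """Introduce a user-defined name, e.g. via EXTERNAL."""
        self.user_names.add(name)
        if name.lower() in INTRINSICS:
            self.disabled.add(name.lower())

    def resolve(self, name):
        """Classify a reference as user-defined, intrinsic, or undeclared."""
        if name in self.user_names:
            return ("user", name)
        key = name.lower()
        if key in INTRINSICS and key not in self.disabled:
            return ("intrinsic", key)
        return ("undeclared", name)
```

Under this policy, after external SiN the reference sin(x) no longer silently reaches the intrinsic, so a fragment cannot call two different procedures through spellings that differ only in case.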
Backslashes in CHARACTER and Hollerith constants are not allowed.
This avoids the confusion introduced by some Fortran compiler vendors providing C-like interpretation of backslashes, while others provide straight-through interpretation.
Some kind of lexical construct (TBD) will be provided to allow flagging of a CHARACTER (but probably not a Hollerith) constant that permits backslashes. It'll necessarily be a prefix, such as:

     PRINT *, C'This line has a backspace \b here.'
     PRINT *, F'This line has a straight backslash \ here.'
Further, command-line options might be provided to specify that one prefix or the other is to be assumed as the default for CHARACTER constants.
However, it seems more helpful for g77 to provide a program that prefixes all constants (or just those containing backslashes) with the desired designation, so printouts of code can be read without knowing the compile-time options used when compiling it.
If such a program is provided (let's name it g77slash for now), then a command-line option to g77 should not be provided. (Though, given that it'll be easy to implement, it might be hard to resist user requests for it "to compile faster than if we have to invoke another filter".)
This program would take a command-line option to specify the default interpretation of backslashes, affecting which prefix it uses for constants.
g77slash probably should automatically convert Hollerith constants that contain backslashes to the appropriate CHARACTER constants. Then g77 wouldn't have to define a prefix syntax for Hollerith constants specifying whether they want C-style or straight-through backslashes.
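Since both the prefix syntax and g77slash are still TBD, the following is only a sketch of the core rewriting step such a filter might perform, under several loud assumptions: the prefixes are the letters C and F as in the example above, constants are delimited by apostrophes with '' standing for one apostrophe, only constants containing a backslash are prefixed, and comments and Hollerith constants are not handled at all. The function name add_prefixes and its default parameter are hypothetical.

```python
def add_prefixes(line, default="F"):
    """Prefix each unprefixed apostrophe-delimited constant that contains a
    backslash with the given default designation ('C' or 'F')."""
    out = []
    i = 0
    while i < len(line):
        if line[i] != "'":
            out.append(line[i])
            i += 1
            continue
        # Scan to the closing apostrophe, skipping doubled apostrophes.
        j = i + 1
        while j < len(line):
            if line[j] == "'":
                if j + 1 < len(line) and line[j + 1] == "'":
                    j += 2
                    continue
                break
            j += 1
        constant = line[i:j + 1]  # runs to end of line if unterminated
        # Assumption: a C or F directly before the apostrophe is a prefix.
        already_prefixed = i > 0 and line[i - 1] in "CFcf"
        if "\\" in constant and not already_prefixed:
            out.append(default)
        out.append(constant)
        i = j + 1
    return "".join(out)
```

The default argument plays the role of the command-line option mentioned above: a user whose code was written for a compiler with C-like backslash handling would run the filter with default="C", and the resulting source would then read the same regardless of how it is later compiled.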
&
to be ignored, especially if after
column 72, as it would be using the traditional Unix Fortran source
model (which ignores everything after column 72).
The above implements nearly exactly what is specified by Character Set, and Lines, except it also provides automatic conversion of tabs and ignoring of newline-related carriage returns, as well as accommodating form-neutral INCLUDE files.
It also implements the "pure visual" model, by which is meant that a user viewing his code in a typical text editor (assuming it's not preprocessed via g77stripcard or similar) doesn't need any special knowledge of whether spaces on the screen are really tabs, whether lines end immediately after the last visible non-space character or after a number of spaces and tabs that follow it, or whether the last line in the file is ended by a newline.
Most editors don't make these distinctions, the ANSI FORTRAN 77 standard doesn't require them to, and it permits a standard-conforming compiler to define a method for transforming source code to "standard form" however it wants.
So, GNU Fortran defines it such that users have the best chance of having the code be interpreted the way it looks on the screen of the typical editor.
(Fancy editors should never be required to correctly read code written in classic two-dimensional-plaintext form. By correct reading I mean ability to read it, book-like, without mistaking text ignored by the compiler for program code and vice versa, and without having to count beyond the first several columns. The vague meaning of ASCII TAB, among other things, complicates this somewhat, but as long as "everyone", including the editor, other tools, and printer, agrees about the every-eighth-column convention, the GNU Fortran "pure visual" model meets these requirements. Any language or user-visible source form requiring special tagging of tabs, the ends of lines after spaces/tabs, and so on, fails to meet this fairly straightforward specification. Fortunately, Fortran itself does not mandate such a failure, though most vendor-supplied defaults for their Fortran compilers do fail to meet this specification for readability.)
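The "pure visual" model amounts to saying that files which look the same on screen reach later phases as the same sequence of lines. A minimal sketch of that normalization, assuming the every-eighth-column tab convention and treating trailing blanks as insignificant (the function name visual_lines is hypothetical, and this is not the lexer's actual code):

```python
def visual_lines(text):
    """Return the lines of a source file as they appear on screen."""
    lines = text.split("\n")
    # A newline after the last line is optional.
    if lines and lines[-1] == "":
        lines.pop()
    result = []
    for line in lines:
        # Ignore a carriage return that belongs to the line ending.
        if line.endswith("\r"):
            line = line[:-1]
        # Tabs become spaces; trailing spaces are not visible.
        result.append(line.expandtabs(8).rstrip(" "))
    return result
```

With this, a file containing a tab, trailing blanks, and a CR-LF line ending is indistinguishable from one typed with eight spaces and no final newline.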
Further, this model provides a clean interface to whatever preprocessors or code-generators are used to produce input to this phase of g77. Mainly, they need not worry about long lines.