Bob Walton's New Programming Language

I have been developing a new general purpose programming language, step by step, over a period of years.

The main goal of this language is to be usable by commercial programmers, while being accessible to average high school students, who should be able to modify and augment parts of programs written in the language. Thus games or simulations written in the language should be modifiable by their users. The language is yet unnamed.

After teaching programming to students who are on the margin of being able to learn to program, I arrived at the following principals for designing such a language:

Type systems, proof systems, garbage collection, graphic description systems, distributed computation systems, real time systems, debugging systems, and other such things are related to the development of this language.

This new language has been through a basic research phase, in which ideas for the language have been discovered and evaluated. In 1999 the language entered the implementation phase. It is expected that implementation of a first version with byte code compiler will require 15,000 to 30,000 source lines of code, although parts of the language system, in particular the data subsystem, are independent of the rest of the system and should be done early.

Phase I: Basic Research

Phase Ia: R-Code, a Very Capable Virtual Computer

The first effort to define the new language centered on defining a virtual computer that would contain the new features to be embodied in the language. This took about 3 years (1992-95) and resulted in: The main accomplishments of the R-CODE design are a good design for a universal real-time-capable garbage collector and a design for type maps, which generalize C++ virtual function tables and Haskell dictionaries. R-CODE also includes designs for parallel control and data sharing, but these particular designs may not survive to be included in the final language.

The main conclusion of the garbage collector design is that a fixed-stub/relocatable-body organization should be used for objects. Each object has a stub (32 bytes in the current implementation efforts) at a fixed address in memory, and may have one or more separate relocatable bodies, of any length. The bodies can be moved at (almost) any time, and can be deallocated at (almost) any time, even independently of the garbage collector. The stubs are managed by a classic real-time mark-and-sweep algorithm, with or without ephemeral levels. The details are in Chapter 2 of the above thesis.

The hope is that this garbage collector will be UNIVERSAL, and will be usable even in the guts of future operating systems and real-time programs. The technology seems to fit both the large scale real-time radar systems and the scientific data managment systems (super-MatLab systems) that I have helped build in the past.

A type map is a generalization of the virtual function tables of C++ and the method tables of JAVA. A type map is a vector of constant data, typically pointers to functions, that describes one or more types. For example, suppose a function is passed a matrix whose elements are of an unknown type that form a lattice, with max and min operators (this example comes from the work of Mike Karr). The function could also be passed a type map containing pointers to the max and min functions to be used with the particular type of elements. Then the function could execute some algorithm on the matrix that used only the latice operations on the matrix elements, even though the exact type of these elements is unknown to the function. These elements might, for example, be 8 bit bytes, and max and min might be bitwise OR and AND, respectively.

Type maps are at most very slight generalizations of Haskell dictionaries. The point is that they need to be made first class citizens of any modern virtual machine and any modern programming language. Haskell is very close to doing this, but C++ and JAVA are seriously defficient, and will not handle the above example. Type maps, as part of a virtual computer, are investigated in Chapter 3 of the above thesis.

Phase Ib: Gentle Introduction to Programming

The second effort to define the new language centered on trying to teach students who had difficulties learning to program, in order to find out at least roughly what programming language features would and would not work with the high schoolers for whom the new language was in part intended. The mechanism here was to teach introductory programming at Suffolk University for a few years, and to write a textbook, or rather a series of revisions of a textbook, and to develop companion programming environments. Thanks go to Paul Ezust, Tom Fratto, and others at Suffolk for aiding this endeavor.

Suffolk University requires all of its students to take a computer course. Many take a course in which they learn to use a fancy calculator program, somewhat like MatLab. A few take a programming course for students who can learn to program fairly easily. In between are students who want to learn to program, but find it somewhat difficult. This is the course I taught for two years, and for which I developed various versions of a textbook over three years.

The textbook and its versions are:

The main result of this effort was the three principals stated in the introduction to this page.

Another result was the usability of rewrite rule formalism to teach programming to beginning students, even those with limited talent. Also, in a much more advanced course, an Introduction to Computer Science at Harvard, I was able to successfully use rewrite rules to teach the LISP programming language, as exemplified by the text I wrote for the first part of this more advanced course:

Phase Ic: Infostore

For many years I have been interested in distributed data collection and management systems, and, in particular, in the idea that raw data should be encoded as constant records of information that have unique identifiers and can point each other, rather like a world-wide LISP memory whose contents never change except to increase. In later years I have taken to calling these constant records `infemes', and to calling storage systems base in infemes `infostores'.

In 1993 I was exposed to Prof. Cheatham's Artifacts System, a distributed code and document management system. I decided that such systems could benefit from the infeme idea, and said so in a class paper. This lead in 1996 to a student project in a course taught by Prof. Cheatham and I, in which the students did the initial phases of an implementation of a system based on a customer specification for an infostore system.

Then in 1997 I wrote a prototype infostore system which partly implemented the description in:

This system, which was written in three months for use by students in a class I was teaching at the time, contains a prototype implementation of two important ideas.

The first idea is to upgrade standard LISP memories to have information that make their objects more powerful, particularly in the area of making the objects more displayable. So an infostore is very like a standard LISP memory, shared by one or more processes, but with upgrades to the underlying data types.

For example, the notion of a symbol is upgraded to that of a `word', which supports upper and lower case and a set of flags and properties that can govern the word's display. Examples of flags include a capitalization flag, permitting `Massachusetts' to be stored internally as `massachusetts', and flags indicating whether to join the word to preceding or following words, as when `,' is a `word' joined to preceding words.

The notion of a list is then extended to permit flags and properties to be attached to individual list element positions (not to the list member objects themselves). Thus an element which is a sublist can be marked as a parenthesized phrase, sentence, or paragraph to indicate how it is to be separated from its predecessors and successors. And elements that are words can be marked with the same flags as words, without marking the words themselves. Thus a word can be capitalized at the beginning of a sentence without capitalizing it elsewhere.

In general, the goal is to develop a data structure that upgrades the LISP symbol and cons cell data storage ideas to become a mechanism for storing word processing data. This idea can be extended to other data structures: for example, arrays can be upgraded by adding labels so they display like spreadsheets.

The second idea was to define an infeme to be the set of objects in memory rooted at a particular object, and to give that root object a unique ID as a property. The unique ID is just a 128 bit random number, and thus can be generated by a process independently of other processes. An infeme is therefore a possibly cyclic graph of objects with a particular root object that has a unique ID.

Precisely, an object is in an infeme if there is a path from the infeme's root to the object which does not pass through another infeme root. When an infeme is written out, all its objects are written out, but if one of these contains a pointer to the root of another infeme, only the unique ID of that other infeme is written out. When the infeme is read, if it contains a pointer to another infeme, that infeme is looked up in the reader's memory in order to establish the correct pointer. But if that infeme does not exist in the reader's memory, a stub is allocated for its root object. This stub is an object whose type is not yet determined, though it is known to be an infeme root and does have a unique ID property. If an infeme with this same unique ID is read later, its root object is allocated this same stub, and the stub is filled in with type and value information.

More details on the 1997 version of the infostore system can be found here.

Phase Id: The Unbundled Compiler Rewrite

My thesis advisor, Prof. Thomas Cheatham, developed the `Unbundled Compiler' during the late 1980's and early 1990's. This as a very modular compiler written in less than 10,000 lines of Common Lisp.

I decided to build a new version of this compiler in C++ in order to compile my new programming language, when it was ready. I called this project `UBC'. I wrote about 3000 lines of unfinished code in 1997 and 1998.

One main idea was to implement a storage system for labeled graphs that would remember the history of changes to these graphs, and thus form the basic data structure for a debugger that displayed the execution of a program as a series of rewrites of the program. For example, if the statement `x=y+5' were executed in a context where y equaled 7, the statement would appear to execute by being first rewritten to `x=7+5' and then to `x=12'. Note that rewrite rules are being used here to display the execution of the program, and NOT to define how the program executes. Thus the decision to rewrite `y' to `7' is made by an interpreter that is NOT defined in terms of rewrite rules.

Some investigations of graphs with histories were conducted and resulted in an unfinished paper and some work by students on a prototype. These investigations are reported in:

In order to handle the macro package of C, UBC was augmented with lists that remembered the history of changes to them.

In order to facilitate incremental compilation, the data structures of UBC were organized in a way that permitted them to be stored verbatim on disk, to be reused at a later date.

I then decided to merge the UBC work into the above infostore system. See Phase IIa, Infostore Implementation, below.

Phase Ie: The General Type System (GTS)

I need a type system for the new language that will make type maps (see Phase Ia above) first class citizens. My first effort, done in 1997 and 1998, was a maximally general type system, called GTS, based on logic programming. This used a generalization of the proof methods of Prolog to prove type correctness of programs. The work is described in the unfinished paper:

While GTS may be a great new type system, I decided it is probably risky overkill for a new language. What I intend to try instead is basically a Hindley-Milner type system with the Haskell upgrade and classical (not Haskell) overloading.

The Hindley-Milner type system (originally discovered by Curry) treats types as labeled graphs, and uses labeled graph matching to resolve unknown types.

The Haskell upgrade is to introduce a set of one-valued types, that is, types which have at most one possible value which can be computed when the program is loaded. Function call arguments whose type is one-valued can be implicit, as they can be computed by the system. One-valued types are very useful, for the types of type maps are one-valued. Thus a language with support for one-valued types will support type maps. Haskell embelishes the idea of one-valued types in serveral ways.

Classical overloading permits many different functions to have the same name if they have different types. I feel this is necessary in a professional programming language because otherwise the name space problems of very large libraries become unbearable. With classical overloading, only the type names need be unique, as overloaded function names can be resolved from the types of their arguments and values.

Probably the best approach for implementing classical overloading in a Hindley-Milner type system is to give each use of an overloaded name an unknown type represented by a unique type variable, resolve the most general types on this basis, and then try immediately eliminating as many name options as possible using the information accumulated so far. After this preliminary work, the algorithm ends with an exponential search, which I hope will not consume exponential time in actual practice. Several other type algorithms that have proved useful for practical languages are also worst-case exponential.

In any case, making a new and more potent type system is a risk because one never knows how fast it will run on large scale programs until one has a complete language, a complete implementation, and large programs using these.

Phase II: Implementation

Modern programming languages differ not so much in syntax as in the data structures they support. Thus C++ supports fairly arbitrary low level structures, JAVA supports more limited somewhat higher level structures, and MatLab supports matrices, vectors, scalar numbers, and character strings.

The research on data structures for the new programming language is done, so it is time to proceed to implementation of the data structures.

Modern programming languages also differ somewhat in control flow. How to integrate functional programming and finite state machine style programming (e.g., object oriented programming) is still an open issue. However, this can be addressed by trial and error once the rather fancy data structures for the new language have been implemented.

Phase IIa: Infostore Implementation

Implementation of data structures for the new language began in 1999. The strategy has been to start with the 1997 infostore code referenced above, upgrade that code to have most the features mentioned in the 1997 infostore user's manual, and add new features that were being built into the aborted C++ UBC project.

An introduction to this updated infostore system is here.

The first result should be 5,000 to 10,000 lines of code and be complete as far as basic data structures and operations are concerned. Once this is done, I expect that fancy input/output will become the subject of very protracted experiments, along with associated research on how to mark data with flags and properties so the data may best store word processing and other displayable information.