How machines have been learning (recently and still today)
The term machine learning covers various methods by which algorithms build models that represent the sample data used to train them, so that they can make decisions or proposals without being explicitly programmed to do so. The basis for the learning is a set of data possessing the same characteristics as the data to be generated – as a rule, a truly large dataset, for what is not in it, the AI cannot learn. Recently, diverse machine learning strategies have been applied – supervised learning, unsupervised learning, reinforcement learning – and new variations and developments keep emerging.
In supervised learning, the system is given a series of categorized or labeled examples and told to make predictions about new examples it hasn't seen yet, or for which the ground truth is not yet known [13]. Supervised learning uses labeled datasets, whereas unsupervised learning uses unlabeled datasets. “Labeled” means the data has already been tagged by humans with the requested answer [14].
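For illustration, a minimal supervised-learning sketch (assuming the scikit-learn library; the labeled toy data below is invented for this purpose) could read:

```python
# Minimal supervised-learning sketch: a classifier fitted on labeled examples
# and then asked to predict labels for unseen data (toy data, invented).
from sklearn.linear_model import LogisticRegression

# Labeled training data: each sample is [floor_area, window_area]; the label
# says whether the room was judged adequately daylit (1) or not (0).
X_train = [[20, 2], [30, 6], [15, 1], [40, 10], [25, 3], [35, 8]]
y_train = [0, 1, 0, 1, 0, 1]

model = LogisticRegression()
model.fit(X_train, y_train)                 # learn the mapping from the labels

print(model.predict([[28, 7], [18, 1]]))    # predictions for unseen samples
```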
Unsupervised learning analyzes and clusters unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention – without the need for labeling the datasets. In unsupervised learning, a machine is simply given a heap of data and told to make sense of it, to find patterns, regularities, and useful ways of condensing, or representing, or visualizing it [13, 15].
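For comparison, a minimal unsupervised sketch (again scikit-learn, with the same kind of invented samples but no labels) lets a clustering algorithm find the groupings on its own:

```python
# Minimal unsupervised-learning sketch: k-means groups unlabeled samples into
# clusters without any human-provided labels (toy data, invented).
from sklearn.cluster import KMeans

X = [[20, 2], [30, 6], [15, 1], [40, 10], [25, 3], [35, 8]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)      # cluster assignments discovered from the data

print(labels)                       # groupings, not "answers" – no labels were given
```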
Reinforcement learning concerns how intelligent agents ought to take action in an environment to maximize a notion of cumulative reward. Typically based on the Markov decision process (a discrete- or continuous-time stochastic control process in mathematics) [16], reinforcement learning differs from supervised learning in not needing labeled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, it focuses on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). In other words, placed into an environment with rewards and punishments, [the system is] told to figure out the best way to minimize the punishments and maximize the rewards [13, 17]. As the understanding of learning processes – and especially of how their details shape the outputs – improves, the reward signals used to fine-tune the models tend to be based on human preferences instead of simple automatic metrics (referred to as reinforcement learning with human feedback [18], which, as the subhead Production ecosystems and sensational novelties of this section explains, is to be distinguished from reinforcement learning from human feedback [19]). As indicated further in this section, safety and alignment problems are the starting point for the deployment of these approaches, which are much more time- and cost-consuming. Such is, for example, the case of InstructGPT – one of the most advanced language models today.
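The exploration-exploitation balance can be illustrated by a minimal epsilon-greedy sketch on a toy multi-armed bandit (the reward probabilities below are invented; this is an illustration of the principle, not of any particular system):

```python
# Minimal exploration/exploitation sketch: an epsilon-greedy agent estimating
# the value of three actions purely from reward feedback (toy bandit problem).
import random

true_reward_prob = [0.2, 0.5, 0.8]      # hidden from the agent
value_estimate = [0.0, 0.0, 0.0]
counts = [0, 0, 0]
epsilon = 0.1                           # fraction of the time spent exploring

for step in range(5000):
    if random.random() < epsilon:                         # explore: random action
        action = random.randrange(3)
    else:                                                 # exploit: best estimate so far
        action = value_estimate.index(max(value_estimate))
    reward = 1.0 if random.random() < true_reward_prob[action] else 0.0
    counts[action] += 1
    # incremental average of the rewards observed for this action
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print([round(v, 2) for v in value_estimate])   # should approach [0.2, 0.5, 0.8]
```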
Concerning architectural design and planning, reinforcement learning deserves highlighting as the platform on which the diverse imitation-based learning strategies develop that this paper promotes as the future of AI in architecture; the next sections delve into the topic.
New implementations of learning paradigms and new learning models keep evolving. Countless facets of motivation and reward have been studied and developed in this regard, comprising dopamine-releasing mechanisms, motivation by possible value and/or actual expectation, curiosity, a self-motivated desire for knowledge, imitation and interactive imitation, self-imitation, and transcendence. The results are random network distillation algorithms using prediction errors as a reward signal [20], algorithms approximating a state-value function in a quality-learning framework [21], or knowledge-seeking agents [22]. Unplugging the hardwired external rewards not only makes the algorithm capable of playing dozens of Atari games with equal felicity (similarly, AlphaZero will be recalled as just as adept at chess as it is at shogi or Go [23]). In such algorithms, AI agents become able to come up with their own objectives, measuring intelligence, in the end effect, in terms of how things behave – not in terms of the reward function [24]. These schemes may become a significant part of building a general AI, in contrast to the current models that only specialize in a specific, narrow task.
Imitating the human and other cunning strategies
Shifting the focus from the statistical approach of input-output matching to mimicking (not only) human processes appears to be a key innovation. Indeed, why could a machine not learn by observing a human at work? Still, a machine trained in such a way can, as a rule, outperform a human: by working relentlessly, flawlessly, and much more quickly. Given the nature of experiencing architecture and, most importantly, how architecture emerges in the design process, such approaches could bring the much-sought-after support for the architects' craft. Nonetheless, there is a much broader – almost unlimited – field of deployment for the new philosophy represented by strategies and techniques of imitation learning, self-play, inverse Q-learning, adversarial imitation learning (AIL), action-guided adversarial imitation learning (AGAIL), world imitation learning (WIL), behavioral cloning from observation (BCO), augmented behavioral cloning from observation, inverse reinforcement learning, or transfer learning.
In general, zooming in on a (design) process – “how things come into existence” – instead of the output – “how things shall be” – imitation-based learning is a framework for learning behavior from demonstrations, usually presented as state-action trajectories, with each pair denoting the action to take at the state visited [25]. Imitation-based learning provides three distinct advantages over trial-and-error learning: efficiency, safety, and – which also renders it promising for AI's deployment in architecture – the ability to learn things that are hard to describe [26]. As one of the techniques applied, behavior cloning treats the action as the target label for a state and learns a generalized mapping from state to action. Inverse reinforcement learning (IRL), on the other hand, views the demonstrated actions as a sequence of decisions and aims at finding a reward/cost function under which the demonstrated decisions are optimal. Reinforcement learning generally underpins these techniques; an agent learns a policy – a mapping from states to actions – that maximizes some notion of cumulative reward. The agent iteratively updates its policy based on the feedback (rewards) it receives from the environment. It is inherently a process-oriented approach, as the agent is learning how to act in different states rather than just predicting the outcome of actions.
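A minimal sketch of behavior cloning in this sense (the state-action demonstration below is invented, and scikit-learn is assumed): the demonstrated actions simply become the target labels of a supervised classifier.

```python
# Minimal behavior-cloning sketch: states from an expert demonstration are the
# inputs, the expert's actions are the target labels (invented toy trajectory).
from sklearn.neighbors import KNeighborsClassifier

# Expert demonstration: state = (x, y) position, action = 0:left, 1:right, 2:up
demo_states  = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (1, 2)]
demo_actions = [1, 1, 2, 2, 0, 0]

policy = KNeighborsClassifier(n_neighbors=1)
policy.fit(demo_states, demo_actions)          # learned state -> action mapping

print(policy.predict([(1, 1)]))                # cloned action for a new state
```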
In imitation learning, the primary focus is on mimicking the actions of an expert. The agent tries to replicate the expert’s behavior as closely as possible. While the outcomes of actions are certainly important (they can signal how well the agent is doing), the main goal is to learn the expert’s policy – the sequence of actions the expert takes in different situations [25].
A radical alternative, world imitation learning aims to achieve performance similar to the expert's without explicitly defining a reward function. Learning by observing the expert's actions in a given environment, the agent constructs a model of the world (a “world model”) based on the observed data. The world model captures the dynamics of the environment, including state transitions and outcomes of actions. To encourage imitation, an intrinsic reward is defined within the latent space of the world model; this reward guides the agent's exploration during reinforcement learning. Addressing the problem of covariate shift (adapting to changes in the input data distribution during model training, adjusting the learning strategy when the rules of the game change [27]), where the data distribution during training differs from that during deployment, WIL can be applied offline, using historical data without requiring additional online interactions. The approach leverages world models to bridge the gap between expert demonstrations and reinforcement learning. Combining imitation learning with reinforcement learning, the agent improves its performance over time [28, 29, 30]. While applied successfully in robotics, autonomous driving, gaming, and other domains where expert behavior can serve as a guide, the possibility of application in architectural design has escaped attention so far. The world model would certainly not capture the doings in the architect's studio but the architect's actions in the model of the emerging architecture in a design or CAD (computer-aided design) software environment; the paper will come back to the topic.
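As a very rough, purely illustrative sketch of the idea (all data below is invented, and a real world model would be a learned latent neural network rather than the linear model fitted here), the agent could fit a dynamics model to the expert's observed transitions and treat the model's prediction error as an intrinsic signal during its own training:

```python
# Rough sketch of the world-model idea: fit a (here linear) dynamics model
# next_state ≈ [state, action] @ W to expert transitions, then use its
# prediction error as an intrinsic reward signal (all data invented).
import numpy as np

rng = np.random.default_rng(0)
S  = rng.normal(size=(200, 4))                 # expert states (invented)
A  = rng.normal(size=(200, 2))                 # expert actions (invented)
Sn = 0.9 * S + A @ rng.normal(size=(2, 4))     # next states under unknown dynamics

X = np.hstack([S, A])                          # fit the dynamics model by least squares
W, *_ = np.linalg.lstsq(X, Sn, rcond=None)

def intrinsic_reward(s, a, s_next):
    """Negative prediction error: high (near zero) for expert-like transitions."""
    predicted = np.hstack([s, a]) @ W
    return -np.linalg.norm(predicted - s_next)

print(round(intrinsic_reward(S[0], A[0], Sn[0]), 6))        # ≈ 0 for an expert transition
print(round(intrinsic_reward(S[0], A[0], Sn[0] + 1.0), 6))  # clearly negative off-model
```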
The self-playing agent plays against itself and updates its policy based on the outcome of each game. Its goal is to find a policy that maximizes the expected cumulative reward. The outcomes of the games drive the learning of the agent, who nevertheless cares about the actions taken, since they determine the outcome [31].
Q-learning is a model-free, value-based, off-policy algorithm that searches for the best series of actions based on the agent’s current state. With Q standing for quality, inverse Q-learning is a method for dynamics-aware learning that avoids adversarial training by learning a single Q-function, implicitly representing both reward and policy. It obtains state-of-the-art results in offline and online imitation learning settings, surpassing other methods in both the number of required environment interactions and scalability in high-dimensional spaces [32].
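For orientation, a minimal sketch of the basic tabular Q-learning building block that these methods extend (the toy corridor environment is invented for illustration; inverse Q-learning itself goes further by recovering the reward implicitly):

```python
# Minimal tabular Q-learning sketch on a toy 5-cell corridor: the agent starts
# in cell 0 and is rewarded only for reaching cell 4 (environment invented).
import random

n_states, moves = 5, [-1, +1]               # action 0: left, action 1: right
Q = [[0.0, 0.0] for _ in range(n_states)]   # Q[state][action]: quality estimates
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(500):
    s = 0
    while s != 4:
        if random.random() < epsilon:                     # explore
            a = random.randrange(2)
        else:                                             # exploit
            a = max(range(2), key=lambda i: Q[s][i])
        s_next = min(max(s + moves[a], 0), n_states - 1)
        r = 1.0 if s_next == 4 else 0.0
        # Q-learning update: nudge toward reward + discounted best future value
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print([max(range(2), key=lambda i: Q[s][i]) for s in range(4)])  # greedy policy: all "right"
```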
Adversarial imitation learning is a class of state-of-the-art algorithms commonly used in robotics. In AIL, an artificial adversary’s misclassification serves as a reward signal, which is subsequently optimized by any standard reinforcement learning algorithm [33].
Action-guided adversarial imitation learning processes even demonstrations with incomplete action sequences. AGAIL divides the state-action pairs in demonstrations into state trajectories and action trajectories and learns a policy from states with auxiliary guidance from available actions [34].
Behavioral cloning from observation is a two-phase, autonomous imitation learning technique aiming to provide improved performance in diverse fields such as healthcare, autonomous driving, and complex game-playing [35]. Combining imitation learning and model-based learning, the method first lets the agent learn its agent-specific inverse dynamics model and only then shows the agent the state-only demonstrations. The agent can use its model to infer the missing actions that the expert took. Once the actions are inferred, the agent can use imitation learning.
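A condensed sketch of the two phases (a toy deterministic line world invented for illustration, scikit-learn assumed) could look as follows:

```python
# Condensed BCO sketch on a toy line world: (1) learn an inverse dynamics model
# (s, s') -> a from the agent's own exploratory interactions, (2) infer the
# missing actions of a state-only expert demo, (3) clone the inferred actions.
import random
from sklearn.neighbors import KNeighborsClassifier

step = {0: -1, 1: +1}                                    # effect of each action

# Phase 1: agent-specific inverse dynamics model from exploratory self-interaction
transitions = []
for _ in range(2000):
    s = random.randint(-10, 10)
    a = random.randrange(2)
    transitions.append(((s, s + step[a]), a))
inv_dyn = KNeighborsClassifier(n_neighbors=1)
inv_dyn.fit([t[0] for t in transitions], [t[1] for t in transitions])

# Phase 2: state-only expert demonstration (actions missing) -> inferred actions
expert_states = [0, 1, 2, 3, 4, 5]
pairs = list(zip(expert_states[:-1], expert_states[1:]))
inferred_actions = inv_dyn.predict(pairs)

# Phase 3: ordinary behavioral cloning on the inferred state-action pairs
policy = KNeighborsClassifier(n_neighbors=1)
policy.fit([[s] for s in expert_states[:-1]], inferred_actions)
print(policy.predict([[2]]))                             # expected: [1] (move right)
```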
Augmented behavioral cloning from observation overcomes BCO's problem of reaching bad local minima by a self-attention mechanism that better captures global features of the states and a sampling strategy that regulates the observations used for learning [36].
In inverse reinforcement learning, the goal of the apprentice agent is to find a reward function from the expert demonstrations that could explain the expert behavior [37].
Applied beneficially in robotics, transfer learning is like borrowing knowledge from one task to help with another. It starts with a pre-trained model that has learned useful features from a large dataset, and then adapts the model to a new task by removing the top layers of the neural network, adding new layers for the specific task, and training the modified model on a smaller dataset. The strategy is of value not only to robotics; for production architecture, as coined in section (4), transfer learning is another technique that awaits exploration. By avoiding training from scratch, the transfer learning approach saves time; in addition, it can work even with limited data [38, 39, 40].
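A minimal transfer-learning sketch along these lines (here assuming PyTorch and a torchvision ResNet-18 backbone as the pre-trained model; the number of target classes is an arbitrary placeholder):

```python
# Minimal transfer-learning sketch: reuse a pre-trained backbone, freeze its
# weights, and train only a new task-specific head (PyTorch/torchvision assumed).
import torch.nn as nn
import torch.optim as optim
from torchvision import models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pre-trained features

for param in backbone.parameters():           # freeze the borrowed knowledge
    param.requires_grad = False

num_classes = 5                                               # arbitrary target task
backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)  # new trainable head

# Only the new head's parameters are optimized; the rest of the network stays fixed.
optimizer = optim.Adam(backbone.fc.parameters(), lr=1e-3)
# ...then train on the smaller, task-specific dataset as usual...
```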
Another “next level” of the extrinsic-reward-free schemes comes with interaction, which allows the algorithm to work properly while requiring incredibly little feedback, as Ross's dataset aggregation (DAgger) has shown [41]. Self-imitation and transcendence rank among the “top” of recent learning schemes [42].
Multiple other sophisticated strategies join in to simplify the learning and increase the performance of the models. In-context learning allows “old models” to learn new tasks when provided with just a few examples (input-output pairs). Like the strategies based on imitation, these approaches, unfortunately, have not even been tried in architectural design, which, when attempting to embrace AI, continues to stick to supervised-learning schemes (as this section shows later). (The expressive image processing, in which Midjourney, Stable Diffusion, and other applications show encouraging results, stays aside: they help architects only indirectly, in the fields of analyses, research, or thematization – they do not directly participate in designing the architecture itself.) An opportunity awaits: it would be a mistake to miss it.
Allowing models for new tasks to be built quickly without demanding fine-tuning, the strategy is often used with large language models (LLMs), which learn in context, using the provided examples without adjusting their overall parameters. Instead of fine-tuning the entire model, it is given a prompt (a list of examples) and asked to predict the next tokens based on that prompt. In-context learning works even when provided with random outputs in the examples. Traditional supervised learning algorithms would fail with random outputs, but in-context learning still succeeds, as if performing some implicit Bayesian inference, using all parts of the prompt (inputs, outputs, formatting) to learn [43].
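A minimal illustration of such a few-shot prompt (the example values are invented, and the `complete` function is a hypothetical stand-in for whatever LLM completion call is available):

```python
# Minimal in-context-learning sketch: the "training" happens entirely inside the
# prompt; no model parameters are updated. The example values are invented, and
# complete() is a hypothetical placeholder for an LLM completion call.
few_shot_prompt = """Room: open-plan office    -> suggested daylight target: 3%
Room: residential bedroom -> suggested daylight target: 1.5%
Room: school classroom    -> suggested daylight target: 4%
Room: hospital ward       -> suggested daylight target:"""

def complete(prompt: str) -> str:
    """Hypothetical stand-in; in practice this would call an LLM API."""
    raise NotImplementedError

# answer = complete(few_shot_prompt)   # the model infers the pattern from the examples
```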
As opposed to closed machine-learning strategies (to some extent, the vast majority of the state-of-the-art models), open-ended machine-learning algorithms learn how to learn from the data they encounter; they adapt and refine themselves based on past examples, and when faced with new data, they predict outcomes or classify information without human intervention. Recommendation engines, speech recognition, fraud detection, or self-driving algorithms tend to adopt the strategy [44, 45].
A clever approach used specifically with LLMs, the tree of thoughts (ToT) machine-learning approach is best imagined as a tree where each branch represents a coherent sequence of language (a “thought”). These thoughts serve as intermediate steps toward solving a problem through deliberate decision-making: instead of just making decisions one after the other (like reading a sentence left to right), ToT allows the model to consider multiple reasoning paths, evaluate its choices, and decide the best next step. Thus, the model's problem-solving abilities are enhanced: it can think more strategically, explore different paths, and make better decisions [46, 47].
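Schematically, and with the `propose_thoughts` and `evaluate` functions left as hypothetical stand-ins for the LLM calls a real implementation would make, the search over branches could be sketched like this:

```python
# Schematic tree-of-thoughts sketch: keep several partial reasoning paths alive,
# score them, and expand only the most promising ones (breadth-limited search).
# propose_thoughts() and evaluate() are hypothetical stand-ins for LLM calls.

def propose_thoughts(path):
    """Hypothetical: ask the model for candidate next 'thoughts' given a path."""
    return [path + [f"step {len(path)}, option {i}"] for i in range(3)]

def evaluate(path):
    """Hypothetical: ask the model to score how promising a partial path is."""
    return -len(path[-1])            # placeholder heuristic

def tree_of_thoughts(problem, depth=3, beam=2):
    frontier = [[problem]]                            # each element is one reasoning path
    for _ in range(depth):
        candidates = [p for path in frontier for p in propose_thoughts(path)]
        candidates.sort(key=evaluate, reverse=True)   # keep the best-scored paths
        frontier = candidates[:beam]
    return frontier[0]                                # the most promising full path

print(tree_of_thoughts("How to arrange three rooms along a corridor?"))
```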
Commonly used for pathfinding problems in video games, robotics, and more (though not in architectural design tools), the A* algorithm is an informed best-first search algorithm that efficiently finds the shortest path through a search space using a heuristic estimate. It combines the best features of breadth-first search (BFS) and Dijkstra's algorithm; nevertheless, unlike BFS, which explores all possible paths, A* focuses on the most promising paths based on a heuristic function that guides it toward the goal state, making it more efficient than traditional search algorithms. A* can be implemented in Python to find cost-effective paths in graphs [48, 49].
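A minimal grid-based sketch of such an implementation (the grid and the Manhattan-distance heuristic are chosen purely for illustration):

```python
# Minimal A* sketch on a small grid: 0 = free cell, 1 = wall. The Manhattan
# distance serves as the admissible heuristic guiding the search to the goal.
import heapq

def a_star(grid, start, goal):
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])   # heuristic estimate
    open_set = [(h(start), 0, start, [start])]                # (f, g, node, path)
    visited = set()
    while open_set:
        f, g, node, path = heapq.heappop(open_set)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] == 0:
                # priority f = cost so far (g) + heuristic estimate of the rest
                heapq.heappush(open_set, (g + 1 + h((nr, nc)), g + 1, (nr, nc), path + [(nr, nc)]))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(a_star(grid, (0, 0), (2, 0)))   # shortest path around the walls
```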
The Dreamer agent is a reinforcement learning strategy that combines world models with efficient learning techniques. Used in robotics, games, and real-world scenarios (then why still not in architectural design?), this technique is like teaching an AI to dream, plan, and act smartly. Dreamer learns a simplified model of the environment (a world model) from raw images; this world model predicts what will happen in the environment based on the agent's actions. Instead of trial and error, Dreamer imagines thousands of possible action sequences in parallel; by computing compact model states from raw images, it learns from predictions using just one GPU. Using a value network to predict future rewards and an actor network to choose actions, Dreamer considers rewards beyond the immediate future. These networks help Dreamer make informed decisions even in new, unknown situations [50, 51].
Autoassociative algorithms learn to remember patterns and retrieve them from partial or noisy input. A smart way to remember things even when the details are fuzzy, these algorithms create a compressed representation of data, mapping input patterns onto themselves. Then, when given a distorted or incomplete input, they reconstruct the original pattern, proving helpful in denoising (removing noise from data), memory recall (helping to recognize familiar patterns), or anomaly detection (raising an alert if something does not match the learned patterns) [52].
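A classic minimal example is a Hopfield-style autoassociative memory (the binary toy patterns below are invented for illustration):

```python
# Minimal autoassociative (Hopfield-style) memory: store binary patterns with a
# Hebbian rule, then recover a stored pattern from a corrupted version of it.
import numpy as np

patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])       # invented ±1 patterns

n = patterns.shape[1]
W = sum(np.outer(p, p) for p in patterns) / n             # Hebbian weight matrix
np.fill_diagonal(W, 0)                                    # no self-connections

def recall(x, steps=5):
    x = x.copy()
    for _ in range(steps):                                # iterate toward a stable state
        x = np.where(W @ x >= 0, 1, -1)
    return x

noisy = np.array([1, -1, 1, -1, 1, -1, -1, 1])            # pattern 0 with two bits flipped
print(recall(noisy))                                      # recovers the stored pattern 0
```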
Though originally developed for the Atari 2600, the observe-and-look-further learning strategy still addresses today's challenges like reward processing, long-term planning, and efficient exploration. The insights gained from Atari environments can be applied to other domains, such as robotics, natural language processing, and autonomous vehicles. Designed to tackle challenges in reinforcement learning, the algorithm processes rewards of varying densities and scales, so it can handle different types of rewards effectively. Using an auxiliary temporal consistency loss, the algorithm can reason over extended time periods, which is crucial for complex tasks. To address the exploration problem more efficiently, the algorithm leverages human demonstrations [53].
The first-return-and-then-explore strategy is another clever approach in reinforcement learning that helps agents explore their environment more effectively. Returning to familiar states helps the agent remember promising locations – preferably spots with potential rewards or interesting features. By revisiting them, the agent reinforces its memory. Once known states have been revisited, the agent can confidently venture into uncharted territory and make smarter decisions. The strategy helps agents handle sparse rewards (where feedback is infrequent) and deceptive feedback (where rewards can mislead). Balancing curiosity with wisdom, the approach has shown impressive results in solving challenging reinforcement learning tasks, from video games to robotics (and perhaps one day it can perform in the architectural design realm, too) [54].
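A drastically simplified sketch of the return-then-explore loop (a deterministic toy corridor invented for illustration; the "remembered states" of the archive are simply the states themselves):

```python
# Drastically simplified first-return-then-explore sketch on a deterministic
# corridor: remember how each state was reached, return there by replaying the
# stored actions, then explore further from that frontier (toy environment).
import random

GOAL = 12

def step(state, action):                    # deterministic toy dynamics
    return max(0, min(GOAL, state + action))

archive = {0: []}                           # state -> action sequence that reaches it

for _ in range(500):
    start = random.choice(list(archive))    # pick a remembered state...
    state = 0
    for a in archive[start]:                # ...and first RETURN to it by replay
        state = step(state, a)
    trajectory = list(archive[start])
    for _ in range(5):                      # ...then EXPLORE onward from there
        a = random.choice([-1, +1])
        state = step(state, a)
        trajectory.append(a)
        if state not in archive or len(trajectory) < len(archive[state]):
            archive[state] = list(trajectory)   # keep the shortest known route

print(sorted(archive))                      # covers the whole corridor 0..12
print(archive.get(GOAL))                    # an action sequence reaching the goal
```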
Self-supervised learning, too, represents a promising path to advance machine learning. As opposed to supervised learning, which is limited by the availability of labeled data, self-supervised approaches can learn from vast unlabeled data [55], the model generating its own labels from the input data. Given a specific task, the self-supervised learning approach can outperform the reinforcement learning strategy in terms of data efficiency (learning from the structure of the input data itself, without needing explicit rewards or penalties), generalization (learning a more general understanding of the data rather than optimizing for a specific reward function), stability (self-supervised learning often has more stable and predictable training dynamics than reinforcement learning, which, especially in environments with sparse or delayed rewards, can sometimes be unstable or difficult to train), and, in general, in being less demanding of human-performed interventions such as reward design [56].
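A minimal self-supervised sketch (a toy mask-and-predict pretext task on an invented numeric sequence; the labels are generated from the data itself, not provided by humans):

```python
# Minimal self-supervised sketch: the "labels" are produced from the data itself
# by masking one value in each window and asking the model to predict it back.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=2000))       # unlabeled raw data (invented)

# Pretext task: from each window of 5 values, hide the middle one and predict it.
windows = np.lib.stride_tricks.sliding_window_view(series, 5)
X = np.delete(windows, 2, axis=1)               # visible context (4 values)
y = windows[:, 2]                               # self-generated target (the masked value)

model = LinearRegression().fit(X, y)
print(round(model.score(X, y), 3))              # structure learned without any labels
```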
Self-supervised learning is the strategy that, among others, “runs” Sora – a text-to-video model developed recently by OpenAI [57]. Mentioned further in this paper, Sora is designed to understand and simulate the physical world in motion, to solve problems that require real-world interaction – similar to how poiétic architectural concept designs come into existence. As promised, this paper will further delve into the idea of architectural robots elaborating designs of production architecture (the term coined in section (4)), making use of imitation-based, self-learning, and other novel machine learning strategies.
Finally, and importantly, heading toward AI that understands and predicts others' actions as humans do, the machine theory of mind (ToM) presents an intriguing concept inspired by how humans understand each other based on past behavior. Analogously, the machine observes how agents (people or robots) behave. Using meta-learning (learning how to learn), the machine builds models of the agents and predicts how they might act in the future, even in unprecedented situations. Such a strategy enables better interaction with humans, advances multi-agent AI where multiple AIs collaborate or compete, and makes AI more interpretable, transparent, and safe [58].