Gerald Jay Sussman gave an interesting presentation¹ on artificial intelligence at the European Lisp Symposium. He opened by prompting a GPT model to define a nonexistent theorem, and, surprisingly, it produced a reply that could pass the Turing test.
How did the GPT come up with the answer? He explained that “[GPT] seems to have assimilated a huge amount of data from the network … It does much scraping of the network and trains up [an] enormous function approximation machine, implemented as some neural net, which is a composition of very large functions with a very large number of parameters and, using the data it has, it employs the statistical methods to predict a continuation of the text that is either received from the person or the text that it is adding to this long piece of this corpus of text.”
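To make “predict a continuation of the text” concrete, here is a minimal sketch of the idea in Python. It uses a toy bigram word-frequency table instead of a neural network; the tiny corpus, the function name, and the greedy selection are illustrative assumptions, not anything from the talk or from an actual GPT implementation. The shared principle is only that the next token is chosen statistically from what tends to follow the text so far.

```python
# Toy illustration of statistical text continuation: count which word tends to
# follow which, then keep emitting the most likely next word. Real GPT models
# replace these counts with a huge neural network over sub-word tokens, but the
# objective -- predict the next token given the text so far -- is the same idea.
from collections import Counter, defaultdict

corpus = (
    "the cat sat on the mat . the dog sat on the rug . "
    "the cat chased the dog ."
).split()

# Bigram counts: how often each word follows each other word (toy data only).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def continue_text(prompt: str, max_words: int = 8) -> str:
    """Greedily extend the prompt with the statistically most frequent next word."""
    words = prompt.split()
    for _ in range(max_words):
        candidates = following.get(words[-1])
        if not candidates:  # no observed continuation for this word: stop
            break
        words.append(candidates.most_common(1)[0][0])
    return " ".join(words)

if __name__ == "__main__":
    # Extends the prompt word by word using only the observed frequencies.
    print(continue_text("the dog"))
```

The output is fluent-looking but meaningless, which is exactly the failure mode the quote is describing, only at a vastly smaller scale.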
That quote describes the GPT’s reply: grammatically perfect and superficially coherent, yet nonsensical. Are the developers taking shortcuts, launching an unfinished product to win the AI race, or are they convincing people that they are the first to invent the holy grail of AI?
Based on Sussman’s presentation, however, neither is the case: the nonsensical reply stems from a different impossibility altogether. He pointed out that the machine, GPT in particular, differs from the human mind, citing Chomsky’s Poverty of Stimulus argument. He expounded, “There is π × 10⁷ seconds in a year. On average, no child learns more than one new utterance every minute, and the child sleeps half the time. So it hears at most 10⁶ utterances, but most are redundant and not very grammatical, yet after three years, kids are pretty sophisticated in language and common sense knowledge.”
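The arithmetic in that quote is easy to reproduce. The per-minute rate, the waking fraction, and the three-year window come directly from it; the rest is unit conversion. This is a rough sketch of the numbers, not anything Sussman showed.

```python
# Back-of-the-envelope check of the Poverty of Stimulus numbers in the quote:
# pi * 10^7 seconds per year, at most one new utterance per minute,
# and the child is asleep half the time.
import math

seconds_per_year = math.pi * 1e7   # ~3.14e7 s, the approximation in the quote
waking_fraction = 0.5              # the child sleeps half the time
utterances_per_minute = 1          # upper bound from the quote
years = 3

waking_minutes = seconds_per_year * years * waking_fraction / 60
utterances = waking_minutes * utterances_per_minute
print(f"upper bound: ~{utterances:.1e} utterances in {years} years")
# -> upper bound: ~7.9e+05 utterances in 3 years, i.e. on the order of 10^6,
#    versus the hundreds of billions of tokens typically scraped to train a large GPT.
```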
The increasing interest in, development of, and evolution of AI ushered in the fourth industrial revolution. Yet today’s AI cannot match the efficiency of a child’s mind. In theory, developers could devise a monolithic machine to run such a simulation, but how much power would it take to replicate, simulate, and design a human mind?
Quoting from NVIDIA’s website² (a rough sanity check of these figures follows the list):
- It is no longer possible to fit model parameters in the main memory of even the largest GPU.
- Even if the model can be fitted in a single GPU (for example, by swapping parameters between host and device memory), the high number of compute operations required can result in unrealistically long training times without parallelization. For example, training a GPT-3 model with 175 billion parameters would take 36 years on eight V100 GPUs, or seven months with 512 V100 GPUs.
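The two figures quoted above are roughly consistent with near-linear parallel scaling. The following back-of-the-envelope check uses only the numbers in the quote; it is a sketch, not an implementation detail from NVIDIA.

```python
# Sanity check on NVIDIA's quoted figures: if training parallelizes almost
# linearly, 36 years on 8 V100s should shrink to roughly 7 months on 512 V100s.
years_on_8_gpus = 36
gpus_small, gpus_large = 8, 512

ideal_years = years_on_8_gpus * gpus_small / gpus_large   # perfect linear scaling
ideal_months = ideal_years * 12
print(f"ideal linear scaling: {ideal_months:.1f} months on {gpus_large} GPUs")
# -> ideal linear scaling: 6.8 months on 512 GPUs, close to the quoted "seven months"
```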
It is fair to agree with Sussman that a monolithic machine like this “needs the energy of a small city for a good part of a year!” It is also worth pointing out that this covers only verbal interaction. How much more would it take to build and train a machine that covers all the behaviors we consider intelligent human behavior? Such an undertaking might need an energy source that can power an entire country. Can we sustain that much extra power consumption?
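To put even a crude lower bound on that power question, the sketch below combines the 512-GPU, seven-month figure from the NVIDIA quote with one assumption not found in either source: a nominal 300 W board power per V100. It deliberately ignores CPUs, networking, cooling, and the many exploratory runs that precede a successful one, so the true consumption is considerably higher.

```python
# Very rough lower bound on the electricity for one 175B-parameter training run,
# using NVIDIA's "seven months on 512 V100 GPUs" figure. The 300 W per GPU is an
# assumed nominal board power; CPUs, networking, cooling, and failed runs are
# all ignored, so real consumption is substantially higher.
gpus = 512
watts_per_gpu = 300            # assumption: nominal V100 board power
months = 7
hours = months * 30 * 24       # ~5,040 hours

energy_kwh = gpus * watts_per_gpu * hours / 1000
print(f"~{energy_kwh:,.0f} kWh for the GPUs alone")   # ~774,144 kWh, i.e. ~0.77 GWh
```

Even this GPU-only floor is in the hundreds of megawatt-hours for a single training run, which is why the probing, testing, and repeated experiments discussed next matter so much.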
And we are only talking about the training, not the probing of the model or the learning we do from it afterward. We can hardly investigate these machines scientifically: even the people who make them can barely afford to, given the substantial production and testing costs, and it takes hundreds of trials to grasp what the thing is actually doing. The issue here is not sustainability but the plain reality of how much energy it takes to process the data and to train, run, and test such complex systems. Any projection of the future should be contextualized with this in mind.
References:
1. European Lisp Symposium. “Artificial Intelligence: A Problem of Plumbing?” YouTube, July 4, 2023. https://www.youtube.com/watch?v=CGxbRJPCoAQ.
2. NVIDIA Technical Blog. “Scaling Language Model Training to a Trillion Parameters Using Megatron.” March 22, 2023. https://developer.nvidia.com/blog/scaling-language-model-training-to-a-trillion-parameters-using-megatron/.