All the non-proprietary code running in the robot hosts currently. The cognitive architecture and language
processor are available currently only with a signed NDA. Contact firstname.lastname@example.org
for more info.

Special thanks to my friend Scott Walker of
Walker Machine for his impeccable work in fabricating my
intricate and complex designs used in these robots.
Back in 2015, working on several programs that were designed to interact with humans in a human way, I began to wonder about the possibility
of human level cognition in software-based systems. And so, I set out to discover for myself if this was a possibility. More specifically, I
wanted to answer the questions that were posed by the possibility of sentient, as well as sapient - basically - conscious, machines.
Thus began a quest that has spanned over three years and thousands of hours, and tens of thousands of lines of code. I have reached my
conclusions, but the journey is not complete. There are a myriad of challenges, and every one requires us to develop a significant level
of introspection into our own, human, working processes. As I have learned more and more about how to make robots be "human", I have learned
perhaps even more about what it means to be human, and how human cognition itself works. The one conclusion I will reveal at
the outset is this: the hype around neural networks and machine learning - while immensely useful and fruitful in many fields - is not the path
in-and-of-itself to Artificial General Intelligence (AGI). To build AGI requires the same processes that a child uses to learn. We must create
a personally navigable (personally meaning - to the AGI system) representation of the world and the understandings of those who are influential to
it. But to build this, we must first build the methods to discern the world, and the meaningful information presented by it, in a way that allows
us to further process and internalize it.
A Constellation of Complexity
Initial development of face recognition and mechanical face tracking with the gimballed eyes.
How do you, the reader, perceive the world? Certainly through sight, hearing, feeling, taste, scent... but we also have these same senses
represented internally. You can "see" an image in your mind. You can also hear, feel, smell and recreate tastes within the sphere of your mind.
In order for any system - organic, or silicon - to begin to attain anything close to what we consider consciousness, it must be at least capable
of apprehending the world around it in a useful manner. So the very initial conditions that must be satisfied in the hardware/software
environment of any system that would host anything even close to a conscious cognitive architecture are those that will allow a useful apprehension
of the world around it, namely, sight, hearing, feeling of some type; taste and scent are debatable. And certainly some methodology for putting
out into that world some creation of its own, based on its processing and interpretations of its senses of the world around it - for example,
speech. Subsequent generations could even include what could be called, art.
So, my robotic designs initially required these at the minimum in order to couch a cognitive architecture capable of rising above the
current level of "chatbots" that were horrible at, well, basically anything resembling any form of coherent conversation.
And so I designed a robust system that incorporated each of the following systems as the foundational layer to support a cognitive
architecture:
Early development of the motor functions for the eyes, head and neck.
Sight: Dual High-Res Cameras feeding to OpenCV for visual processing that included real-time:
  Face Recognition with "liveness" verification
  Robust Object Detection for literally thousands of ordinary objects
  Stereoscopic Depth Mapping (Depth Perception)
Hearing: Quadrophonic Microphones to allow:
  Speaker Identification (Speaker Diarization)
  Directional Location of audio phenomena
  Speech Recognition (Speech to Text)
Speech: A robust, non-robotic and modulatable (SSML) Text-to-Speech system
Motor Functions: Face Recognition and Object Recognition are useful,
but being able to mechanically track them in the visual field is critical. In addition, motor functions add a layer of non-verbal speech
augmentation that adds a very human touch to communication when properly synchronized with speech with embedded codings.
Interconnected Networked Communication Infrastructure: ZeroMQ.
The ability to allow each of these distributed systems to function independently, yet with continual
communications connectivity between each module, was critical to smooth operation. Weighted outputs allow the visual system to
communicate immediately with the speech center if needed, or the visual or audio system to communicate with motor functions without
intervening interpretations, roughly equivalent to the autonomic nervous system - reacting to loud sounds, sudden movements, etc.
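The weighted routing described above can be sketched roughly as follows. This is only an in-process stand-in built on Python's standard library, not the robot's actual ZeroMQ code: in a distributed setup each inbox would instead be a ZeroMQ socket owned by a separate module. The module names, the `Router` class and the `REFLEX_THRESHOLD` value are all assumptions made for illustration.

```python
import queue
from dataclasses import dataclass, field

# Hypothetical cut-off: messages at or above this weight skip interpretation.
REFLEX_THRESHOLD = 0.8


@dataclass(order=True)
class Message:
    priority: float                       # negated weight, so heavier messages sort first
    source: str = field(compare=False)
    event: str = field(compare=False)


class Router:
    """In-process stand-in for the messaging links between modules."""

    def __init__(self):
        # Hypothetical module names, one priority inbox per module.
        self.inboxes = {
            name: queue.PriorityQueue()
            for name in ("cognition", "motor", "speech")
        }

    def publish(self, source, event, weight, reflex_target="motor"):
        msg = Message(-weight, source, event)
        if weight >= REFLEX_THRESHOLD:
            # Urgent input goes straight to an output module, like a reflex.
            self.inboxes[reflex_target].put(msg)
        else:
            # Everything else waits for higher-level processing.
            self.inboxes["cognition"].put(msg)


router = Router()
router.publish("audio", "loud_bang", weight=0.95)    # lands in the motor inbox
router.publish("vision", "face_seen", weight=0.40)   # lands in the cognition inbox
```

The design point is that the sender attaches the weight, so no central interpreter has to look at an urgent message before the motors or the speech center can react to it.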
With the above functions in place, I had a robust system that would serve as a cradle for a cognitive architecture capable
of utilizing these input and output functions by processing the inputs based on ongoing experiences and learnings, and generating responses and
response potentials of increasing complexity and competency as experience increased. Some of this learning was neural network based, but some
was much simpler.
Creating a Brain is Easy; Creating a Mind is Not
Sometimes, Brawn is better than Brains
An early video update showing several functions, including speech recognition, speech output, motor functions and
face recognition/learning progress
In order to create a robot that can learn about the world around it, as well as interact with people in its environment, it had to do more
than simply recognize speech, it had to understand it. There is nothing inherent in a software based system that understands human language.
Speech Recognition uses statistical models and trained data to create very good text transcriptions of the speech it hears, but nowhere in there
does the system understand one syllable of what is being spoken.
Programmers have become adept at creating programs that make it appear, to a human, that the computer understands them in limited
interactions, but this is artistry, and in truth, just a set of hard-coded structures that look at the input text derived from the heard speech
and perform more-or-less complicated "if-then" functions on what it hears. If the computer hears "hello", "hi", "howdy", then respond with
"hello to you too". Still, at no point does the computer "know" what is being exchanged. It sees a binary representation of text that matches
a pre-coded binary representation of other text and it outputs a binary representation because it is told to do so by program code. Period.
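To make the point concrete, here is a minimal sketch of that kind of hard-coded matching. The function name and the exact matching rule are assumptions for illustration; the greeting words and the reply are the ones from the example above.

```python
GREETINGS = {"hello", "hi", "howdy"}


def canned_reply(text):
    """Return a scripted reply if the first word is a known greeting, else None."""
    words = text.lower().split()
    if words and words[0].strip(",.!?") in GREETINGS:
        return "hello to you too"
    return None


print(canned_reply("Howdy!"))
print(canned_reply("What time is it?"))
```

Nothing in this function represents what a greeting is. It compares strings and returns a string.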
So I knew that, in order for a system to be truly functional in an autonomous, and generally functional sense, it needed to be able to understand
more about what it was "hearing" than this. Many researchers are working on the problem of extracting actual meaning from received communication.
Most of these efforts are centered around statistical analysis of enormous amounts of text corpora (and I really mean enormous - hundreds of gigabytes
of text data - literally enough that if it was printed, single spaced on double sided paper, the stack of printed paper would stretch nearly 2,300
miles high) just to be able to determine if someone has asked it a question, or made a statement... or uttered complete word salad for that matter.
It's great that we have this ability, and processing power, but this is NOT how a human mind learns.
And so, I set out to determine how a human mind actually understands things like - whether someone speaking to me has uttered a statement, or
asked a question. And fortunately, I had a mind available to study - my own, of course.
I began with the task of simply distinguishing a question from a statement - a task that researchers, as I mentioned above, are throwing enormous
processing power and time towards. TensorFlow and other neural network libraries are all the rage now, but they are not always the answer. Over
the course of a single afternoon, I realized that my brain knew whether it was being asked a question, or hearing a statement, from the
first two words of the utterance I heard. And furthermore, there was a very specific set of words that could start a question - but
it was not just the usual who, what, where, why, when, how words - because those same words can start statements as well as questions. There was
something more - it was the specific combinations of words. And I formalized them into a spreadsheet, which I then converted into the first
piece of code I wrote for the language processing portion of my cognitive architecture pyramid. See the image below.
This first step proved immensely successful at classifying input utterances with near instantaneous processing time. Of course, it required
further tweaking to compensate for the oddities of human speech, such as injecting phrases like "well", or "so" at the beginning of a statement
or question, and I added further code to detect and function correctly within that context as well.
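A minimal sketch of the two-word approach might look like the following. The word pairs and filler words here are a small illustrative sample chosen for this example, not the table from the spreadsheet, and the function name `classify` is hypothetical.

```python
import re

# Illustrative opening word pairs only; the full table is not reproduced here.
QUESTION_OPENERS = {
    ("what", "is"), ("what", "are"), ("who", "is"), ("where", "is"),
    ("when", "did"), ("why", "did"), ("how", "do"), ("how", "many"),
    ("do", "you"), ("can", "you"), ("are", "you"), ("is", "it"),
}

# Words people inject at the start of an utterance (assumed sample).
FILLERS = {"well", "so", "um", "uh"}


def classify(utterance):
    """Label an utterance from its first two meaningful words."""
    words = re.findall(r"[a-z']+", utterance.lower())
    # Skip leading fillers such as "well" or "so".
    while words and words[0] in FILLERS:
        words.pop(0)
    if len(words) < 2:
        return "unknown"
    return "question" if tuple(words[:2]) in QUESTION_OPENERS else "statement"


print(classify("Well, what is your name"))       # question
print(classify("What you said was helpful"))     # statement
```

Both examples start with "what", but only the first opens with a pair from the table, which is why the combination matters more than the single word. The lookup is a set membership test, so classification takes effectively constant time regardless of how long the utterance is.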
Now if you've read this far and haven't fallen asleep, you may be asking, "Well, isn't this just a complicated if-then structure?" And, you'd
be correct in your question. It is, in fact, just such a structure. But apparently, it is one that is functionally active in my organic
intuitions of language as a native speaker of English, and possibly yours as well. Try it out for yourself. I do not argue that if-then
structures are not present in consciousness; I argue that the apprehension of language may in fact be full of them, but the conscious processes
that come into play after linguistic utterances have been received and processed, cannot merely be such structures, because there is not enough
RAM in the universe to hold all the possible permutations of what may be received, and replied to, to allow useful interaction. Something
else must be developed. And so, I did.
Initial block diagram of the basic robot's systems.
But back to the initial understanding of what is heard by the system... I now had a system that could translate speech to text, and actually
understand if the speech was a question, a statement, something that required further processing, or something that was nonsensical, or that
it possibly mis-heard/mis-transcribed. I went on to add several other categories because I found more and more linguistic structures in English
that had very formal patterns - Conversational Postulates, There Is/There Are, statements, etc., all of which have recognizable signatures and
can be formally coded for.
As I went on, I realized that speech contained an enormous number of contextual references, and I began building context-recognizing functions
into my scripts. Over time, I realized that I needed both a short-term (ongoing conversation) and a long-term (previous conversations) memory,
that allowed both analysis (yes, this time neural nets) and immediate reference, to what had come before, so I added in those logging functions.
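As a rough illustration of the two memories, the sketch below keeps the ongoing conversation in one list and archives finished conversations in another, with a lookup that checks the current conversation first. The class, method and field names are hypothetical, and a real system would persist the long-term log to disk instead of holding it in memory.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str
    kind: str   # e.g. "question" or "statement"


class ConversationMemory:
    def __init__(self):
        self.short_term = []   # turns in the ongoing conversation
        self.long_term = []    # list of previous conversations

    def log(self, speaker, text, kind):
        self.short_term.append(Turn(speaker, text, kind))

    def end_conversation(self):
        # Move the finished conversation into long-term memory.
        if self.short_term:
            self.long_term.append(self.short_term)
            self.short_term = []

    def last_mention(self, word):
        """Most recent turn containing the word, current conversation first."""
        word = word.lower()
        conversations = [self.short_term] + self.long_term[::-1]
        for conversation in conversations:
            for turn in reversed(conversation):
                if word in turn.text.lower().split():
                    return turn
        return None


memory = ConversationMemory()
memory.log("Ann", "I bought a kayak", "statement")
memory.end_conversation()
memory.log("Ann", "where is it", "question")
print(memory.last_mention("kayak"))   # found in the previous conversation
```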
What began as an effort to simply identify utterances slowly became an effort to respond usefully to those utterances. But I nearly immediately
discovered that I was operating what I have come to call a "content free" system. While the system could identify the nature of the utterance,
references to similar utterances, the type of response required (or if one was required at all), I had no real knowledge bases from which to respond.
Initially, this was addressed in a very limited fashion by adding in local functions (the system could answer questions about the date, time, etc.),
a dictionary (the system could then answer "what is a/an" if the words were in the dictionary), and limited questions about the nature and content
of ongoing and previous conversations. This was most unsatisfying. I was up against the simple nature of anything learning about the world -
it had to learn a little at a time. Fortunately, we have the ability now to load in vast knowledge bases of basic, fixed data - but that still
isn't the same as a conscious, fluidic response. More development continued...
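The stopgap described above, local functions plus a dictionary lookup, can be sketched like this. The phrasings matched, the one-entry dictionary and the reply formats are all assumptions for illustration.

```python
import datetime
import re

# Toy dictionary; a real system would load a full lexicon.
DICTIONARY = {
    "gimbal": "a pivoted support that lets an object rotate about an axis",
}


def answer(question, now=None):
    """Answer from local functions or the dictionary, else return None."""
    now = now or datetime.datetime.now()
    q = question.lower().strip(" ?.!")
    if q in ("what time is it", "what is the time"):
        return now.strftime("It is %H:%M.")
    if q in ("what is the date", "what is today's date"):
        return now.strftime("Today is %Y-%m-%d.")
    match = re.fullmatch(r"what is an? (\w+)", q)
    if match:
        word = match.group(1)
        if word in DICTIONARY:
            return f"{word.capitalize()}: {DICTIONARY[word]}."
    # The question was recognized, but there is no knowledge to answer it.
    return None


print(answer("What is a gimbal?"))
print(answer("What is a zebra?"))
```

The `None` case is the "content free" problem in miniature: the form of the question is understood, yet there is nothing to say in reply.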
Language is NOT the Only Game in Town
Stepping away from one task, on to another, is often very useful. One needs vision...
As I started to come up against the harder issues in language processing, I took a breather to focus more on computer vision functions.
I already had good face recognition running, and the system could verbally greet people it saw by name, as well as be introduced to new people
that it would subsequently remember by name. I started to realize that many of the language processing functions could be informed and
augmented by the visual system's functions. And then two separate events came to pass that would synergistically energize the project.
A Crowd Supply campaign called StereoPi hit the web, allowing
one to do with a single Raspberry Pi Compute Module 3+ what I was designing to do with two separate Raspberry Pi boards. This single
project allowed me to connect both of the cameras in my robot's eyes to a single processor and create the depth-mapping and face
identification/learning/memory functions that I thought would have to be done on multiple systems. Then, a project in MagPi Magazine brought
Google's Coral Edge TPU Accelerator to my attention, with a project designed to make a "teachable Raspberry Pi", that could be taught to recognize
objects with the simple push of a button.
Now since I had a much more robust system running already than just some pushbuttons and LEDs, I undertook recoding the basic system in the
article to scale it way up, allowing it to begin the learning process with voice commands, and the recognition process with voice commands as well.
I soon built an internal library of about 160 objects that the system recognized that were of interest to me personally, unlike the library
built into object detection systems like YOLO, etc. I was able to teach the robot to recognize pine cones, and firewood. Yes - my robots will be
doing a lot of work around my property for me as they develop.
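One common way to build a teachable recognizer of this kind is to store an embedding vector for each taught example and label new images by their nearest stored neighbour. The sketch below assumes that approach, which may differ from both the article's code and the robot's. The embeddings are made-up three-number lists standing in for the output of a vision model, and the command phrasings, class name and distance threshold are hypothetical.

```python
import math


class TeachableRecognizer:
    """Nearest-neighbour classifier over embedding vectors."""

    def __init__(self):
        self.examples = []   # (label, embedding) pairs

    def learn(self, label, embedding):
        self.examples.append((label, list(embedding)))

    def recognize(self, embedding, max_distance=1.0):
        best_label, best_dist = None, max_distance
        for label, stored in self.examples:
            # Euclidean distance between the stored and the new embedding.
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(stored, embedding)))
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label


def handle_command(recognizer, command, embedding):
    """Route a transcribed voice command to learning or recognition."""
    words = command.lower().strip(" ?.!").split()
    if words[:3] == ["this", "is", "a"] and len(words) > 3:
        label = " ".join(words[3:])
        recognizer.learn(label, embedding)
        return f"Learned {label}."
    if words == ["what", "is", "this"]:
        label = recognizer.recognize(embedding)
        return f"That is a {label}." if label else "I do not know that yet."
    return None


robot = TeachableRecognizer()
handle_command(robot, "This is a pine cone", [0.9, 0.1, 0.0])
handle_command(robot, "This is a log", [0.0, 0.2, 0.9])
print(handle_command(robot, "What is this?", [0.85, 0.15, 0.05]))
```

Because learning only appends an example, a new object can be taught in the time it takes to say its name, with no retraining step.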
More to come............. much more.......... typing fingers are numb.