Artificial Intelligence and Robotics Converge

Xanatos Research into Robotic Cognition and Cognitive Architectures
for Artificial General Intelligence

These are not your father's robots: An Introspection from the Founder
And a shoutout to other Westworld fans - nobody ever said that you can't have fun while doing real research!

For the Coders:

GitHub Code Archive

All the non-proprietary code currently running in the robot hosts. The cognitive architecture and language processor are available only under a signed NDA. Contact xanatos@xanatos.com for more info.
Special thanks to my friend Scott Walker of Walker Machine for his impeccable work in fabricating my intricate and complex designs used in these robots.

Back in 2015, while working on several programs designed to interact with humans in a human way, I began to wonder about the possibility of human-level cognition in software-based systems. And so, I set out to discover for myself whether this was possible. More specifically, I wanted to answer the questions posed by the possibility of sentient, as well as sapient (basically, conscious) systems.

Thus began a quest that has spanned more than three years, thousands of hours, and tens of thousands of lines of code. I have reached my conclusions, but the journey is not complete. There are myriad challenges, and every one requires us to develop a significant level of introspection into our own human working processes. As I have learned more and more about how to make robots be "human", I have learned perhaps even more about what it means to be human, and how human cognition itself works. The one conclusion I will reveal at the outset is this: neural networks and machine learning, for all the hype around them and for all their immense usefulness in many fields, are not in and of themselves the path to Artificial General Intelligence (AGI). To build AGI requires the same processes that a child uses to learn. We must create a personally navigable (personally meaning: to the AGI system) representation of the world and of the understandings of those who are influential to it. But to build this, we must first build the methods to discern the world, and the meaningful information presented by it, in a way that allows us to further process and internalize it.

A Constellation of Complexity


Initial development of face recognition and mechanical face tracking with the gimballed eyes.

How do you, the reader, perceive the world? Certainly through sight, hearing, feeling, taste, scent... but we also have these same senses represented internally. You can "see" an image in your mind. You can also hear, feel, smell and recreate tastes within the sphere of your mind.

In order for any system - organic, or silicon - to begin to attain anything close to what we consider consciousness, it must be at least capable of apprehending the world around it in a useful manner. So the very initial conditions that must be satisfied in the hardware/software environment of any system that would host anything even close to a conscious cognitive architecture are those that will allow a useful apprehension of the world around it, namely, sight, hearing, feeling of some type; taste and scent are debatable. And certainly some methodology for putting out into that world some creation of its own, based on its processing and interpretations of its senses of the world around it - for example, speech. Subsequent generations could even include what could be called, art.

So, my robotic designs initially required these at the minimum in order to couch a cognitive architecture capable of rising above the current level of "chatbots" that were horrible at, well, basically anything resembling any form of coherent conversation.

And so I designed a robust system that incorporated each of the following systems as the foundational layer to support a cognitive architecture:


Early development of the motor functions for the eyes, head and neck.
  • Sight: Dual High-Res Cameras feeding to OpenCV for visual processing that included, in real time:
    • Face Recognition with "liveness" verification
    • Robust Object Detection for literally thousands of ordinary objects
    • Motion Detection
    • Stereoscopic Depth Mapping (Depth Perception)
  • Hearing: Quadrophonic Microphones to allow:
    • Speaker Identification (Speaker Diarization)
    • Directional Location of audio phenomena
    • Speech Recognition (Speech to Text)
  • Speech: A robust, non-robotic and modulatable (SSML) Text-to-Speech system
  • Motor Functions: Face Recognition and Object Recognition are useful, but being able to mechanically track faces and objects in the visual field is critical. In addition, motor functions add a layer of non-verbal speech augmentation that lends a very human touch to communication when properly synchronized with speech through embedded codes.
  • Interconnected Networked Communication Infrastructure: ZeroMQ.
    The ability to allow each of these distributed systems to function independently, yet with continual communications connectivity between each module, was critical to smooth operation. Weighted outputs allow the visual system to communicate immediately with the speech center if needed, or the visual or audio system to communicate with motor functions without intervening interpretations, roughly equivalent to the autonomic nervous system - reacting to loud sounds, sudden movements, etc.
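The weighted routing described above can be sketched roughly as follows. This is a minimal illustration that uses only Python's standard library as an in-process stand-in for ZeroMQ sockets; the channel names, the `Message` fields, and the 0.8 reflex threshold are all assumptions made for the example, not the robot's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass

REFLEX_THRESHOLD = 0.8  # assumed cut-off; not a value from the robot's code


@dataclass
class Message:
    source: str    # module that produced the event, e.g. "audio"
    kind: str      # event type, e.g. "loud_sound"
    weight: float  # urgency between 0.0 and 1.0


class Bus:
    """In-process stand-in for a publish/subscribe message fabric."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, channel, handler):
        self.subscribers[channel].append(handler)

    def publish(self, message):
        # Urgent events skip interpretation and go straight to the motors.
        channel = "motor" if message.weight >= REFLEX_THRESHOLD else "cognition"
        for handler in self.subscribers[channel]:
            handler(message)
        return channel


bus = Bus()
log = []
bus.subscribe("motor", lambda m: log.append(("motor", m.kind)))
bus.subscribe("cognition", lambda m: log.append(("cognition", m.kind)))

bus.publish(Message("audio", "loud_sound", 0.95))
bus.publish(Message("vision", "face_seen", 0.30))
print(log)
```

In a distributed build, each handler would live in its own process and the bus would be replaced by real sockets; the point here is only that the weight on a message decides whether it takes the "reflex" path or the interpreted one.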

With the above functions implemented, I had a robust system that would serve as a cradle for a cognitive architecture. That architecture could use these input and output functions by processing the inputs in light of ongoing experience and learning, and by generating responses and response potentials of increasing complexity and competency as experience increased. Some of this learning was neural-network based, but some was much simpler.

Creating a Brain is Easy; Creating a Mind is Not

Sometimes, Brawn is better than Brains


An early video update showing several functions, including speech recognition, speech output, motor functions and face recognition/learning progress

In order to create a robot that can learn about the world around it, as well as interact with people in its environment, it had to do more than simply recognize speech; it had to understand it. There is nothing inherent in a software-based system that understands human language. Speech Recognition uses statistical models and trained data to create very good text transcriptions of the speech it hears, but nowhere in there does the system understand one syllable of what is being spoken.

Programmers have become adept at creating programs that make it appear, to a human, that the computer understands them in limited interactions. But this is artistry; in truth, it is just a set of hard-coded structures that look at the input text derived from the heard speech and perform more-or-less complicated "if-then" functions on it. If the computer hears "hello", "hi", or "howdy", then it responds with "hello to you too". Still, at no point does the computer "know" what is being exchanged. It sees a binary representation of text that matches a pre-coded binary representation of other text, and it outputs a binary representation because it is told to do so by program code. Period.
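To make the point concrete, here is about the smallest possible version of such a hard-coded exchange. The function name and the punctuation handling are illustrative choices; the greeting words and the reply are the ones from the paragraph above.

```python
GREETINGS = {"hello", "hi", "howdy"}


def canned_reply(text):
    """Return a hard-coded reply when the input matches a known greeting."""
    word = text.strip().lower().rstrip("!.?,")
    if word in GREETINGS:
        return "hello to you too"
    return None  # anything else falls outside the script


print(canned_reply("Howdy!"))
```

Nothing in this code knows what a greeting is; it only compares one string against a short list of others.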

So I knew that, in order for a system to be truly functional in an autonomous, and generally functional sense, it needed to be able to understand more about what it was "hearing" than this. Many researchers are working on the problem of extracting actual meaning from received communication. Most of these efforts are centered around statistical analysis of enormous amounts of text corpora (and I really mean enormous - hundreds of gigabytes of text data - literally enough that if it was printed, single spaced on double sided paper, the stack of printed paper would stretch nearly 2,300 miles high) just to be able to determine if someone has asked it a question, or made a statement... or uttered complete word salad for that matter. It's great that we have this ability, and processing power, but this is NOT how a human mind learns.

And so, I set out to determine how a human mind actually understands things like - whether someone speaking to me has uttered a statement, or asked a question. And fortunately, I had a mind available to study - my own, of course.

I began with the task of simply telling a question from a statement - a task that researchers, as I mentioned above, are throwing enormous processing power and time at. TensorFlow and other neural network libraries are all the rage now, but they are not always the answer. Over the course of a single afternoon, I realized that my brain knew whether it was being asked a question, or hearing a statement, from the first two words of the utterance I heard. Furthermore, there was a very specific set of words that could start a question - but it was not just the usual who, what, where, why, when, how words, because those same words can start statements as well as questions. There was something more: it was the specific combinations of words. I formalized them into a spreadsheet, which I then converted into the first piece of code I wrote for the language processing portion of my cognitive architecture pyramid. See the image below.

This first step proved immensely successful at classifying input utterances with near-instantaneous processing time. Of course, it required further tweaking to compensate for the oddities of human speech, such as injecting words like "well" or "so" at the beginning of a statement or question, and I added further code to detect those and function correctly within that context as well.
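A rough sketch of the idea, for readers who want to try it: classify an utterance from its first two words after skipping any leading fillers. The opener table below is a small hypothetical sample invented for this example; the actual spreadsheet of word combinations is not reproduced here, and the function name is likewise illustrative.

```python
import re

# Hypothetical two-word openers; the real table is larger and not shown here.
QUESTION_OPENERS = {
    ("what", "is"), ("what", "are"), ("who", "is"), ("where", "is"),
    ("when", "is"), ("why", "is"), ("how", "do"), ("how", "is"),
    ("do", "you"), ("does", "it"), ("is", "it"), ("are", "you"),
    ("can", "you"), ("will", "you"),
}
FILLERS = {"well", "so"}  # lead-in words named in the text


def classify(utterance):
    """Label an utterance as 'question', 'statement' or 'unknown'."""
    words = re.findall(r"[a-z']+", utterance.lower())
    # Skip conversational fillers before looking at the opener.
    while words and words[0] in FILLERS:
        words = words[1:]
    if len(words) < 2:
        return "unknown"
    opener = (words[0], words[1])
    return "question" if opener in QUESTION_OPENERS else "statement"


for text in ("What is the time", "What you said is true", "Well, so, are you ready"):
    print(text, "->", classify(text))
```

Note that the lookup never relies on a question mark, which transcribed speech often lacks. "What is ..." and "What you ..." share a first word but land in different classes because the pair, not the single word, is the key.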

Now if you've read this far and haven't fallen asleep, you may be asking, "Well, isn't this just a complicated if-then structure?" And, you'd be correct in your question. It is, in fact, just such a structure. But apparently, it is one that is functionally active in my organic intuitions of language as a native speaker of English, and possibly yours as well. Try it out for yourself. I do not argue that if-then structures are not present in consciousness; I argue that the apprehension of language may in fact be full of them, but the conscious processes that come into play after linguistic utterances have been received and processed, cannot merely be such structures, because there is not enough RAM in the universe to hold all the possible permutations of what may be received, and replied to, to allow useful interaction. Something else must be developed. And so, I did.

Initial block diagram of the basic robot's systems.

But back to the initial understanding of what is heard by the system... I now had a system that could translate speech to text, and actually understand if the speech was a question, a statement, something that required further processing, or something that was nonsensical, or that it possibly mis-heard/mis-transcribed. I went on to add several other categories because I found more and more linguistic structures in English that had very formal patterns - Conversational Postulates, There Is/There Are, statements, etc., all of which have recognizable signatures and can be formally coded for.

As I went on, I realized that speech contained an enormous number of contextual references, and I began building context-recognizing functions into my scripts. Over time, I realized that I needed both a short-term (ongoing conversation) and a long-term (previous conversations) memory that allowed both analysis (yes, this time with neural nets) and immediate reference to what had come before, so I added in those logging functions.
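One simple way to picture the two memories is a bounded buffer for the ongoing conversation next to an append-only log of everything heard. The sketch below assumes that shape; the class name, field names, and the default buffer size are hypothetical, and the neural-net analysis mentioned above is not shown.

```python
import time
from collections import deque


class ConversationMemory:
    """Short-term buffer for the current conversation plus a long-term log."""

    def __init__(self, short_term_size=20):
        # Oldest entries fall out of the short-term buffer automatically.
        self.short_term = deque(maxlen=short_term_size)
        self.long_term = []

    def log(self, speaker, text, timestamp=None):
        entry = {
            "time": time.time() if timestamp is None else timestamp,
            "speaker": speaker,
            "text": text,
        }
        self.short_term.append(entry)
        self.long_term.append(entry)
        return entry

    def recall(self, keyword):
        """Return long-term entries whose text mentions the keyword."""
        keyword = keyword.lower()
        return [e for e in self.long_term if keyword in e["text"].lower()]


memory = ConversationMemory(short_term_size=2)
memory.log("human", "I stacked the firewood yesterday")
memory.log("robot", "Good to know")
memory.log("human", "What is the time")
print(len(memory.short_term), len(memory.long_term))
print([e["text"] for e in memory.recall("firewood")])
```

A persistent version would write the long-term log to disk or a database so that previous conversations survive a restart.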

What began as an effort to simply identify utterances slowly became an effort to respond usefully to those utterances. But I almost immediately discovered that I was operating what I have come to call a "content free" system. While the system could identify the nature of the utterance, references to similar utterances, and the type of response required (or whether one was required at all), I had no real knowledge bases from which to respond. Initially, this was addressed in a very limited fashion by adding in local functions (the system could answer questions about the date, time, etc.), a dictionary (the system could then answer "what is a/an" if the words were in the dictionary), and limited questions about the nature and content of ongoing and previous conversations. This was most unsatisfying. I was up against the simple nature of anything learning about the world - it had to learn a little at a time. Fortunately, we have the ability now to load in vast knowledge bases of basic, fixed data - but that still isn't the same as a conscious, fluid response. More development continued...
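The early, limited responders described above might look something like this sketch: a couple of local functions for date and time, plus a dictionary lookup for "what is a/an" questions. The accepted phrasings, the reply formats, and the single dictionary entry are placeholders invented for the example.

```python
import datetime
import re

# Placeholder entry; a real system would load a full dictionary.
DICTIONARY = {"pine cone": "the seed-bearing cone of a pine tree"}


def answer(question, now=None):
    """Answer date, time and 'what is a/an X' questions, else return None."""
    now = now or datetime.datetime.now()
    q = question.lower().strip(" ?.!")
    if q in ("what time is it", "what is the time"):
        return now.strftime("It is %H:%M.")
    if q in ("what is the date", "what is today's date"):
        return now.strftime("Today is %Y-%m-%d.")
    match = re.match(r"what is an? (.+)", q)
    if match and match.group(1) in DICTIONARY:
        return DICTIONARY[match.group(1)]
    return None  # no knowledge base covers this question


fixed = datetime.datetime(2020, 1, 2, 3, 4)
print(answer("What time is it?", now=fixed))
print(answer("What is a pine cone?"))
print(answer("Why is the sky blue?"))
```

The `None` result for anything outside these few patterns is exactly the "content free" gap the paragraph describes.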

Language is NOT the Only Game in Town

Stepping away from one task, on to another, is often very useful. One needs vision...

As I started to come up against the harder issues in language processing, I took a breather to focus more on computer vision functions. I already had good face recognition running, and the system could verbally greet people it saw by name, as well as be introduced to new people that it would subsequently remember by name. I started to realize that many of the language processing functions could be informed and augmented by the visual system's functions. And then two separate events came to pass that would synergistically energize the project: StereoPi and Coral.

A Crowd Supply campaign called StereoPi hit the web, allowing one to do with a single Raspberry Pi Compute Module 3+ what I was designing to do with two separate Raspberry Pi boards. This single project allowed me to connect both of the cameras in my robot's eyes to a single processor and create the depth-mapping and face identification/learning/memory functions that I thought would have to be done on multiple systems. Then, a project in MagPi Magazine brought Google's Coral Edge TPU Accelerator to my attention: a design for a "teachable Raspberry Pi" that could be taught to recognize objects with the simple push of a button.
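For readers unfamiliar with stereo depth mapping, the underlying geometry for a rectified camera pair is depth = focal length × baseline / disparity. The sketch below applies that standard relation to a single pixel; the function name and the numbers in the usage line are made up for illustration and are not measurements from the robot.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Estimate distance in metres from the pixel shift between two views.

    disparity_px:    horizontal shift of the same point between left and right images
    focal_length_px: camera focal length expressed in pixels
    baseline_m:      distance between the two camera centres in metres
    """
    if disparity_px <= 0:
        # No measurable shift: the point is effectively at infinity.
        return float("inf")
    return focal_length_px * baseline_m / disparity_px


# Illustrative values only: 700 px focal length, 6.5 cm between the "eyes".
print(depth_from_disparity(20, 700, 0.065))
```

Nearby objects shift a lot between the two views and distant ones barely move, which is why a small disparity maps to a large depth.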

Now since I had a much more robust system running already than just some pushbuttons and LEDs, I undertook recoding the basic system in the article to scale it way up, allowing it to begin the learning process with voice commands, and the recognition process with voice commands as well. I soon built an internal library of about 160 objects that the system recognized that were of interest to me personally, unlike the library built into object detection systems like YOLO, etc. I was able to teach the robot to recognize pine cones, and firewood. Yes - my robots will be doing a lot of work around my property for me as they develop.
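One common way to build a "teach it by showing it" recognizer is to store a feature vector for each taught example and label new inputs by their nearest stored neighbour. The sketch below assumes that approach and is not taken from the article's code or from the robot; the class name is hypothetical, and the two-element vectors stand in for the much longer embeddings a vision model would produce.

```python
import math


class TeachableClassifier:
    """Nearest-neighbour recognizer over taught (label, embedding) examples."""

    def __init__(self):
        self.examples = []

    def teach(self, label, embedding):
        # Called when a voice command (or button press) names what is in view.
        self.examples.append((label, list(embedding)))

    def recognize(self, embedding):
        if not self.examples:
            return None  # nothing has been taught yet
        label, _ = min(self.examples, key=lambda ex: math.dist(ex[1], embedding))
        return label


robot = TeachableClassifier()
robot.teach("pine cone", [0.9, 0.1])
robot.teach("firewood", [0.1, 0.8])
print(robot.recognize([0.8, 0.2]))
```

Adding a new object is just one more `teach` call, which is what makes this style of learning fit naturally behind a spoken "this is a ..." command.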

More to come............. much more.......... typing fingers are numb.


© 2025 Xanatos. All rights reserved in all media.