
๐ŸŽ™๏ธAre Memories the missing half of AI brain?

Why AI Intelligence is Nothing Without Visual Memory | Shawn Shen on the Future of Embodied AI

🎄SPECIAL OFFER 🎄:
Join us in celebrating Turing Post's journey so far. From our first 17 readers in May 2023, Turing Post has grown to reach over 205,000 AI professionals across platforms in 2025.

And honestly, you can't even imagine what we're preparing next. This is your chance to lock in a discounted annual subscription before prices go up and access expands.

Get access to the best analysis. Hurry up – the offer is time-limited

Now to the show: We often confuse intelligence with memory. But in the human brain, reasoning and retrieval are separate processes. Shawn Shen – founder of Memories.ai and former researcher at Meta Reality Labs – believes that for AI to move from chatbots to the physical world – into robots, glasses, and wearables – it must stop trying to memorize everything in the model weights and start "seeing" like a human.

He explains why the next leap in AI is about long-term visual persistence and breaks down why today's Transformers struggle with object permanence and how his team is building the "hippocampus" for embodied agents. Shawn also told me that they are developing a world model architecture to solve contextual awareness! Super interesting. You'll learn a lot.

Subscribe to our YouTube channel, or listen to the interview on Spotify / Apple

In this episode of Inference, we get into:

  • "Encode for Machine": Why we need to stop compressing video for human eyes and start compressing it for AI logic.

  • The critical architectural split between the Intelligence Model (creative, generative) and the Memory Model (retrieval, factual).

  • Why Transformers don't understand physics or time – and why World Models are the answer.

  • The brutal engineering constraints of running infinite visual memory on-device without Wi-Fi.

  • How to build a system that remembers your life without becoming a surveillance nightmare.

  • Why The Mom Test is the most important book for researchers transitioning to product builders.

We also discuss the "Era of I Don't Know" in research, the limitations of current context windows, and the future where your smart glasses actually know if you've been eating healthy this week.

This is a conversation about the missing half of the AI brain. Watch it!

This is a free edition. Upgrade if you want to receive our deep dives directly in your inbox. For a few more days it's only $56/year (20% OFF)

We prepared a transcript for your convenience. But as always – watch the full video, subscribe, like and leave your feedback. It helps us grow on YouTube and bring you more insights ⬇️

Ksenia Se: Hello everyone. Today I'm joined by Shawn Shen, who left Meta Reality Labs with a provocative thesis: that intelligence without long-term visual memory isn't really intelligence. He co-founded Memories.ai to build the world's first Large Visual Memory Model. Welcome, Shawn!

Shawn Shen: Thank you so much, Ksenia.

Ksenia: You guys emerged from stealth just recently in July of this year, right? That was when I first noticed you, and I thought about all the sci-fi movies that I've watched, because in these movies, the memory, the recognition, the search through memories – it's all solved. But we're not in the movies. So, do you think that visual memory specifically is the thing that unlocks intelligence?

Shawn: I think in the future, intelligence is going to be embodied. What that means is that your robots, smart glasses, smart wearables, your cameras – cameras on the street, cameras in the home, anything that has a physical camera – that is going to be the real embodied intelligence. For that intelligence to work, it has to have visual memories, because you can't have an AI that's able to see but not remember what it has seen. So they need to have this visual memory.

We also take it from a very first-principles point of view. We think about how human memory works, how human cognition systems work. Our cognition system works by two things, right? One is intelligence. One is memory. They're totally separate and in parallel. And if we are going to build better AI similar to how humans work, then we also need to split this into intelligence and memory. So memories will make intelligence better.

And what are memories? I define them as visual memories, because most of our memories are actually visual. For example, if I ask you: When was the last time you had an amazing burger? How many times have you been to the gym? Are you eating healthy this week? If I ask you these questions, you would usually recall what you have eaten, the gym, et cetera – all visually, all vividly. And then you recall that and you reason on top of that using your intelligence.

So what we build is this encoding and retrieval process. We call this memories.

Beyond Transformers: Building the Large Visual Memory Model

Ksenia: Fascinating, but super hard to build. Your model is called Large Visual Memory Model. As far as I understand, it's still based on Transformer architecture, right? Or have you moved beyond that?

Shawn: It's still based on the Transformer architecture. There are some other architectures – for example, Mamba – but the Transformer is still the only architecture that reliably scales up with data, so it is still the best option. We're still using the Transformer architecture, but we're training this model for very different purposes than other models that leverage Transformers.

For example, large language models are trained to be creative, to be intelligent. And sometimes they have hallucinations, right? But when we're training our Large Visual Memory Model, it is essentially an all-in-one embedding model. What that means is that it turns all the videos and actually all the context – including audio, text, actions, everything – into the same embedding space.

And we don't need any creativity. We just want to turn all those different multimodal data into embeddings that can ultimately be losslessly reconstructed back into the original formats. So this is how we train the model. The way that we train intelligence models and the way we train memory models are fundamentally different.
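To give readers a concrete picture of what an "all-in-one embedding model" objective can look like, here is a minimal sketch: every modality is projected into one shared embedding space and trained purely to reconstruct its input, with no generative objective. The architecture, dimensions, and loss below are our illustrative assumptions, not Memories.ai's actual code.

```python
# Illustrative sketch only: a shared-space multimodal autoencoder trained with a
# pure reconstruction loss (no generation, no "creativity").
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 256  # size of the shared embedding space (assumed)

class ModalityCodec(nn.Module):
    """Encoder/decoder pair for one modality's pre-extracted feature vectors."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.GELU(),
                                     nn.Linear(512, EMB_DIM))
        self.decoder = nn.Sequential(nn.Linear(EMB_DIM, 512), nn.GELU(),
                                     nn.Linear(512, feat_dim))

class AllInOneEmbedder(nn.Module):
    """Maps video, audio, text (and anything else) into the same embedding space."""
    def __init__(self, feat_dims: dict):
        super().__init__()
        self.codecs = nn.ModuleDict({m: ModalityCodec(d) for m, d in feat_dims.items()})

    def forward(self, batch: dict) -> torch.Tensor:
        loss = torch.tensor(0.0)
        for modality, feats in batch.items():
            codec = self.codecs[modality]
            emb = codec.encoder(feats)              # shared-space embedding
            recon = codec.decoder(emb)              # reconstruct the original features
            loss = loss + F.mse_loss(recon, feats)  # reconstruction-only objective
        return loss

# Toy training step on random stand-in features.
model = AllInOneEmbedder({"video": 1024, "audio": 128, "text": 768})
batch = {"video": torch.randn(4, 1024),
         "audio": torch.randn(4, 128),
         "text": torch.randn(4, 768)}
model(batch).backward()
```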

Ksenia: I see. But when you talk about memories, you talk more about the brain. And when you talk about the model, you talk more about… sort of a database. So in your understanding, is the ultimate memory model closer to the brain or closer to the database?

Shawn: So the ultimate memory model is composed of two parts. It's a system. It's not an end-to-end model itself. Just think about how human memory works. Our memories are based on retrieval and reconstruction.

When we see things – for example, I'm 29 years old, I have 29 years of visual memories – I index the world in real time and store all of this in my memories. And whenever you ask me a question, I retrieve from it. So the human brain also operates like a system.

This system is composed of two very important key components. One is this indexing and encoding process. The other is the retrieval process. So what we built is also these two key parts. We built state-of-the-art indexing models, which is our Large Visual Memory Model. And then we also built this very AI-native, video-native, robust retrieval system. And on top of that, people can build different multimodal AI agents to enable different applications for embodied AI.
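As a toy illustration of the two-part system he describes (indexing/encoding plus retrieval), here is a minimal memory store: an embedding function indexes incoming segments, and retrieval is a nearest-neighbour search over what was stored. The embedding function and data are stand-ins we made up for the example, not the real Large Visual Memory Model.

```python
# Toy two-part memory system: index (encode and store) + retrieve (nearest neighbour).
import numpy as np

def toy_embed(text: str, dim: int = 32) -> np.ndarray:
    """Stand-in bag-of-words hashing embedder; a real system would encode video here."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    return vec

class VisualMemoryStore:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.keys, self.payloads = [], []

    def index(self, segment: str, metadata: dict) -> None:
        """Encode a segment and store it as a memory."""
        self.keys.append(self.embed_fn(segment))
        self.payloads.append(metadata)

    def retrieve(self, query: str, k: int = 3):
        """Return the k memories most similar to the query (cosine similarity)."""
        q = self.embed_fn(query)
        keys = np.stack(self.keys)
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-9)
        return [(self.payloads[i], float(sims[i])) for i in np.argsort(-sims)[:k]]

store = VisualMemoryStore(toy_embed)
store.index("ate a burger at a diner on friday", {"event": "burger", "day": "friday"})
store.index("treadmill session at the gym on monday", {"event": "gym", "day": "monday"})
print(store.retrieve("when did I last go to the gym", k=1))
```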

The World Model Evolution

Ksenia: But do you feel that Transformer kind of limits you in that? Do you look for other architectures? Is there research for you there?

Shawn: Yes, of course. We're actually looking at leveraging world models to build this embedding model and indexing model. Because what we found out is that, for example, the current Transformers or current model architectures don't really understand the world and don't really understand physics. They don't really have temporal contextual awareness.

A very simple example: when we're reaching for something, getting something, and then when we put things back, how does it know that, for example, these AirPods are still my AirPods? And also how does it know that Ksenia, when Ksenia changes an outfit or changes hair color or even turns her back to me, is still Ksenia? How do you enable this contextual awareness on the temporal side of objects, of humans? The Transformer architecture doesn't support that.

So now we're developing a world model architecture to solve that. We haven't launched it, but this is our current state-of-the-art work.

Ksenia: And you develop it all in-house? From scratch, your own models?

Shawn: Yes, we develop all of the models in-house because we are a bunch of research scientists coming from all different big lab backgrounds. So we have pretty long-standing experience building models, and especially building multimodal AI models ourselves.

From Neuroscience to Customer Needs

Ksenia: I wonder how your process is going. Do you have a big picture of the brain and say, "We solved this part of it, now we're moving forward"? How do you plan what to solve next?

Shawn: That's a really good question. We actually started by working on how human memory works, how the human brain works. I started as a computational neuroscientist, and then I moved to computer vision. At the same time, if you search for "AI memory human survey," you'll see a survey made by us called Human-inspired Perspectives: A Survey on AI Long-term Memory. We specifically looked at how human memory works. And we mapped out this big blueprint of this huge memory system.

Now we've solved most of that part. And I think the next thing that is leading us is not from the technology – it's actually from the demand.

For us phase zero was studying how human memory works, and then we made this technology inspired by the human brain. We made this technology, we launched it. And now, since it's been three, four months since we launched, we've gotten a lot of traction. We have over 100 inbounds and we've signed 60 or 70 contracts already.

All of those inbounds really told us what the market needs, what customers want. From those needs, we figured out: Okay, human tracking is very important. Human identification is important. Object tracking is important. Object identification is important. How to build a knowledge graph around humans is important. All of this is actually coming from customer needs. And now our next phase is to build models around that part.

Adjacent Players, Different Goals

Ksenia: Well, this demand is showing that it's really needed. But there are other companies trying to solve the memory problem, right? Like Twelve Labs. What is different? How do they see memory?

Shawn: I think the whole market is really, really new, right? Visual intelligence, making AI able to see the world just like humans do – the whole market is super new. So there aren't too many players. A lot of people will think these are similar players, but we are adjacent players. There are a number of different adjacent players around us, but we all have different approaches.

We probably start similarly – we're all building embedding models, we're all building indexing models, we're all building video search systems – but we serve different purposes. I'm not exactly sure what the big picture of Twelve Labs is, but at least from what I see, their ideal customer profile is more aligned with studios and publishers.

But what we want to build is not that. What we want to build is actually a really human-like visual memory system to power the future embodied AI, especially humanoids, to have human-like visual memories. Because we have different ultimate goals, the technology approach will be very different as well. And also the target market will be quite different too.

And if that is our goal, we have to make all the models, or at least all the indexing and data storage, on-device. That is why we partnered with Qualcomm. You can't really imagine that future humanoids will need Wi-Fi every second and upload all their videos to the cloud. That would create a huge burden on bandwidth. It's just not feasible.

So it is critical to make the model very, very small and customized for that embodied AI scenario, so that it can run at low power consumption on local chips and process all the videos in real time. That is our goal. And then, how to remember people's faces, how to build the personal profile, and how to build the knowledge graph between all the different users, et cetera – those are all specifically designed to suit the future embodied AI use case.

Encode for Machine, Not for Human

Ksenia: There are so many things to unfold here. First of all, video is brutally computationally expensive. Then making it on-device is brutally hard. The whole disconnection from Wi-Fi thing is also not solved yet. Let's start maybe – how do you rethink compression and indexing just to make long-term memory viable at all?

Shawn: Exactly. So, humans have been inventing compression technologies for videos, right? But all those compression technologies have encoding and decoding processes. Those compression technologies are made for humans. So when videos are encoded, they can be decoded back into video formats that are readable by, or understandable by, humans.

But in the future, do we actually need to make those encoding processes so that they can be decoded back to be understood by humans? No. We're now specifically designing compression and indexing processes specifically for AI. We call this encode for machine, not encode for human.

So when we train models, we actually jointly train the compression and indexing algorithms and models together so that the ultimate output is directly processed by AI itself.
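To sketch what jointly training compression and indexing could mean in practice, here is a toy joint objective: one term asks the compressed code to preserve the features the AI needs (rather than human-viewable pixels), and a contrastive term asks the same code to be retrievable from a matching query. The modules, dimensions, and losses are our illustrative assumptions, not the company's method.

```python
# Toy joint training step: compression loss + retrieval loss on the same code.
import torch
import torch.nn as nn
import torch.nn.functional as F

compressor = nn.Linear(1024, 64)      # clip features -> compact machine-oriented code
decoder = nn.Linear(64, 1024)         # "decode" back to AI features, not to pixels
query_embedder = nn.Linear(768, 64)   # query features -> the same code space

clip_feats = torch.randn(8, 1024)     # stand-in per-clip visual features
query_feats = torch.randn(8, 768)     # stand-in features of queries matching each clip

code = compressor(clip_feats)
# Compression term: the code must keep whatever the AI needs to rebuild its own view.
compression_loss = F.mse_loss(decoder(code), clip_feats)
# Indexing term: clip i's code should be closest to query i's embedding (InfoNCE-style).
logits = F.normalize(code, dim=-1) @ F.normalize(query_embedder(query_feats), dim=-1).T
retrieval_loss = F.cross_entropy(logits / 0.07, torch.arange(8))
(compression_loss + retrieval_loss).backward()  # one joint optimization step
```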

What to Remember, What to Forget

Ksenia: How do you decide what should be remembered and what should be forgotten? Is it programmed? Is it learned?

Shawn: What to remember, what to forget – that is less of the fundamental question here. The fundamental question, as I mentioned, is the indexing problem, the retrieval problem.

In terms of what to forget, what to remember, there are two approaches, and we're actually doing both.

One approach is how current text-based memory companies are building this. For example, Mem0, Letta, et cetera. People are building agentic memory agents or context engineering to determine what to forget, what to remember, to write different pipelines for episodic memory, procedural memory, et cetera. All of them are context engineering – you manage different contexts in different modules. So that is one way to manage all those contexts. And I think currently it's probably pretty efficient to manage in that way.
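For contrast, here is a minimal context-engineering sketch in the spirit of that first approach: hand-written modules and rules decide what is written to episodic versus procedural memory, what gets dropped, and what is packed into the model's context. This is our own illustration, not the Mem0 or Letta API.

```python
# Toy context-engineering memory: explicit rules decide what to remember and forget.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class AgentMemory:
    episodic: deque = field(default_factory=lambda: deque(maxlen=50))  # bounded: old events are forgotten
    procedural: dict = field(default_factory=dict)                     # stable how-to knowledge

    def remember(self, item: str, kind: str = "episodic") -> None:
        if kind == "procedural":
            name = item.split(":", 1)[0]
            self.procedural[name] = item          # keep only the latest version of a procedure
        elif len(item.split()) > 2:               # rule: drop trivially short events
            self.episodic.append(item)

    def build_context(self, query: str, budget: int = 5) -> list:
        """Pack procedures plus the most relevant recent events into the context window."""
        words = query.lower().split()
        relevant = [e for e in self.episodic if any(w in e.lower() for w in words)]
        return list(self.procedural.values()) + relevant[-budget:]

memory = AgentMemory()
memory.remember("make_coffee: grind beans, 92C water, 30 second bloom", kind="procedural")
memory.remember("user went to the gym on monday")
memory.remember("ok")                             # forgotten by the length rule
print(memory.build_context("did I go to the gym this week"))
```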

But what we're thinking is: in the long run, when we're training this indexing plus this world model-based indexing, all of this should be automatically dealt with. It will have emergent capabilities, similar to how ChatGPT had emergent capabilities when it came out. It will automatically have this core ability of memorizing things.

For example, ChatGPT is essentially also a memory model of the whole internet data, right? It's essentially a reconstruction of the whole internet data. It understands, it remembers what is important, but also it forgets what is less important. It remembers the logic, but it also forgets some of the fine details. And I think we should do the same as well.

So this is the direction we're going in. We haven't fully figured this out, but I think that's the direction we're trying to build. What is currently lacking is not only the technical approach, but also the data itself. There isn't any large-scale data on the internet that captures human life – and especially the visual memories of human life, labeled with all the actions in it. We just lack that data.

But I think as long as we have enough of that data, there will be a way to train a real world model-based Large Visual Memory Model that has this fully emergent capability for the forgetting mechanism.

The Data Feedback Loop

Ksenia: Do you think what you're doing with the model will help with data, that it will provide it also to the physical AI companies?

Shawn: Yeah, that's right. I think the model ultimately will serve a product, and our model will serve products such as AI glasses, robotics. Once the model is powerful enough to make a good product – for example, a good AI glasses product or a good AI wearable product or a good humanoid product – and people start using it, there will be more and more data generated from those products.

We're in the first phase of making a good model and then making a good product. And once we have good products, then we have good quality of data. Once we have good quality of data, people can contribute their data. Of course, we need to be very careful about privacy, but people can contribute their own data to this research project to make this real intelligence with memories. Our technical approach goes phase by phase.

Current Bottlenecks

Ksenia: That's a very big dream, a massive idea that you're building. You mentioned some of them, but just to understand: What are the real bottlenecks that you immediately face? Can you list them?

Shawn: Of course. When we're actually deploying our models onto devices, there's always a trade-off between how good the model is and how much efficiency you want it to have. Even though our models can run on device without any problem, when you're trying to make a good product, you still need to sacrifice a lot of efficiency because there are a lot of other models running on the device.

So what we need to do is make the model smaller, much smaller, and also make the model run faster and be more accurate.

And now that we're getting all these actual needs – like human identification, object tracking – these are capabilities that we haven't trained into the models before. Now what we're trying to do is train those features into the models so that the models can have this contextual awareness of the human face, of the objects – not only the human face, but the whole human.

So they can recognize a human across time not only by the face, but also by the walking style, the dressing style, or the way they behave. We are training those features into the model itself.

Another example could be multimodal speaker diarization or speaker recognition. Previously, speaker recognition could only be done using audio. Now we are training this model to have multimodal speaker diarization so that it can recognize who is speaking what and when, not just from the audio, but also from the visuals.
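As a toy illustration of what multimodal diarization can mean (ours, not the company's method): per-window voice-similarity scores from audio are fused with visual cues such as lip motion, and the highest fused score decides who is speaking in each window.

```python
# Toy fusion: audio voice-match scores + visual lip-motion scores -> who speaks when.
import numpy as np

def diarize(audio_scores: np.ndarray, visual_scores: np.ndarray,
            speakers: list, audio_weight: float = 0.6) -> list:
    """Each score matrix is [time_windows x speakers]; returns one speaker per window."""
    fused = audio_weight * audio_scores + (1.0 - audio_weight) * visual_scores
    return [speakers[i] for i in fused.argmax(axis=1)]

speakers = ["Ksenia", "Shawn"]
audio = np.array([[0.8, 0.2],   # window 1: voice sounds like speaker 1
                  [0.4, 0.6],   # window 2: ambiguous audio
                  [0.5, 0.5]])  # window 3: overlapping speech
visual = np.array([[0.9, 0.1],  # lips clearly moving on speaker 1
                   [0.2, 0.8],  # lips moving on speaker 2
                   [0.1, 0.9]])
print(diarize(audio, visual, speakers))  # -> ['Ksenia', 'Shawn', 'Shawn']
```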

All of this became very important once we talked to our customers. We're now working with some of the top AI smart glasses companies, some of the top smartphone companies, and some of the top humanoid companies, and they all have very similar needs around human tracking, human identity, speaker recognition, speaker diarization, object tracking, and object recognition. The needs are similar, and now we're trying to train those features into our models.

Useful, Not Creepy

Ksenia: What you're building is a system that basically never forgets what it sees, right? Which is very helpful for many industries, as you just mentioned. But it also can be very dangerous. So how do you design a memory system that is useful but not creepy?

Shawn: When we are training the models – again, we are not training an intelligence model that can be creative, that can output something that you didn't say.

Ksenia: But what about the emergent possibilities of it?

Shawn: The emergent possibilities are only about the things that you forget or basically how it organizes different information, but it won't give you wrong information. Of course, there could be some potential hallucinations, but we are trying to keep hallucinations to a minimum.

We're not trying to replicate the whole human memory system because humans can forget, right? Do you need machines to forget? Sometimes yes, sometimes no. But machines totally have the capability not to forget.

So when we are building this indexing, all these models, the model itself is only transforming all the videos and contexts, all the different contexts, into another context format, but in the latent space. So it doesn't generate new things. It doesn't generate new ideas, even with the emergent capabilities. The data will simply be arranged in different formats and structures in this latent space. But no new data and no new ideas will be generated.

So compared to intelligence, it is much safer. Whereas intelligence itself – we're afraid of intelligence because when there's more emergent capability coming out of intelligence, it can easily create new ideas, new things that you don't know. It's a dilemma, right? You want the intelligence to be really, really powerful, to even think beyond humanity, but then at the same time, you want it to think in a safe way as well.

So that itself is a dilemma. But for our models, for the memory models, it's purely about indexing – how to index all the context into another type of context, but stored in the latent space that can be understood by AI. So it inherently doesn't have a lot of risk compared with intelligence models or language models.

Powering Intelligence, Not Replacing It

Ksenia: So you're not trying to solve intelligence.

Shawn: Yeah, we're not trying to build intelligence. We're trying to power intelligence to be an intelligence that really, really understands the world just like humans do.

For example, do you really need a superintelligence in your home to do laundry? Probably not. You probably need a general intelligence, but then one that really, really understands you. Really understands you and your family. Knows your hobbies and knows how to do laundry well according to your own lifestyle or working style.

Or imagine that you have smart glasses, AI glasses. OpenAI is going to launch their hardware in the future as well, right? Those products are all going to have cameras and microphones that are going to record your days. They're going to have contextual awareness of your day and then help you with your productivity.

Do you need them to be, again, superintelligent that can do everything? Or do you need them to be really, really understanding of you, personalized to your own context?

Ksenia: I mean, I might not need it, but it seems that if you're able to solve this problem, if you're able to solve the memory problem, there will always be someone who will be eager to build on it. Because the memory problem is so complicated. But when you solve it, building on it towards superintelligence or AGI – I'm very confused about the terms because everyone has their own description for it – but that's what I guess makes it a little scary. Like if you solve this problem, then people can use it.

Shawn: That's true. I mean, all technologies are double-edged swords. All technologies, when they come to where they exceed people's imagination, naturally become very scary because we just don't know what's next.

But yeah, when people are building things on top of it for the wrong purpose, it can be scary. I would say so as well. But what we want is to really do good for humanity, to make the future AI not just focus on superintelligence itself, but actually focus on humanity. They can really develop their own personality because they have memories. They can really understand you so that they are super personalized. And eventually they can even create a bond between you and them. That is the goal that we're targeting.

For example, ChatGPT's intelligence is more than enough to me. What I need is ChatGPT to really understand what I'm doing on a daily basis. So I don't need to type in all the prompts or the context or the background of the activity, but then just say, "Hey, do this for me," and then it instantly understands what I want to do. I think that is something that I want – to really decrease the friction between how AI communicates with humans or how humans communicate with AI.

AGI: A Practical Perspective

Ksenia: You seem to be a very practical person. Do you ever think about AGI and superintelligence just because you're in this industry? What is it?

Shawn: As you said, people have different definitions of superintelligence or AGI. Some people even say superintelligence is already here, that in some areas AI is better than some humans. I mean, ChatGPT is better than me in terms of medicine, in terms of law, et cetera, right?

But it doesn't really mean that it is better than humans overall. For example, planes can fly, but it doesn't mean that they can replace me. I think they're just different tools. I treat them as tools. I don't treat them as another type of being.

Ksenia: You don't think they're sentient. You think it's a tool.

Shawn: That's right. That's right. I also don't tend to overthink it too much. I want to solve the problems I want to solve.

But yeah, I think AGI could be something that is coming in the next 10 to 20 years. But I don't think AGI is really coming anytime soon. As Ilya also mentioned in one of the podcasts previously, the scaling law is not working and we are going back to the research era. The real AGI – at least to most people's understanding of what should be AGI – the AGIs in those sci-fi movies will probably come in the next 10 to 20 years, but not anytime soon.

Ksenia: Yeah, I called it the era of "I don't know," which is super resourceful for researchers.

Shawn: Yeah, exactly. And it's the same for world models. What are world models? Again, world models are still in the research phase. People have different definitions of world models. Some define world models as 3D reasoning models. But others, like Yann LeCun, define world models differently.

Ksenia: Yeah, multimodal is definitely one of the new frontiers for breakthroughs, I believe.

Shawn: Yeah, it's very new. It's super, super new.

The Mom Test: Building What People Want

Ksenia: My last question is always about books. What is the book that formed you or maybe influenced you just recently?

Shawn: I come from a research background. I come from an academia background. I always thought that it's about technology push, not market pull. So we started with the technology. We built our product around the technology, and I think that was right to do in the first place.

But now, once we launched our product – which is a wrapper of the technology – and when we get a lot of customers, I think what is really important is to listen to the customers, trying to figure out what they need. Now we're building the product around the customer needs, not around the technology. Or now we're building the technology from the customer needs too.

There was a book that I think a lot of salespeople and product managers have probably read, called The Mom Test.

Ksenia: The Mom Test, okay. Sounds good.

Shawn: It's basically about how to really ask good questions to your target users about what are the things that they really need, what are the things that they really want, and the things that they will potentially pay for. So this is quite an important book for me, in my current stage.

Because we want to build a technology that people want. We don't want to build a technology that people don't want.

Ksenia: But sometimes people just don't know what they want.

Shawn: That is true. The Mom Test actually gives some good examples around that. You don't ask, "What do you want?" You ask, "What is the way that you're currently doing this?" You don't tell them the solution straight away. You keep the question solution-neutral. You ask them about their current ways of doing this, the current solutions. "How do you find this?"

So you don't propose your solution first, because if you propose a solution first, either they will say, "Oh, it's cool," but they don't really use it. Or they will say, "Why is a car better than a horse? A horse is fast enough."

But ultimately, I think it's building something that people want, people need. Really make a deep dive into what they really need by asking good questions. I think that is really important. But also at the same time, what we're trying to be good at is how to execute as fast as we can, how to build a quick MVP as fast as we can. Then we can also present to the users, "Hey, just use it. Let me see how it works for you."

The Mom Test was a good book to help me understand all that.

Ksenia: That's very interesting. I haven't heard about this book. So perfect suggestion.

Do leave a comment
