fwrog-e: The Chatbot That Almost Hopped

My initial sketches for fwrog-e

What on Earth Was fwrog-e?

Picture this: a virtual pet frog that could actually chat with you. That was fwrog-e in a nutshell - my crack at mixing fun visuals with some clever AI trickery. It was a wild ride, full of "aha" moments and head-scratching challenges.

While fwrog-e never quite became the chatty amphibian of my dreams (blame it on the learning curves, some pesky tech limitations and my own ambition), it was one heck of a learning experience. Let's dive into this ribbiting adventure!

The Big Idea

I wanted to create something simple yet engaging:

  • A wake word to get its attention (like "Hey froggy!")
  • The ability to pick up on emotions and spice up our conversations
  • fwrog-e to evolve and learn from our chats
  • And of course, a touch of humour to keep things light

Easy enough, right? Well, not quite. But hey, that's half the fun!

Peeking Under the Lily Pad: The Tech Stuff

In a nutshell, here's how I pieced fwrog-e together.

How fwrog-e's guts worked

fwrog-e was more than just a pretty face. Its inner workings were a complex dance of several technologies. Let's walk through the key components:

Real-time Communication

At the heart of fwrog-e was a WebSocket server that enabled real-time, bidirectional communication. This allowed for smooth, instant interactions between the user and our amphibious friend.

import asyncio

from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

# Per-session state, keyed by a generated user ID;
# generate_unique_id, SocketManager and WhispersSession are the repo's own helpers
queues = {}
socketManagers = {}
whisperings = {}

@app.websocket("/")
async def whispers(websocket: WebSocket):
    user_id = generate_unique_id()
    queues[user_id + "transcription"] = asyncio.Queue()
    queues[user_id + "reasoning"] = asyncio.Queue()
    socketManagers[user_id] = SocketManager()
    whisperings[user_id] = WhispersSession(
        queues[user_id + "transcription"],
        queues[user_id + "reasoning"],
        socketManagers[user_id])
    await socketManagers[user_id].connect(websocket)

    # Long-running workers, spawned once per session: one drains audio into
    # transcriptions, the other turns transcriptions into replies
    whisper_task = asyncio.create_task(
        whisperings[user_id].process_audio_data_from_queue(websocket))
    response_task = asyncio.create_task(
        whisperings[user_id].get_ai_response(websocket))

    try:
        # Main WebSocket loop: feed each incoming audio chunk to the pipeline
        while True:
            recording = await websocket.receive_json()
            await queues[user_id + "transcription"].put(recording['audio_data'])
    except WebSocketDisconnect:
        # Handle disconnection and clean up resources
        socketManagers[user_id].disconnect(websocket)
        whisper_task.cancel()
        response_task.cancel()

This setup allowed for handling multiple users simultaneously, each with their own unique session.
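
To make that concrete, here's a tiny hypothetical smoke-test client. The port and payload shape are assumptions on my part; the repo's frontend defines the real contract:

import asyncio
import json

import websockets

async def talk_to_frog():
    # Assumes the FastAPI server above is running locally on port 8000
    async with websockets.connect("ws://localhost:8000/") as ws:
        # One second of 16 kHz silence as a stand-in for real microphone audio
        samples = [0] * 16000
        await ws.send(json.dumps({"audio_data": samples}))
        print(await ws.recv())  # fwrog-e's broadcast payload

asyncio.run(talk_to_frog())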

The Ears: Audio Processing and Transcription

fwrog-e wasn't just a good listener; it was an excellent one, thanks to its sophisticated audio processing pipeline.

from typing import Optional

import numpy as np
import torch
from scipy.io.wavfile import write

async def process_audio_data(audio_data: np.ndarray, prompt: Optional[str] = None) -> Optional[str]:
    config = __AudioProcessingConfig()

    # Voice activity detection: convert int16 PCM to float32 and ask the
    # VAD model how confident it is that this chunk contains speech
    audioFloat32 = _int2float(audio_data)
    new_confidence = config.vad_model(torch.from_numpy(audioFloat32), 16000).item()

    if new_confidence > 0.5:
        # Persist the chunk as a 16 kHz WAV file for the Whisper model
        write("example.wav", 16000, audio_data.astype(np.int16))
        inputs = {
            'audio': open("example.wav", "rb"),
            'model': "small",
            'transcription': "plain text",
            'translate': False,
            'language': 'en',
            'temperature': 0,
            'suppress_tokens': "-1",
            'condition_on_previous_text': True,
            # ... other parameters ...
        }
        if prompt:
            # Feed the previous transcription back in as context
            inputs['prompt'] = prompt

        result = config.version.predict(**inputs)
        # Drop results Whisper itself flags as probable non-speech
        if result['segments'] and result['segments'][0]['no_speech_prob'] < 0.5:
            return result['transcription']
    return None

This pipeline used OpenAI's Whisper model for transcription, with some clever optimisations like voice activity detection to improve accuracy and efficiency. It could handle various languages and even used prompts to improve context understanding.
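
One piece the snippet leans on but doesn't show is the `_int2float` helper. Here's a minimal sketch of what it does, following the usual Silero VAD convention of 16-bit PCM in, normalised float32 out (the repo's version may differ in detail):

import numpy as np

def _int2float(sound: np.ndarray) -> np.ndarray:
    # Scale int16 samples into the [-1.0, 1.0] float range the VAD model expects
    abs_max = np.abs(sound).max()
    sound = sound.astype(np.float32)
    if abs_max > 0:
        sound *= 1.0 / 32768.0
    return sound.squeeze()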

The Brain: AI Response Generation

fwrog-e's charm came from its ability to generate contextual and engaging responses. Here's a peek at its brain:

import re

from asgiref.sync import sync_to_async

async def get_ai_response(self, websocket):
    # Block until the transcription side hands over a finished utterance
    transcription = await self.reasoningQueue.get()
    history_array = []
    chat_agent = await sync_to_async(ChatAgent)(history_array=history_array)

    try:
        reply = chat_agent.agent_executor.run(input=transcription)

        # The prompt asks the model to append a language code like "(en-US)";
        # pull it out, then strip it from the reply
        pattern = r'\(([a-z]{2}-[A-Z]{2})\)'
        match = re.search(pattern, reply)
        language = 'en-US'  # default
        if match:
            language = match.group(1)
            reply = re.sub(pattern, '', reply)

        data = {
            "status": "broadcasting",
            "reasoning": reply.strip(),
        }
    except ValueError as inst:
        print('ValueError:', inst)
        data = {
            "status": "broadcasting",
            "reasoning": "Sorry, there was an error processing your request.",
        }

    # Push the reply (or the fallback) out over the WebSocket
    await self.socketManager.broadcast(websocket, data)
    self.reasoningQueue.task_done()

This function used LangChain together with OpenAI's GPT models to generate responses, taking the conversation history into account and even detecting the language of each reply.
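
For context, here's a minimal sketch of how a `ChatAgent` like this could have been wired up with the LangChain APIs of that era. The class body is my reconstruction rather than the repo's exact code, and the empty tool list, temperature, and history format are all assumptions:

from langchain.agents import initialize_agent
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory

class ChatAgent:
    def __init__(self, history_array=None):
        # "chat_history" matches the {chat_history} slot in the prompt suffix below
        memory = ConversationBufferMemory(memory_key="chat_history")
        for turn in history_array or []:
            # Seed the agent's memory with any earlier turns (assumed format)
            memory.save_context({"input": turn["human"]}, {"output": turn["ai"]})
        self.agent_executor = initialize_agent(
            tools=[],
            llm=OpenAI(temperature=0.7),
            agent="conversational-react-description",
            memory=memory,
            verbose=True,
        )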

The Secret Sauce: Prompt Engineering 101

The real magic of fwrog-e was in its "brain training." Here's the prompt suffix that made it tick:

suffix = f"""
The person's name that you are interacting with is {human_prefix}.
Please be entertaining and respectful towards them. The current date is {date}.
Questions that refer to a specific date or time period will be interpreted
relative to this date. After you answer the question, you MUST determine which
language your answer is written in, and append the language code to the end of
the Final Answer, within parentheses, like this (en-US). Begin!
Previous conversation history:
{{chat_history}}
New input: {{input}}
{{agent_scratchpad}}
"""

  • Personal Touch: {human_prefix} made each chat feel unique to the user.
  • Time-Aware: {date} kept fwrog-e living in the now.
  • Memory: {{chat_history}} helped fwrog-e remember our chats.
  • Polyglot: The language-code instruction let fwrog-e switch languages on the fly.
  • Personality Plus: Telling it to "be entertaining" gave fwrog-e its charm.
  • Think It Through: {{agent_scratchpad}} let fwrog-e mull things over before replying.

This setup was my attempt to balance specific instructions with enough wiggle room for fun, varied conversations.
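
To show where a suffix like this actually lands, here's a rough sketch of the wiring with the same era's LangChain; again, the empty tool list is an assumption, and LangChain supplies its own prefix and format instructions ahead of the suffix:

from langchain.agents import ConversationalAgent

# The agent fills {chat_history}, {input} and {agent_scratchpad} at run time
prompt = ConversationalAgent.create_prompt(
    tools=[],
    suffix=suffix,
    input_variables=["input", "chat_history", "agent_scratchpad"],
)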

The Hiccups and What-Ifs

The big dream? Running fwrog-e entirely on your device. But back then, the tech just wasn't there for smooth sailing. Performance issues and privacy concerns were real head-scratchers from the get-go.

Some of the tougher nuts to crack:

  • Speeding up the speech-to-text magic
  • Finding the sweet spot between a smart AI and quick replies
  • Dealing with spotty internet and API hiccups
  • Figuring out how to shift the whole stack, or at least most of it, onto the device

Personal missteps:

  • Overcomplicating things (oops!) by chasing cutting-edge technologies that were, in hindsight, too new for the job (double oops! Looking at you, SolidJS...)
  • Not focusing on the core experience and trying to do too much at once

Note to self: Keep it simple, don't reinvent the wheel, and always have a backup plan!

Beyond my own missteps, fwrog-e also faced several engineering challenges:

  1. Performance optimisation: Balancing the complexity of the AI models with real-time response requirements was a constant struggle.
  2. Error handling: As you can see in the code, robust error handling was crucial to maintain a smooth user experience.
  3. Scalability: While the WebSocket implementation allowed for multiple users, scaling this to hundreds or thousands of simultaneous users would require additional architectural considerations.

Hindsight is 20/20

fwrog-e was a fun ride, but it never quite made it out of the test tube. These days, with tech like whisper.cpp, slimmed-down Whisper models, and tiny AI brains that can run on your phone (thanks to llama.cpp), a fully on-device fwrog-e doesn't seem so far-fetched anymore.
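
To give a taste of how much simpler the transcription leg is now, here's a minimal on-device sketch using the open-source whisper package with a small checkpoint (illustrative only; whisper.cpp would shrink the footprint further):

import whisper

# Load a small checkpoint once; inference runs entirely on-device
model = whisper.load_model("small")

# The same kind of chunk the old pipeline shipped off to a hosted API
result = model.transcribe("example.wav", language="en", temperature=0)
print(result["text"])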

Who knows? Maybe it's time to dust off the old frog and give it another go. For now, it's a fond reminder of a project that was just a hop, skip, and a jump ahead of its time.

Fancy a deeper dive, or want to breathe new life into fwrog-e yourself? The code's all yours to play with! Hop over to GitHub and have at it: https://github.com/Beenyaa/fwrog-e. (Fair warning: there's little to no documentation, so you'll need to do some digging, reverse engineering, and debugging to get it up and running. If you take on the challenge, I'm only an email or DM away!)

Last Ribbit

This project last hopped in 2022, and to this day few projects have managed to blend visual interaction with conversational AI quite like fwrog-e's ambitious vision. The closest any company has come is Rabbit with its R1, though if you keep up with AI news, you'll know that turned into a dumpster fire of a company.

All in all, fwrog-e was a fun experiment that pushed the boundaries of what was (and still is) possible with AI and real-time communication. It was a great learning experience and a reminder that the best projects are often the ones that push you out of your comfort zone. Given how far AI and real-time communication have come since, it's exciting to imagine fwrog-e hopping around again one day, better than ever.

Until then, keep on hopping!
