Picture this: a virtual pet frog that could actually chat with you. That was fwrog-e in a nutshell - my crack at mixing fun visuals with some clever AI trickery. It was a wild ride, full of "aha" moments and head-scratching challenges.
While fwrog-e never quite became the chatty amphibian of my dreams (blame it on the learning curves, some pesky tech limitations, and my own ambition), it was one heck of a learning experience. Let's dive into this ribbiting adventure!
I wanted to create something simple yet engaging:
Easy enough, right? Well, not quite. But hey, that's half the fun!
In a nutshell, here's how I pieced fwrog-e together.
fwrog-e was more than just a pretty face. Its inner workings were a complex dance of various technologies. Let's look at some of the key components:
At the heart of fwrog-e was a WebSocket server that enabled real-time, bidirectional communication. This allowed for smooth, instant interactions between the user and our amphibious friend.
```python
@app.websocket("/")
async def whispers(websocket: WebSocket):
    try:
        user_id = generate_unique_id()
        queues[user_id + "transcription"] = asyncio.Queue()
        queues[user_id + "reasoning"] = asyncio.Queue()
        socketManagers[user_id] = SocketManager()
        whisperings[user_id] = WhispersSession(
            queues[user_id + "transcription"],
            queues[user_id + "reasoning"],
            socketManagers[user_id],
        )
        await socketManagers[user_id].connect(websocket)

        # Main WebSocket loop
        while True:
            recording = await websocket.receive_json()
            recording = recording['audio_data']
            await queues[user_id + "transcription"].put(recording)
            whisper_task = asyncio.create_task(
                whisperings[user_id].process_audio_data_from_queue(websocket))
            response_task = asyncio.create_task(
                whisperings[user_id].get_ai_response(websocket))
    except WebSocketDisconnect:
        # Handle disconnection and clean up resources
        socketManagers[user_id].disconnect(websocket)
```
This setup allowed for handling multiple users simultaneously, each with their own unique session.
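The per-user bookkeeping above (fresh queues and a socket manager keyed by a generated ID) can be sketched in isolation. `SessionRegistry` and its method names are my own illustration, not actual fwrog-e classes:

```python
import asyncio
import uuid


class SessionRegistry:
    """Tracks per-user queues so concurrent users never share state."""

    def __init__(self):
        self.transcription_queues = {}
        self.reasoning_queues = {}

    def register(self) -> str:
        # Generate a unique ID and create fresh queues for this user.
        user_id = uuid.uuid4().hex
        self.transcription_queues[user_id] = asyncio.Queue()
        self.reasoning_queues[user_id] = asyncio.Queue()
        return user_id

    def unregister(self, user_id: str) -> None:
        # Drop all state for a disconnected user.
        self.transcription_queues.pop(user_id, None)
        self.reasoning_queues.pop(user_id, None)


registry = SessionRegistry()
alice = registry.register()
bob = registry.register()
assert alice != bob  # each connection gets its own session
registry.unregister(alice)
```

The important design point is isolation: because every piece of mutable state lives under the user's ID, one noisy connection can't corrupt another user's transcription pipeline.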
fwrog-e wasn't just a good listener; it was an excellent one, thanks to its sophisticated audio processing pipeline.
```python
import numpy as np
import torch
from scipy.io.wavfile import write


async def process_audio_data(audio_data: np.ndarray, prompt: str = None) -> str:
    config = __AudioProcessingConfig()
    audioFloat32 = _int2float(audio_data)
    # Voice activity detection: confidence that this chunk contains speech
    new_confidence = config.vad_model(torch.from_numpy(audioFloat32), 16000).item()
    if new_confidence > 0.5:
        write("example.wav", 16000, audio_data.astype(np.int16))
        inputs = {
            'audio': open("example.wav", "rb"),
            'model': "small",
            'transcription': "plain text",
            'translate': False,
            'language': 'en',
            'temperature': 0,
            'suppress_tokens': "-1",
            'condition_on_previous_text': True,
            # ... other parameters ...
        }
        if prompt:
            inputs['prompt'] = prompt
        result = config.version.predict(**inputs)
        if result['segments'] and result['segments'][0]['no_speech_prob'] < 0.5:
            return result['transcription']
    return None
```
This pipeline used OpenAI's Whisper model for transcription, with some clever optimisations like voice activity detection to improve accuracy and efficiency. It could handle various languages and even used prompts to improve context understanding.
fwrog-e's charm came from its ability to generate contextual and engaging responses. Here's a peek at its brain:
```python
async def get_ai_response(self, websocket):
    transcription = await self.reasoningQueue.get()
    history_array = []
    chat_agent = await sync_to_async(ChatAgent)(history_array=history_array)
    try:
        reply = chat_agent.agent_executor.run(input=transcription)
        # Extract the appended language code and clean up the reply
        pattern = r'\(([a-z]{2}-[A-Z]{2})\)'
        match = re.search(pattern, reply)
        language = 'en-US'  # default
        if match:
            language = match.group(1)
            reply = re.sub(pattern, '', reply)
        data = {
            "status": "broadcasting",
            "reasoning": reply.strip(),
        }
        await self.socketManager.broadcast(websocket, data)
    except ValueError as inst:
        print('ValueError:\n')
        print(inst)
        reply = "Sorry, there was an error processing your request."
    self.reasoningQueue.task_done()
This function used a combination of langchain and OpenAI's GPT model to generate responses, taking into account the conversation history and even detecting the language of the response.
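The language-tag convention (a trailing `(xx-XX)` code appended by the model) can be exercised on its own; the sample reply string below is made up for illustration:

```python
import re

LANG_PATTERN = r'\(([a-z]{2}-[A-Z]{2})\)'


def split_language(reply: str, default: str = 'en-US') -> tuple[str, str]:
    """Pull the appended language code out of a model reply."""
    match = re.search(LANG_PATTERN, reply)
    language = match.group(1) if match else default
    cleaned = re.sub(LANG_PATTERN, '', reply).strip()
    return cleaned, language


text, lang = split_language("Ribbit! Lovely weather today. (en-GB)")
assert lang == 'en-GB'
assert text == "Ribbit! Lovely weather today."
```

Falling back to `en-US` matters in practice: models don't always follow formatting instructions, so the downstream text-to-speech step still gets a usable language code even when the tag is missing.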
The real magic of fwrog-e was in its "brain training." Here's a peek at what made it tick:
```python
suffix = f"""
The person's name that you are interacting with is {human_prefix}.
Please be entertaining and respectful towards them. The current date is {date}.
Questions that refer to a specific date or time period will be interpreted relative to
this date. After you answer the question, you MUST
determine which language your answer is written in, and append
the language code to the end of the Final Answer, within
parentheses, like this (en-US). Begin! Previous conversation
history:
{{ chat_history }}
New input: {{ input }}
{{ agent_scratchpad }}
"""
```
- `{human_prefix}` made each chat feel unique to the user.
- `{date}` kept fwrog-e living in the now.
- `{chat_history}` helped fwrog-e remember our chats.
- `{agent_scratchpad}` let fwrog-e mull things over before replying.

This setup was my attempt to balance specific instructions with enough wiggle room for fun, varied conversations.
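To see how those placeholders come together, here's a plain-string stand-in for what happens when the prompt is rendered (the name and date are invented, and the real code used langchain's templating rather than raw `str.format`):

```python
from datetime import date as date_cls

human_prefix = "Alex"          # hypothetical user name
date = date_cls(2022, 11, 5)   # example date

# The f-string pass fills {human_prefix}/{date} immediately; the doubled
# braces survive as single-brace slots for the agent to fill each turn.
suffix = f"""
The person's name that you are interacting with is {human_prefix}.
The current date is {date}.
Previous conversation history:
{{chat_history}}
New input: {{input}}
{{agent_scratchpad}}
"""

rendered = suffix.format(
    chat_history="Human: hi\nAI: ribbit!",
    input="What's the weather?",
    agent_scratchpad="",
)
assert "Alex" in rendered
assert "{chat_history}" not in rendered
```

The two-stage fill is the key trick: static facts are baked in once per session, while conversation state is re-substituted on every turn.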
The big dream? Running fwrog-e entirely on your device. But back then, the tech just wasn't there for smooth sailing. Performance issues and privacy concerns were real head-scratchers from the get-go.
Some of the tougher nuts to crack:
Personal downfalls:
Note to self: Keep it simple, don't reinvent the wheel, and always have a backup plan!
Despite these advanced features, fwrog-e faced several challenges:
fwrog-e was a fun ride, but it never quite made it out of the test tube. These days, with tech like whisper.cpp, slimmed-down Whisper models, and tiny AI brains that can run on your phone (thanks to llama.cpp), a fully on-device fwrog-e doesn't seem so far-fetched anymore.
Who knows? Maybe it's time to dust off the old frog and give it another go. For now, it's a fond reminder of a project that was just a hop, skip, and a jump ahead of its time.
Fancy a deeper dive, or want to breathe new life into fwrog-e yourself? The code's all yours to play with! Hop over to GitHub and have at it: https://github.com/Beenyaa/fwrog-e (beware, though: there's little to no documentation, so you'll need to do some digging, reverse engineering, and debugging to get it up and running. If you do take on the challenge, I'm only an email or DM away!)
This project last hopped in 2022, and to this day few projects have managed to blend visual interaction with conversational AI quite like fwrog-e's ambitious vision. The closest a company has come is Rabbit with their R1, though if you keep up with AI news, you'll know that has turned into a dumpster fire of a company.
All in all, fwrog-e was a fun experiment that pushed the boundaries of what was and is possible with AI and real-time communication. It was a great learning experience and a reminder that sometimes the best projects are the ones that push you out of your comfort zone. Given the advancements in AI and real-time communication, it's exciting to think about what the future holds for projects like fwrog-e. Who knows? Maybe one day we'll see fwrog-e hopping around again, better than ever.
Until then, keep on hopping!