Complete Refactor

2025-03-30 01:28:07 -04:00
parent 158afc78c7
commit 46be33b10a
28 changed files with 723 additions and 3081 deletions

@@ -1,154 +1,71 @@
# CSM
# csm-conversation-bot
**2025/03/13** - We are releasing the 1B CSM variant. The checkpoint is [hosted on Hugging Face](https://huggingface.co/sesame/csm_1b).
## Overview
The CSM Conversation Bot is an application for real-time voice conversations with an AI assistant. It processes incoming audio streams, transcribes spoken input to text, generates a response with the Llama 3.2 model, and converts that response back to audio.
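The turn-by-turn flow described above can be sketched as follows. This is a minimal illustration and not the project's implementation: `ConversationPipeline`, `transcribe`, `respond`, and `synthesize` are hypothetical names, and the three stages are passed in as plain callables so the sketch runs without any model installed.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ConversationPipeline:
    """Hypothetical wiring of three stages: speech-to-text, LLM, text-to-speech."""

    transcribe: Callable[[bytes], str]  # audio in -> user text
    respond: Callable[[List[Tuple[str, str]], str], str]  # (history, user text) -> reply
    synthesize: Callable[[str], bytes]  # reply text -> audio out
    history: List[Tuple[str, str]] = field(default_factory=list)

    def handle_turn(self, audio_in: bytes) -> bytes:
        user_text = self.transcribe(audio_in)
        reply_text = self.respond(self.history, user_text)
        # Record both sides of the turn so later replies can use the context.
        self.history.append(("user", user_text))
        self.history.append(("assistant", reply_text))
        return self.synthesize(reply_text)


# Stand-in stages (assumptions for illustration only, no real models involved).
pipeline = ConversationPipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    respond=lambda history, text: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
reply_audio = pipeline.handle_turn(b"hello")
```

Keeping each stage behind a plain callable means a transcription service, the Llama 3.2 generator, or the text-to-speech service could in principle be swapped without touching the turn logic.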
---
CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ audio codes from text and audio inputs. The model architecture employs a [Llama](https://www.llama.com/) backbone and a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.
A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).
A hosted [Hugging Face space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.
## Requirements
* A CUDA-compatible GPU
* The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
* Similarly, Python 3.10 is recommended, but newer versions may be fine
* For some audio operations, `ffmpeg` may be required
* Access to the following Hugging Face models:
* [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
* [CSM-1B](https://huggingface.co/sesame/csm-1b)
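Some of the requirements above can be checked before installing anything. The following is a minimal sketch using only the standard library; `check_environment` is a hypothetical helper and not part of this repository. It checks only the Python version and whether `ffmpeg` is on the `PATH`.

```python
import shutil
import sys


def check_environment(min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ recommended, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # ffmpeg is only needed for some audio operations, so report it as a warning.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (needed for some audio operations)")
    return problems


for problem in check_environment():
    print("warning:", problem)
```

Once the Python dependencies are installed, `torch.cuda.is_available()` can be used in the same way to confirm that a CUDA GPU is visible.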
### Setup
```bash
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
# Disable lazy compilation in Mimi
export NO_TORCH_COMPILE=1
# You will need access to CSM-1B and Llama-3.2-1B
huggingface-cli login
## Project Structure
```
csm-conversation-bot
├── api
│ ├── app.py # Main entry point for the API
│ ├── routes.py # Defines API routes
│ └── socket_handlers.py # Manages Socket.IO events
├── src
│ ├── audio
│ │ ├── processor.py # Audio processing functions
│ │ └── streaming.py # Audio streaming management
│ ├── llm
│ │ ├── generator.py # Response generation using Llama 3.2
│ │ └── tokenizer.py # Text tokenization functions
│ ├── models
│ │ ├── audio_model.py # Audio processing model
│ │ └── conversation.py # Conversation state management
│ ├── services
│ │ ├── transcription_service.py # Audio to text conversion
│ │ └── tts_service.py # Text to speech conversion
│ └── utils
│ ├── config.py # Configuration settings
│ └── logger.py # Logging utilities
├── static
│ ├── css
│ │ └── styles.css # CSS styles for the web interface
│ ├── js
│ │ └── client.js # Client-side JavaScript
│ └── index.html # Main HTML file for the web interface
├── templates
│ └── index.html # Template for rendering the main HTML page
├── config.py # Main configuration settings
├── requirements.txt # Python dependencies
├── server.py # Entry point for running the application
└── README.md # Documentation for the project
```
### Windows Setup
## Installation
1. Clone the repository:
```
git clone https://github.com/yourusername/csm-conversation-bot.git
cd csm-conversation-bot
```
The `triton` package cannot be installed on Windows. Use `pip install triton-windows` instead.
2. Install the required dependencies:
```
pip install -r requirements.txt
```
## Quickstart
This script will generate a conversation between 2 characters, using a prompt for each character.
```bash
python run_csm.py
```
3. Configure the application settings in `config.py` as needed.
## Usage
1. Start the server:
```
python server.py
```
If you want to write your own applications with CSM, the following examples show basic usage.
2. Open your web browser and navigate to `http://localhost:5000` to access the application.
#### Generate a sentence
3. Use the interface to start a conversation with the AI assistant.
This will use a random speaker identity, as no prompt or context is provided.
## Contributing
Contributions are welcome! Please submit a pull request or open an issue for any enhancements or bug fixes.
```python
from generator import load_csm_1b
import torchaudio
import torch

# Pick the best available device: Apple MPS, then CUDA, then CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

generator = load_csm_1b(device=device)

# No context is given, so the speaker identity is random.
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

# torchaudio.save expects a (channels, samples) tensor on the CPU.
torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
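The `max_audio_length_ms=10_000` argument caps the generated clip at ten seconds. How many samples that is depends on `generator.sample_rate`. The sketch below shows the arithmetic with a hypothetical helper, `ms_to_samples`, and assumes a rate of 24,000 Hz purely for illustration; read the real value from `generator.sample_rate` at runtime.

```python
def ms_to_samples(duration_ms: int, sample_rate: int) -> int:
    """Convert a duration in milliseconds to a sample count at the given rate."""
    return duration_ms * sample_rate // 1000


# Assumed rate for illustration only; use generator.sample_rate in practice.
max_samples = ms_to_samples(10_000, 24_000)
print(max_samples)  # 240000
```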
#### Generate with context
CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker's utterance.
NOTE: The following example is instructional and the audio files do not exist. It is intended as an example for using context with CSM.
```python
import torchaudio
from generator import Segment

# `generator` is the object returned by load_csm_1b() in the previous example.

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load a clip and resample it to the generator's rate (assumes mono audio).
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# One Segment per prior utterance: transcript, speaker id, and audio.
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]

audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
## FAQ
**Does this model come with any voices?**
The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.
**Can I converse with the model?**
CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
**Does it support other languages?**
The model has some capacity for non-English languages due to data contamination in the training data, but output quality in those languages is likely to be poor.
## Misuse and abuse ⚠️
This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we **explicitly prohibit** the following:
- **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes.
By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.
---
## Authors
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
## License
This project is licensed under the MIT License. See the LICENSE file for more details.