Complete Refactor
@@ -1,154 +1,71 @@
# CSM
# csm-conversation-bot

**2025/03/13** - We are releasing the 1B CSM variant. The checkpoint is [hosted on Hugging Face](https://huggingface.co/sesame/csm_1b).
## Overview
The CSM Conversation Bot is an application for real-time voice conversations with an AI assistant. It processes incoming audio streams, transcribes speech to text, generates a response with the Llama 3.2 model, and converts that response back into audio.
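
The flow described above can be sketched as a single conversational turn. This is a minimal illustration under stated assumptions, not the project's actual code: `ConversationTurn`, `transcribe`, `generate_reply`, and `synthesize` are hypothetical names, and the three stages are passed in as plain callables so that real speech-to-text, Llama 3.2, and CSM components could be substituted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ConversationTurn:
    """Hypothetical helper: one pass of audio -> text -> reply -> audio."""

    transcribe: Callable[[bytes], str]  # speech-to-text stage (assumed interface)
    generate_reply: Callable[[str, List[Tuple[str, str]]], str]  # LLM stage
    synthesize: Callable[[str], bytes]  # text-to-speech stage
    history: List[Tuple[str, str]] = field(default_factory=list)

    def run(self, audio_in: bytes) -> bytes:
        user_text = self.transcribe(audio_in)
        reply_text = self.generate_reply(user_text, self.history)
        # Record the exchange so later turns can see earlier ones.
        self.history.append((user_text, reply_text))
        return self.synthesize(reply_text)


# Stand-in stages, used only to show the data flow.
turn = ConversationTurn(
    transcribe=lambda audio: audio.decode("utf-8"),
    generate_reply=lambda text, history: f"You said: {text}",
    synthesize=lambda text: text.encode("utf-8"),
)
reply_audio = turn.run(b"hello")
```

Keeping the stages as injected callables is a design choice made for this sketch only: it lets the data flow be exercised without loading any model.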

---

CSM (Conversational Speech Model) is a speech generation model from [Sesame](https://www.sesame.com) that generates RVQ audio codes from text and audio inputs. The model architecture employs a [Llama](https://www.llama.com/) backbone and a smaller audio decoder that produces [Mimi](https://huggingface.co/kyutai/mimi) audio codes.

A fine-tuned variant of CSM powers the [interactive voice demo](https://www.sesame.com/voicedemo) shown in our [blog post](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice).

A hosted [Hugging Face space](https://huggingface.co/spaces/sesame/csm-1b) is also available for testing audio generation.

## Requirements

* A CUDA-compatible GPU
* The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
* Similarly, Python 3.10 is recommended, but newer versions may be fine
* For some audio operations, `ffmpeg` may be required
* Access to the following Hugging Face models:
  * [Llama-3.2-1B](https://huggingface.co/meta-llama/Llama-3.2-1B)
  * [CSM-1B](https://huggingface.co/sesame/csm-1b)
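
Some of the requirements listed above can be checked from Python before installing anything else. The following is a rough sketch, not part of the repository: `check_environment` is a hypothetical helper, it only reports what it finds, and it does not verify access to the Hugging Face models.

```python
import shutil
import sys


def check_environment() -> dict:
    """Report which of the listed requirements appear to be satisfied."""
    report = {
        "python_3_10_or_newer": sys.version_info >= (3, 10),
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
    }
    try:
        # Imported lazily so the check still runs when PyTorch is not installed.
        import torch

        report["cuda_available"] = torch.cuda.is_available()
        report["cuda_version"] = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        report["cuda_available"] = False
        report["cuda_version"] = None
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(f"{name}: {value}")
```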

### Setup

```bash
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Disable lazy compilation in Mimi
export NO_TORCH_COMPILE=1

# You will need access to CSM-1B and Llama-3.2-1B
huggingface-cli login
## Project Structure
```
csm-conversation-bot
├── api
│   ├── app.py  # Main entry point for the API
│   ├── routes.py  # Defines API routes
│   └── socket_handlers.py  # Manages Socket.IO events
├── src
│   ├── audio
│   │   ├── processor.py  # Audio processing functions
│   │   └── streaming.py  # Audio streaming management
│   ├── llm
│   │   ├── generator.py  # Response generation using Llama 3.2
│   │   └── tokenizer.py  # Text tokenization functions
│   ├── models
│   │   ├── audio_model.py  # Audio processing model
│   │   └── conversation.py  # Conversation state management
│   ├── services
│   │   ├── transcription_service.py  # Audio to text conversion
│   │   └── tts_service.py  # Text to speech conversion
│   └── utils
│       ├── config.py  # Configuration settings
│       └── logger.py  # Logging utilities
├── static
│   ├── css
│   │   └── styles.css  # CSS styles for the web interface
│   ├── js
│   │   └── client.js  # Client-side JavaScript
│   └── index.html  # Main HTML file for the web interface
├── templates
│   └── index.html  # Template for rendering the main HTML page
├── config.py  # Main configuration settings
├── requirements.txt  # Python dependencies
├── server.py  # Entry point for running the application
└── README.md  # Documentation for the project
```

### Windows Setup
## Installation
1. Clone the repository:
```
git clone https://github.com/yourusername/csm-conversation-bot.git
cd csm-conversation-bot
```

The `triton` package cannot be installed on Windows. Instead use `pip install triton-windows`.
2. Install the required dependencies:
```
pip install -r requirements.txt
```

## Quickstart

This script will generate a conversation between two characters, using a prompt for each character.

```bash
python run_csm.py
```
3. Configure the application settings in `config.py` as needed.

## Usage
1. Start the server:
```
python server.py
```

If you want to write your own applications with CSM, the following examples show basic usage.
2. Open your web browser and navigate to `http://localhost:5000` to access the application.

#### Generate a sentence
3. Use the interface to start a conversation with the AI assistant.
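
Before opening the browser, it can help to confirm that the server started in step 1 is actually answering. The snippet below is a small sketch that uses only the standard library; `wait_for_server` is a hypothetical helper, and the default URL simply mirrors the address given in step 2.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str = "http://localhost:5000",
                    timeout_s: float = 30.0,
                    interval_s: float = 0.5) -> bool:
    """Poll `url` until it returns any HTTP response or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2):
                return True
        except urllib.error.HTTPError:
            # The server responded, just not with a 2xx status.
            return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet (connection refused, timeout, ...); retry.
            time.sleep(interval_s)
    return False
```

For example, `wait_for_server()` could be called right after launching `python server.py` in another terminal; a `False` result means nothing answered within the timeout.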

This will use a random speaker identity, as no prompt or context is provided.
## Contributing
Contributions are welcome! Please submit a pull request or open an issue for any enhancements or bug fixes.

```python
from generator import load_csm_1b
import torchaudio
import torch

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

generator = load_csm_1b(device=device)

audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

#### Generate with context

CSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker's utterance.

NOTE: The following example is instructional and the audio files do not exist. It is intended as an example for using context with CSM.

```python
import torchaudio
from generator import Segment

# `generator` is the object returned by load_csm_1b() in the previous example.

speakers = [0, 1, 0, 0]
transcripts = [
    "Hey how are you doing.",
    "Pretty good, pretty good.",
    "I'm great.",
    "So happy to be speaking to you.",
]
audio_paths = [
    "utterance_0.wav",
    "utterance_1.wav",
    "utterance_2.wav",
    "utterance_3.wav",
]

def load_audio(audio_path):
    # Load the file and resample it to the generator's sample rate.
    # squeeze(0) drops the channel dimension, so mono input is assumed.
    audio_tensor, sample_rate = torchaudio.load(audio_path)
    audio_tensor = torchaudio.functional.resample(
        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )
    return audio_tensor

# One Segment per earlier utterance: transcript, speaker id, and audio.
segments = [
    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))
    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)
]
audio = generator.generate(
    text="Me too, this is some cool stuff huh?",
    speaker=1,
    context=segments,
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
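
In a long conversation you may want to cap how much earlier audio is passed as `context`, for example to bound memory use or latency. The sketch below shows one way to keep only the most recent utterances. It rests on assumptions: `trim_context` is a hypothetical helper, and the `Segment` class here is a stand-in that only mirrors the `text`, `speaker`, and `audio` fields used in the example above so that the snippet runs on its own.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    """Stand-in with the same fields as the `Segment` used above."""

    speaker: int
    text: str
    audio: Sequence[float]  # 1-D samples; a 1-D torch.Tensor also supports len()


def trim_context(segments: List[Segment], sample_rate: int,
                 max_seconds: float) -> List[Segment]:
    """Keep the most recent segments whose combined audio fits in `max_seconds`."""
    budget = int(max_seconds * sample_rate)  # budget expressed in samples
    kept: List[Segment] = []
    used = 0
    # Walk backwards so the newest utterances are considered first.
    for segment in reversed(segments):
        n_samples = len(segment.audio)
        if used + n_samples > budget:
            break
        kept.append(segment)
        used += n_samples
    kept.reverse()  # restore chronological order
    return kept
```

With the real classes, the result could be passed as `context=trim_context(segments, generator.sample_rate, max_seconds=...)`. The right budget is not stated in the source and would need to be chosen by experiment.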

## FAQ

**Does this model come with any voices?**

The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.

**Can I converse with the model?**

CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.

**Does it support other languages?**

The model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.

## Misuse and abuse ⚠️

This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we **explicitly prohibit** the following:

- **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes.

By using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.

---

## Authors
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.
## License
This project is licensed under the MIT License. See the LICENSE file for more details.