Even an RTX 4070 is enough for this setup: a fully local (Local LLM) Slack translation and knowledge bot.



Overview

This article covers an experimental use of Local LLMs and will be updated regularly.

This approach is useful when you don't want to pay for cloud LLMs, or when you handle information that can't be sent to public generative AI services, such as company or personal data.

The key is that the Slack AI bot runs entirely on your own PC. I'll demonstrate how to use LLMs even with a relatively small amount of GPU RAM, such as the 8GB on an RTX 4070. Specifically, the "translategemma:4b" model, which has a small parameter count and 4-bit quantization, runs in only around 4GB of GPU RAM.
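
If you want to confirm how much memory the model actually occupies once it has been loaded, Ollama can list the models it is currently running:

ollama ps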

Development environment

This project is built with Python 3.12.

You'll need to install the following modules:

  • slack-bolt
  • requests

I'll use a venv virtual environment to keep things simple and avoid conflicts with other projects; this gives the project its own isolated set of dependencies.

App settings

We will be using Ollama to access the Local LLM.

Please install Ollama, then pull whichever LLM you wish to use. For this example, I'll use the "translategemma:4b" model. Ollama is available here:
https://ollama.com/
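
After installing Ollama, you can pull the model from the terminal (the same model name is used later in the script):

ollama pull translategemma:4b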

You’ll also need to set up a bot and API key in your Slack workspace.

Slack API Configuration (Most Important Points)

Instead of allowing external access to your home PC, we will use Socket Mode to connect to Slack. This allows your home PC to “reach out” to Slack.

① App Creation

  1. In the Slack API, click “Create New App” → “From scratch”.
  2. In the left menu, go to [Settings] > [Basic Information] and find the “App-level Tokens” section at the bottom.
  • Generate a token; the token name can be anything. Add connections:write to its "Scopes".
  • Make a note of the token beginning with xapp-.

② Permission Settings

  1. In [Features] > [OAuth & Permissions], add the following to the “Bot Token Scopes” section:
  • app_mentions:read: To respond to mentions.
  • chat:write: To post replies.
  2. Click "Install to Workspace" and make a note of the token beginning with xoxb-.

③ Enable Socket Mode and Event Subscriptions

  1. In [Settings] > [Socket Mode], click “Enable Socket Mode”.
  2. In [Features] > [Event Subscriptions], click “Enable Events”.
  • Add app_mention to “Subscribe to bot events” and save.
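
Once slack-bolt is installed (covered in the Code section below), you can sanity-check the bot token before wiring everything together. A minimal sketch, assuming slack_sdk (installed as a dependency of slack-bolt) and your own xoxb- token in place of the placeholder:

from slack_sdk import WebClient

# Quick check that the xoxb- bot token is valid and which workspace it belongs to.
client = WebClient(token="xoxb-your-bot-token")
resp = client.auth_test()
print(resp["ok"], resp.get("team"), resp.get("user"))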

Code

I'm assuming you have Python 3.12 installed on your Windows machine. If not, please install it from the official website first.

Create a new project folder within your user’s Documents folder, and create a Python virtual environment inside that folder.

cd ~\documents
mkdir local-llm-slackbot
cd local-llm-slackbot
py -3.12 -m venv venv

Next, activate the venv environment and install the official Slack framework (slack-bolt) and the requests library with pip.

.\venv\Scripts\activate
py -m pip install --upgrade pip
pip install slack-bolt requests
deactivate
explorer .

The last command opens the local-llm-slackbot folder in File Explorer.

Create a new Python file named app.py inside that folder, and paste the following code into it. Then, paste the two Slack tokens into the appropriate places.

  • SLACK_BOT_TOKEN: This should start with "xoxb-".
  • SLACK_APP_TOKEN: This should start with "xapp-".

Finally, set OLLAMA_MODEL to the name of the model you are running via Ollama. You'll launch the app from the terminal later.
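
Hard-coding the tokens is fine for a quick local test, but if you would rather keep them out of the file, you could read them from environment variables instead. A minimal sketch (the variable names are just an example):

import os

# Read the Slack tokens from environment variables instead of hard-coding them.
# Set SLACK_BOT_TOKEN and SLACK_APP_TOKEN in your shell before starting the app.
SLACK_BOT_TOKEN = os.environ["SLACK_BOT_TOKEN"]
SLACK_APP_TOKEN = os.environ["SLACK_APP_TOKEN"]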

import requests
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

# ==========================================
# 1. Token declaration (rewrite this)
# ==========================================
SLACK_BOT_TOKEN = "xoxb-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
SLACK_APP_TOKEN = "xapp-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
OLLAMA_MODEL = "translategemma:4b"  # Model name to be used

# Initializing the App
app = App(token=SLACK_BOT_TOKEN)

# ==========================================
# 2. History management dictionary
# { "user ID": [message log list] }
# ==========================================
user_histories = {}

@app.event("app_mention")
def handle_mention(event, say):
    user_id = event['user']
    # Strip the mention (<@U...>) to get the plain question text
    raw_text = event['text']
    user_query = raw_text.split('> ')[-1] if '> ' in raw_text else raw_text

    # If there is no history for that user, initialize
    if user_id not in user_histories:
        user_histories[user_id] = []

    # Add user's question to history
    user_histories[user_id].append({"role": "user", "content": user_query})

    # For debugging: show who is talking to you in the terminal
    print(f"User {user_id} says: {user_query}")

    # Send the request to Ollama
    try:
        # Send the entire history as messages so the model keeps the conversation context
        response = requests.post(
            'http://localhost:11434/api/chat',
            json={
                "model": OLLAMA_MODEL,
                "messages": user_histories[user_id],
                "stream": False
            },
            timeout=60  # Set response waiting time
        )
        response.raise_for_status()
        
        # Get the AI's answer
        result = response.json()
        answer = result.get('message', {}).get('content', "Sorry, we were unable to generate an answer.")

        # AI answers are also added to the history (to provide context for the next conversation)
        user_histories[user_id].append({"role": "assistant", "content": answer})

        # Token saving and memory management:
        # If the history grows too long it puts pressure on VRAM and slows down inference,
        # so keep only the most recent 10 round trips (20 messages).
        if len(user_histories[user_id]) > 20:
            user_histories[user_id] = user_histories[user_id][-20:]

        # Reply on Slack
        say(f"<@{user_id}> \n{answer}")

    except Exception as e:
        error_msg = f"An error has occurred: {str(e)}"
        print(error_msg)
        say(error_msg)

# ==========================================
# 3. Run
# ==========================================
if __name__ == "__main__":
    print(f"⚡️ Bolt app is running with model: {OLLAMA_MODEL}")
    handler = SocketModeHandler(app, SLACK_APP_TOKEN)
    handler.start()
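
Before connecting everything to Slack, you can also check that Ollama itself answers over its HTTP API. A minimal sketch, assuming Ollama's default port 11434 and the same model name as in app.py:

import requests

# Standalone check of Ollama's /api/chat endpoint, independent of Slack.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "translategemma:4b",
        "messages": [{"role": "user", "content": "Translate into English: こんにちは"}],
        "stream": False,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])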

Usage

Open your terminal.

Start Ollama with the model you want to use:

ollama run translategemma:4b

Keep the Ollama terminal open, and then open a new terminal window to launch your app.

cd ~\documents\local-llm-slackbot
.\venv\Scripts\activate
python app.py

Once your app is running, you’ll see a message in the terminal that says something like: “Bolt app is running!”

Now you can start communicating with your local LLM-powered Slack bot from your Slack workspace.

In your Slack workspace, invite the bot you created to the desired channel.

When you mention the bot in a Slack channel, it should respond.
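
For example, a mention like the one below (the bot name is just an example) should get a translated reply:

@local-llm-bot Please translate into Japanese: "The meeting has been moved to 3 p.m."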

The "translategemma:4b" model, with its 4-bit quantization, requires relatively little GPU RAM. It's also specifically designed for translation, which results in good translation quality. In addition, the script keeps conversation history per user, so each user's context is handled separately.

When you want to end the interactive Ollama session, type the following command at its prompt:

/bye

To stop the app.py script, press Ctrl+C; you'll be back at the activated venv prompt.
