Analyzing Private Data with OpenAI

Web Q&A with Embeddings

Learn how to crawl your website and build a Q/A bot with the OpenAI API. You can find the full tutorial in the OpenAI documentation.

1. Set up the Python environment

# Create a virtual environment
zrf@debian:~/git/web-crawl-q-and-a-example$ python3 -m venv openai-env

# Activate the environment
zrf@debian:~/git/web-crawl-q-and-a-example$ source openai-env/bin/activate
(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ 

# Install the openai Python library (using the Tsinghua mirror)
(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ pip install openai  -i https://pypi.tuna.tsinghua.edu.cn/simple

2. Configure the local API key

# Have git ignore the key file (and the virtual environment)
(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ cat .gitignore 
.env
openai-env

# Add your own API key
(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ cat .env 
# Once you add your API key below, make sure to not share it with anyone! The API key should remain private.
OPENAI_API_KEY=abc123

# If your key has to go through a proxy, also set the following environment variable
OPENAI_API_BASE_URL=

# Load the environment variables into the current shell
(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ source .env 

# Check the environment variable
(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ echo $OPENAI_API_KEY
abc123

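The following code picks up your key; a proxy address stored in OPENAI_API_BASE_URL is not read automatically and has to be passed explicitly as base_url. Note that source .env only sets shell variables: unless they are exported, a Python process started from that shell will not see them.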

from openai import OpenAI

client = OpenAI()
# defaults to getting the key using os.environ.get("OPENAI_API_KEY")
# if you saved the key under a different environment variable name, you can do something like:
# client = OpenAI(
#   api_key=os.environ.get("CUSTOM_ENV_NAME"),
# )

3. A quick test

Test code

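import os
from openai import OpenAI
from dotenv import load_dotenv  # provided by the python-dotenv package

# Read OPENAI_API_KEY and OPENAI_API_BASE_URL from the .env file
load_dotenv()

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
    # An empty OPENAI_API_BASE_URL becomes None, so the client uses its default
    base_url=os.environ.get("OPENAI_API_BASE_URL") or None,
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # System prompt: "You are good at Chinese-English translation"
        {"role": "system", "content": "你擅长中英文翻译"},
        # User prompt: "Translate 'stay hungry, stay foolish' into Chinese"
        {"role": "user", "content": "将'stay hungry, stay foolish'翻译为中文"}
    ]
)

print(completion.choices[0].message.content)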

Test result


(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ source .env 
(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ python3 openai-test.py 
"stay hungry, stay foolish" 的中文翻译是 "求知若饥，虚心若愚"。

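3. Let the AI analyze private data

In this section we have OpenAI analyze local private data, then ask it questions to check whether its analysis is correct.

3.1 Create the private data document

I first use Chinese private data with a Chinese question, then switch to English private data with an English question.

# Chinese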
今年是2023年
我叫zrf，女，今年18岁了。
zrf有一个异性的朋友，1998年出生。

# English
This year is 2023.
My name is zrf, female, and I'm 18 years old this year.
zrf has a friend of the opposite sex who was born in 1998.

3.2 Generate the embedding vectors

Take a look at the generated embeddings

(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ cat processed/embeddings.csv
,text,n_tokens,embeddings
0,. 今年是2023年 我叫zrf，女，今年18岁了。 zrf有一个异性的朋友，1998年出生。 ,39,   \
"[-0.025911948, -0.027703455, -0.0065136873, ... (long, rest omitted)]"

(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ cat processed/embeddings.csv
,text,n_tokens,embeddings
0,". This year is 2023. My name is zrf, female, and I'm 18 years old this year. zrf has a friend of the opposite sex who was born in 1998.",43,\
"[-0.027196646, -0.022042233, -0.010211693, 0.0008644648, -0.0007859507, 0.0085734185, -0.037790384, -0.011888819, -0.026989432, -0.026393697, 0.024930257, 0.011176526, 0.0037103994, ... (long, rest omitted)]"

3.3 Ask questions


print(answer_question(df, question="zrf的朋友是男性还是女性,zrf的朋友比zrf大几岁"))
print(answer_question(df, question="zrf's friend is male or female, and how many years older is zrf's friend than zrf?"))

3.4 Answers

(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ python3 web-qa.py 
zrf的朋友是男性，zrf的朋友比zrf大25岁。

(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ python3 web-qa.py 
zrf's friend is male and is 25 years older than zrf.

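Why? The gender is right, but why 25 years older?
It seems to have subtracted 1998 from 2023, which gives the friend's age rather than the age difference. Or did it simply not understand my question?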
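
For reference, the expected answer can be worked out directly from the three facts in the document. The small sketch below (variable names are illustrative, and birthdays within the year are ignored) suggests that 25 is the friend's age in 2023, while the age gap the question asked for is 7:

```python
current_year = 2023
zrf_age = 18
friend_birth_year = 1998

# 2023 - 1998: the friend's age, which is the number the model returned
friend_age = current_year - friend_birth_year
# The quantity the question actually asked for
age_gap = friend_age - zrf_age

print(f"friend's age: {friend_age}, years older than zrf: {age_gap}")
```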

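4. Asking questions with a personal website as the private data source

Getting the data in text form is the first step in using embeddings. This tutorial crawls the zrfyun.top website to build a new dataset; the same technique can be used for your own company or personal website.

4.1 Install dependencies

Note that all Python work is done inside the openai-env virtual environment; take care not to install into the system Python environment.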

# Tell pip to read the list of packages to install from the given file
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

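4.2 Build the web crawler

The crawler starts from the root URL passed in at the bottom of the code below, visits each page, looks for further links, and visits those pages too (as long as they share the same root domain). To begin, import the required packages, set the base URL, and define the HTMLParser class.

The crawled data is stored under text/zrfyun.top/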

(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ ls -al text/zrfyun.top/
总计 1104
drwxr-xr-x 2 zrf zrf   4096 12月16日 16:01  .
drwxr-xr-x 3 zrf zrf   4096 12月16日 16:01  ..
-rw-r--r-- 1 zrf zrf   6032 12月16日 16:01  rfyun.top_2023_11_19_20231120-01.txt
-rw-r--r-- 1 zrf zrf   5921 12月16日 16:01  rfyun.top_2023_11_20_20231120-02.txt
-rw-r--r-- 1 zrf zrf  26262 12月16日 16:01  rfyun.top_2023_11_25_backtrace.txt
-rw-r--r-- 1 zrf zrf  31425 12月16日 16:01  rfyun.top_2023_12_03_20231203-01.txt


from html.parser import HTMLParser

# Regex pattern to match a URL
HTTP_URL_PATTERN = r'^http[s]{0,1}://.+$'

# Define OpenAI api_key
# openai.api_key = '<Your API Key>'

# Define root domain to crawl
domain = "zrfyun.top"
full_url = "https://zrfyun.top/"

# Create a class to parse the HTML and get the hyperlinks
class HyperlinkParser(HTMLParser):
    def __init__(self):
        ...

    # Override the HTMLParser's handle_starttag method to get the hyperlinks
    def handle_starttag(self, tag, attrs):
        ...

The next function takes a URL as an argument, opens the URL, and reads the HTML content. It then returns all the hyperlinks found on that page.


# Function to get the hyperlinks from a URL
def get_hyperlinks(url):
    return parser.hyperlinks
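
Since the excerpt above shows only the outline, here is a minimal sketch of how the parser and get_hyperlinks could fit together, using only the standard library. The helper extract_hyperlinks is a hypothetical name added so the parsing step can be tried without network access, and the error handling is an assumption of this sketch rather than the exact code behind this post.

```python
import urllib.request
from html.parser import HTMLParser


class HyperlinkParser(HTMLParser):
    """Collects the href of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hyperlinks = []

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href:
            self.hyperlinks.append(href)


def extract_hyperlinks(html):
    """Return all href values found in an HTML string (hypothetical helper)."""
    parser = HyperlinkParser()
    parser.feed(html)
    return parser.hyperlinks


def get_hyperlinks(url):
    """Download a page and return its links; return [] on errors or non-HTML."""
    try:
        with urllib.request.urlopen(url) as response:
            content_type = response.info().get("Content-Type") or ""
            if not content_type.startswith("text/html"):
                return []
            html = response.read().decode("utf-8", errors="replace")
    except Exception as exc:  # network errors, HTTP errors, malformed URLs
        print(exc)
        return []
    return extract_hyperlinks(html)


# Offline check on an inline HTML snippet (no network access needed).
sample_html = '<p><a href="/about">About</a> <a href="https://zrfyun.top/">Home</a></p>'
print(extract_hyperlinks(sample_html))
```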

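The goal is to crawl and index only the content under the zrfyun.top domain. This requires a function that calls get_hyperlinks but filters out any URL that does not belong to the specified domain.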

# Function to get the hyperlinks from a URL that are within the same domain
def get_domain_hyperlinks(local_domain, url):

    return list(set(clean_links))
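
The filtering step can be sketched as a pure function over a list of links. filter_domain_links is a hypothetical name; it takes the links directly instead of calling get_hyperlinks so that it runs offline, and it returns a sorted list (rather than list(set(...))) only to make the output deterministic. The sample links are made up.

```python
import re
from urllib.parse import urlparse

HTTP_URL_PATTERN = r'^http[s]{0,1}://.+$'


def filter_domain_links(local_domain, links):
    """Keep links on local_domain and turn relative links into absolute ones."""
    clean_links = []
    for link in set(links):
        clean_link = None
        if re.search(HTTP_URL_PATTERN, link):
            # Absolute URL: keep it only if the host matches the domain.
            if urlparse(link).netloc == local_domain:
                clean_link = link
        else:
            # Skip in-page anchors and non-web schemes.
            if link.startswith(("#", "mailto:", "tel:")):
                continue
            clean_link = "https://" + local_domain + "/" + link.lstrip("/")
        if clean_link is not None:
            # Drop trailing slashes so the same page is not listed twice.
            clean_links.append(clean_link.rstrip("/"))
    return sorted(set(clean_links))


links = ["/posts/a/", "https://zrfyun.top/", "https://example.com/a",
         "#top", "mailto:a@b.c", "about"]
print(filter_domain_links("zrfyun.top", links))
```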

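The crawl function is the final step in setting up the web scraping task. It keeps track of visited URLs to avoid repeating the same page, which may be linked from multiple pages of the site. It also extracts the raw text from each page without the HTML tags and writes the text content to a local .txt file specific to that page.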


def crawl(url):
    ...

crawl(full_url)

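The last line of the example above runs the crawler, which goes through all accessible links and turns those pages into text files. This takes a few minutes to run, depending on the size and complexity of the site.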
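
The bookkeeping that keeps the crawler from visiting a page twice can be sketched as follows. This is a simplified illustration, not the crawl function used for this post: the get_links and save_page parameters are assumptions introduced so the loop can be run against a made-up in-memory site, and the breadth-first order is a choice of this sketch.

```python
from collections import deque


def crawl(start_url, get_links, save_page=None):
    """Visit every URL reachable from start_url exactly once, breadth-first.

    get_links(url) returns the in-domain links found on a page; save_page(url),
    if given, is called once per visited page (for example to write its text
    to disk). Both callables are assumptions of this sketch.
    """
    queue = deque([start_url])
    seen = {start_url}  # URLs already queued, so no page is visited twice
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        if save_page is not None:
            save_page(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# A made-up three-page site whose pages link to each other in a cycle.
site = {
    "https://zrfyun.top": ["https://zrfyun.top/a", "https://zrfyun.top/b"],
    "https://zrfyun.top/a": ["https://zrfyun.top", "https://zrfyun.top/b"],
    "https://zrfyun.top/b": ["https://zrfyun.top/a"],
}
print(crawl("https://zrfyun.top", lambda url: site.get(url, [])))
```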

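4.3 Build the embeddings

CSV is a common format for storing embeddings. You can use it in Python by converting the raw text files (in the text directory) into a Pandas DataFrame. Pandas is a popular open-source library for working with tabular data (data stored in rows and columns).
Blank lines clutter the text files and make them harder to process. A simple function removes these lines and tidies the files.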

def remove_newlines(serie):
    ...

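Converting the text to CSV means looping over the text files in the text directory created earlier. After opening each file, remove the extra whitespace and append the cleaned text to a list. Then add the text with newlines removed to an empty Pandas DataFrame and write the DataFrame to a CSV file.

Extra spacing and newlines clutter the text and complicate the embedding process. The code used here removes some of these characters, but you may find that third-party libraries or other methods remove more of the unwanted characters.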
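
The cleaning and CSV steps can be sketched with the standard library alone. This deliberately swaps Pandas for the csv module to keep the example dependency-free, so it works on plain strings rather than on a Series; texts_to_csv and the fname column are hypothetical names, and the temporary directory stands in for text/zrfyun.top/.

```python
import csv
import re
import tempfile
from pathlib import Path


def remove_newlines(text):
    """Replace real and escaped newlines with spaces and collapse repeated whitespace."""
    text = text.replace("\\n", " ")
    return re.sub(r"\s+", " ", text).strip()


def texts_to_csv(text_dir, csv_path):
    """Write one (fname, text) row per .txt file found in text_dir."""
    rows = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        cleaned = remove_newlines(path.read_text(encoding="utf-8"))
        rows.append((path.stem, cleaned))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["fname", "text"])
        writer.writerows(rows)
    return rows


# Demo on a temporary directory with one small made-up page.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "page.txt").write_text("line one\n\n\nline   two\n", encoding="utf-8")
    rows = texts_to_csv(tmp, Path(tmp, "scraped.csv"))
    print(rows)
```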

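The newest embedding model can handle inputs of up to 8191 tokens, so most rows need no chunking. That may not hold for every crawled subpage, though, so the next code block splits longer rows into smaller chunks.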

max_tokens = 500

# Function to split the text into chunks of a maximum number of tokens
def split_into_many(text, max_tokens = max_tokens):

    return chunks
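
One way the chunking could work is to pack whole sentences greedily until the token budget is reached. The sketch below is an illustration under stated assumptions: count_tokens counts whitespace-separated words as a stand-in for a real tokenizer, and the count parameter is added here so that a real token counter can be plugged in.

```python
def count_tokens(text):
    """Stand-in token counter: whitespace-separated words (an assumption of this sketch)."""
    return len(text.split())


def split_into_many(text, max_tokens=500, count=count_tokens):
    """Greedily pack sentences into chunks of at most max_tokens tokens."""
    sentences = [s.strip().rstrip(".") for s in text.split(". ")]
    sentences = [s for s in sentences if s]
    chunks, chunk, tokens_so_far = [], [], 0
    for sentence in sentences:
        n_tokens = count(sentence)
        if n_tokens > max_tokens:
            continue  # a single sentence over the limit is skipped
        if chunk and tokens_so_far + n_tokens > max_tokens:
            # The current chunk is full: close it and start a new one.
            chunks.append(". ".join(chunk) + ".")
            chunk, tokens_so_far = [], 0
        chunk.append(sentence)
        tokens_so_far += n_tokens
    if chunk:
        chunks.append(". ".join(chunk) + ".")  # keep the final partial chunk
    return chunks


text = "one two three. four five. six seven eight nine. ten"
chunks = split_into_many(text, max_tokens=5)
print(chunks)
```

With a budget of five words, the sample text is split into two chunks of two sentences each.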

The content is now broken into smaller chunks, and a simple request to the OpenAI API, specifying the text-embedding-ada-002 model, creates the embeddings:

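import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

# The v1 client takes model= (not engine=) and returns an object, so the
# vector is read through attributes rather than dictionary keys
df['embeddings'] = df.text.apply(lambda x: client.embeddings.create(input=x, model='text-embedding-ada-002').data[0].embedding)

df.to_csv('processed/embeddings.csv')
df.head()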

This takes a while; once it finishes, the embeddings are ready to use.

The embeddings are saved in CSV format under processed/

(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ ls -al processed/
总计 2744
drwxr-xr-x 2 zrf zrf    4096 12月16日 16:07 .
drwxr-xr-x 6 zrf zrf    4096 12月16日 16:01 ..
-rw-r--r-- 1 zrf zrf 1762206 12月16日 16:07 embeddings.csv
-rw-r--r-- 1 zrf zrf 1032224 12月16日 16:01 scraped.csv
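
When processed/embeddings.csv is read back, the embeddings column arrives as a string rather than a list of floats, so it has to be parsed before any distance can be computed. A minimal sketch using only the standard library, assuming the column layout shown in section 3.2 (an unnamed index column, then text, n_tokens, embeddings); load_embeddings is a hypothetical helper and the sample row is made up:

```python
import csv
import io
import json


def load_embeddings(csv_file):
    """Read rows of (index, text, n_tokens, embeddings) and parse the vector column."""
    rows = []
    for row in csv.DictReader(csv_file):
        rows.append({
            "text": row["text"],
            "n_tokens": int(row["n_tokens"]),
            # The vector is stored as a bracketed list inside a quoted string.
            "embeddings": json.loads(row["embeddings"]),
        })
    return rows


# Tiny made-up sample in the same layout as processed/embeddings.csv.
sample = io.StringIO(
    ',text,n_tokens,embeddings\n'
    '0,"This year is 2023.",6,"[-0.02, 0.01, 0.5]"\n'
)
rows = load_embeddings(sample)
print(rows[0]["n_tokens"], len(rows[0]["embeddings"]))
```

For the real file, the same function could be called on a handle opened with open("processed/embeddings.csv", newline="", encoding="utf-8").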

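4.4 Build a question-answering system with the embeddings

With the embeddings ready, the last step is to build a simple question-answering system. It takes the user's question, creates an embedding of it, and compares that with the existing embeddings to retrieve the most relevant text from the crawled site. The gpt-3.5-turbo-instruct model then generates a natural-sounding answer based on the retrieved text.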

def create_context(
    question, df, max_len=1800, size="ada"
):
    ...

def answer_question(
    df,
    model="gpt-3.5-turbo-instruct",
    question="Am I allowed to publish model outputs to Twitter, without a human review?",
    max_len=1800,
    size="ada",
    debug=False,
    max_tokens=150,
    stop_sequence=None
):
    ...

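# "What are the one premise and the two properties of asymmetric encryption algorithms?"
print(answer_question(df, question="非对称加密算法的一个前提和两个特性是什么", debug=False))

# "Which HTTP server does this website use?"
print(answer_question(df, question="这个网站使用的http server是什么?", debug=False))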
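
The retrieval step behind create_context can be sketched as ranking the stored texts by cosine distance to the question's embedding and joining the closest ones until a token budget is used up. The sketch is simplified and self-contained: it takes an already computed question embedding and a list of row dictionaries instead of a DataFrame, and the two-dimensional vectors are toy values, not real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def create_context(question_embedding, rows, max_len=1800):
    """Join the closest texts, nearest first, until the token budget is used up."""
    ranked = sorted(
        rows, key=lambda r: cosine_distance(question_embedding, r["embeddings"])
    )
    picked, cur_len = [], 0
    for row in ranked:
        cur_len += row["n_tokens"] + 4  # +4 allows for the separator between texts
        if cur_len > max_len:
            break
        picked.append(row["text"])
    return "\n\n###\n\n".join(picked)


# Toy 2-dimensional "embeddings"; real ones have far more dimensions.
rows = [
    {"text": "about nginx", "n_tokens": 3, "embeddings": [1.0, 0.0]},
    {"text": "about crypto", "n_tokens": 3, "embeddings": [0.0, 1.0]},
    {"text": "about servers", "n_tokens": 3, "embeddings": [0.9, 0.1]},
]
context = create_context([1.0, 0.0], rows, max_len=15)
print(context)
```

With a budget of 15 tokens only the two nearest texts fit, so the unrelated "about crypto" row is left out of the context that would be passed to the model.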

4.5 Test results

Let's see how it answers


(openai-env) zrf@debian:~/git/web-crawl-q-and-a-example$ python3 web-qa.py 
一个前提是需要一对公钥和私钥，两个特性是加密和解密的过程是不同的，公钥可以公开，私钥需要保密。
nginx.

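Results:

It takes quite a while to run.
The first answer is not good: it named having a public/private key pair as the premise, and gave the two properties as encryption and decryption being different processes, and the public key being public while the private key is kept secret. The correct answer should be:

One premise: the public key can be made public, while the private key is held only by its owner
Two properties: encrypt with the public key and decrypt with the private key; encrypt with the private key and decrypt with the public key

The second answer, nginx., is correct.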