Sweep: Add a separate scoring function for filenames in vector_db #332

Open
wwzeng1 opened this issue Jul 7, 2023 · 1 comment · May be fixed by #796
Labels
sweep Assigns Sweep to an issue or pull request.

Comments

wwzeng1 commented Jul 7, 2023

Description

Use a new scoring function

Relevant files

No response

@wwzeng1 wwzeng1 added the sweep Assigns Sweep to an issue or pull request. label Jul 7, 2023
@sweepai sweepai deleted a comment from sweep-nightly bot Jul 9, 2023
@wwzeng1 wwzeng1 added sweep Assigns Sweep to an issue or pull request. and removed sweep Assigns Sweep to an issue or pull request. labels Jul 9, 2023
sweep-nightly bot commented Jul 9, 2023

Here's the PR! #796.

💎 Sweep Pro: I used GPT-4 to create this ticket. You have 21 GPT-4 tickets left.


Step 1: 🔍 Code Search

I found the following snippets in your repository. I will now analyze these snippets and come up with a plan.

Some code snippets I looked at. If some file is missing from here, you can mention the path in the ticket description.

try:
    github_cache_key = f"github-{commit_hash}{CACHE_VERSION}"
    cache_hit = cache_inst.get(github_cache_key)
    if cache_hit:
        deeplake_items = json.loads(cache_hit)
        logger.info(f"Cache hit for {repo_name}")
    else:
        deeplake_items = None
        logger.info(f"Cache miss for {repo_name}")
    if deeplake_items:
        deeplake_vs = init_deeplake_vs(repo_name)
        deeplake_vs.add(
            text=deeplake_items['ids'],
            embedding=deeplake_items['embeddings'],
            metadata=deeplake_items['metadatas']
        )
        logger.info(f"Returning deeplake vs for {repo_name}")
        return deeplake_vs
    else:
        logger.info(f"Cache for {repo_name} is empty")
except Exception:
    logger.info(f"Failed to get cache for {repo_name}")
logger.info(f"Downloading repository and indexing for {repo_name}...")
start = time.time()
logger.info("Recursively getting list of files...")
repo_url = f"https://x-access-token:{token}@github.com/{repo_name}.git"
shutil.rmtree("repo", ignore_errors=True)
branch_name = SweepConfig.get_branch(repo)
git_repo = Repo.clone_from(repo_url, "repo")
git_repo.git.checkout(branch_name)
file_list = glob.iglob("repo/**", recursive=True)
file_list = [
    file
    for file in tqdm(file_list)
    if os.path.isfile(file)
    and all(not file.endswith(ext) for ext in sweep_config.exclude_exts)
    and all(not file[len("repo/"):].startswith(dir_name) for dir_name in sweep_config.exclude_dirs)
]
file_paths = []
file_contents = []
scores = []
for file in tqdm(file_list):
    with open(file, "rb") as f:
        is_binary = False
        for block in iter(lambda: f.read(1024), b''):
            if b'\0' in block:
                is_binary = True
                break
        if is_binary:
            logger.debug("Skipping binary file...")
            continue
    with open(file, "rb") as f:
        if len(f.read()) > sweep_config.max_file_limit:
            logger.debug("Skipping large file...")
            continue
    with open(file, "r") as f:
        # Can parallelize this
        try:
            contents = f.read()
            contents = file + contents
        except UnicodeDecodeError as e:
            logger.warning(f"Received warning {e}, skipping...")
            continue
    file_path = file[len("repo/"):]
    file_paths.append(file_path)
    file_contents.append(contents)
    if len(file_list) > MAX_FILES:
        scores.append(1)
        continue
    try:
        cache_key = f"{repo_name}-{file_path}-{CACHE_VERSION}"
        if cache_inst and cache_success:
            cached_value = cache_inst.get(cache_key)
            if cached_value:
                score = json.loads(cached_value)
                scores.append(score)
                continue
        commits = list(repo.get_commits(path=file_path, sha=branch_name))
        score = compute_score(contents, commits)
        if cache_inst and cache_success:
            cache_inst.set(cache_key, json.dumps(score), ex=60 * 60 * 2)
        scores.append(score)
    except Exception as e:
        logger.warning(f"Received warning during scoring {e}, skipping...")
        scores.append(1)
        continue
scores = convert_to_percentiles(scores)
chunked_results = chunker.map(file_contents, file_paths, scores, kwargs={
    "additional_metadata": {"repo_name": repo_name, "branch_name": branch_name}
})
documents, metadatas, ids = zip(*chunked_results)
documents = [item for sublist in documents for item in sublist]
metadatas = [item for sublist in metadatas for item in sublist]
ids = [item for sublist in ids for item in sublist]
logger.info(f"Used {len(file_paths)} files...")
shutil.rmtree("repo", ignore_errors=True)
logger.info(f"Getting list of all files took {time.time() - start}")
logger.info(
    f"Received {len(documents)} documents from repository {repo_name}")
collection_name = parse_collection_name(repo_name)
return compute_deeplake_vs(collection_name, documents, cache_success, cache_inst, ids, metadatas, commit_hash)
def compute_deeplake_vs(collection_name,
                        documents,
                        cache_success,
                        cache_inst,
                        ids,
                        metadatas,
                        sha):
    deeplake_vs = init_deeplake_vs(collection_name)
    if len(documents) > 0:
        logger.info("Computing embeddings...")
        # Check cache here for all documents
        embeddings = [None] * len(documents)
        if cache_inst and cache_success:
            cache_keys = [hash_sha256(doc) + SENTENCE_TRANSFORMERS_MODEL + CACHE_VERSION for doc in documents]
            cache_values = cache_inst.mget(cache_keys)
            for idx, value in enumerate(cache_values):
                if value is not None:
                    embeddings[idx] = json.loads(value)
        logger.info(
            f"Found {len([x for x in embeddings if x is not None])} embeddings in cache")
        indices_to_compute = [idx for idx, x in enumerate(embeddings) if x is None]
        documents_to_compute = [documents[idx] for idx in indices_to_compute]
        computed_embeddings = embedding_function(documents_to_compute)
        for idx, embedding in zip(indices_to_compute, computed_embeddings):
            embeddings[idx] = embedding
        deeplake_vs.add(
            text=ids,
            embedding=embeddings,
            metadata=metadatas
        )
        if cache_inst and cache_success:
            cache_inst.set(f"github-{sha}{CACHE_VERSION}", json.dumps(
                {"metadatas": metadatas, "ids": ids, "embeddings": embeddings}))
        if cache_inst and cache_success and len(documents_to_compute) > 0:
            logger.info(
                f"Updating cache with {len(computed_embeddings)} embeddings")
            cache_keys = [hash_sha256(doc) + SENTENCE_TRANSFORMERS_MODEL + CACHE_VERSION for doc in documents_to_compute]
            cache_inst.mset({key: json.dumps(value)
                             for key, value in zip(cache_keys, computed_embeddings)})
        return deeplake_vs
    else:
        logger.error("No documents found in repository")
        return deeplake_vs

@stub.function(image=image, secrets=secrets, network_file_systems={DISKCACHE_DIR: model_volume}, timeout=timeout)
def update_index(
    repo_name,
    installation_id: int,
    sweep_config: SweepConfig = SweepConfig(),
) -> int:
    get_deeplake_vs_from_repo(repo_name, installation_id, branch_name=None, sweep_config=sweep_config)
    # todo: ?
    return 0
@stub.function(image=image, secrets=secrets, network_file_systems={DEEPLAKE_DIR: model_volume}, timeout=timeout, keep_warm=1)
def get_relevant_snippets(
    repo_name: str,
    query: str,
    n_results: int,
    installation_id: int,
    username: str | None = None,
    sweep_config: SweepConfig = SweepConfig(),
):
    deeplake_vs = get_deeplake_vs_from_repo(
        repo_name=repo_name, installation_id=installation_id, sweep_config=sweep_config
    )
    results = {"metadata": [], "text": []}
    for n_result in range(n_results, 0, -1):
        try:
            query_embedding = embedding_function([query])[0]
            results = deeplake_vs.search(embedding=query_embedding, k=n_result)
            break
        except Exception:
            pass
    if len(results["text"]) == 0:
        if username is None:
            username = "anonymous"
        posthog.capture(
            username,
            "failed",
            {
                "reason": "Results query was empty",
                "repo_name": repo_name,
                "installation_id": installation_id,
                "query": query,
                "n_results": n_results
            },
        )
    metadatas = results["metadata"]
    code_scores = [metadata["score"] for metadata in metadatas]

logger.error(snippet)
def search_snippets(
    self,
    query: str,
    installation_id: str,
    num_snippets: int = 30,
) -> list[Snippet]:
    get_relevant_snippets = modal.Function.lookup(DB_MODAL_INST_NAME, "get_relevant_snippets")
    snippets: list[Snippet] = get_relevant_snippets.call(
        self.repo.full_name,
        query=query,
        n_results=num_snippets,
        installation_id=installation_id,
    )
    self.populate_snippets(snippets)
    return snippets

def validate_file_change_requests(self, file_change_requests: list[FileChangeRequest], branch: str = ""):
    for file_change_request in file_change_requests:
        try:
            contents = self.repo.get_contents(file_change_request.filename,
                                              branch or SweepConfig.get_branch(self.repo))
            if contents:
                file_change_request.change_type = "modify"
            else:
                file_change_request.change_type = "create"
        except Exception:
            file_change_request.change_type = "create"
    return file_change_requests
class SweepBot(CodeGenBot, GithubBot):
    def cot_retrieval(self):
        # TODO(sweep): add semantic search using vector db
        # TODO(sweep): add search using webpilot + github
        functions = [
            Function(
                name="cat",
                description="Cat files. Max 3 files per request.",
                parameters={
                    "properties": {
                        "filepath": {
                            "type": "string",
                            "description": "Paths to files. One per line."
                        },
                    }
                }  # manage file too large
            ),
            Function(
                name="finish",
                description="Indicate you have sufficient data to proceed.",
                parameters={"properties": {}}
            ),
        ]
        # self.chat(
        #     cot_retrieval_prompt,
        #     message_key="cot_retrieval",
        #     functions=functions,
        # )
        # is_function_call = self.messages[-1].function_call is not None
        # for _retry in range(3):
        #     logger.info("Got response.")
        #     if not is_function_call:
        #         break
        #     response = self.messages[-1].function_call
        #     # response = json.loads(response)
        #     function_name = response["name"]
        #     arguments = response["arguments"]
        #     logger.info(f"Fetching file {function_name} with arguments {arguments}.")
        #     arguments = json.loads(arguments)
        #     if function_name == "finish":
        #         return
        #     elif function_name == "cat":
        #         path = arguments["filepath"]
        #         try:
        #             logger.info("Retrieving file...")
        #             content = self.get_file(path).decoded_content.decode("utf-8")
        #             logger.info("Received file")
        #         except github.GithubException:
        #             response = self.chat(
        #                 f"File not found: {path}",
        #                 message_key=path,
        #                 functions=functions,
        #             )
        #         else:
        #             response = self.chat(
        #                 f"Here is the file: <file path=\"{path}\">\n\n{content[:10000]}</file>. Fetch more content or call finish.",
        #                 message_key=path,
        #                 functions=functions
        #             )  # update this constant
        # return response
        return
    def create_file(self, file_change_request: FileChangeRequest) -> FileCreation:
        file_change: FileCreation | None = None
        for count in range(5):
            key = f"file_change_created_{file_change_request.filename}"
            create_file_response = self.chat(
                create_file_prompt.format(
                    filename=file_change_request.filename,
                    instructions=file_change_request.instructions,
                    commit_message=f"Create {file_change_request.filename}"
                ),
                message_key=key,
            )
            # Add file to list of changed_files
            self.file_change_paths.append(file_change_request.filename)
            # self.delete_file_from_system_message(file_path=file_change_request.filename)
            try:
                file_change = FileCreation.from_string(create_file_response)
                assert file_change is not None
                file_change.commit_message = f"sweep: {file_change.commit_message[:50]}"
                return file_change
            except Exception:
                # Todo: should we undo appending to file_change_paths?
                logger.warning(f"Failed to parse. Retrying for the {count}th time...")
                self.delete_messages_from_chat(key)
                continue
        raise Exception("Failed to parse response after 5 attempts.")
    def modify_file(
        self,
        file_change_request: FileChangeRequest,
        contents: str = "",
        contents_line_numbers: str = "",
        branch=None,
        chunking: bool = False,
        chunk_offset: int = 0,
    ) -> tuple[str, str]:
        for count in range(5):
            key = f"file_change_modified_{file_change_request.filename}"
            file_markdown = is_markdown(file_change_request.filename)
            # TODO(sweep): edge case at empty file
            message = modify_file_prompt_3.format(
                filename=file_change_request.filename,
                instructions=file_change_request.instructions,
                code=contents_line_numbers,
                line_count=contents.count('\n') + 1
            )
            try:
                if chunking:
                    message = chunking_prompt + message
                    modify_file_response = self.chat(
                        message,
                        message_key=key,
                    )
                    self.delete_messages_from_chat(key)
                else:
                    modify_file_response = self.chat(
                        message,
                        message_key=key,
                    )
            except Exception as e:  # Check for max tokens error
                if "max tokens" in str(e).lower():
                    logger.error(f"Max tokens exceeded for {file_change_request.filename}")
                    raise MaxTokensExceeded(file_change_request.filename)
            try:
                logger.info(
                    f"generate_new_file with contents: {contents} and modify_file_response: {modify_file_response}")
                new_file = generate_new_file_from_patch(modify_file_response, contents, chunk_offset=chunk_offset)
                if not is_markdown(file_change_request.filename) and not chunking:
                    code_repairer = CodeRepairer(chat_logger=self.chat_logger)
                    diff = generate_diff(old_code=contents, new_code=new_file)
                    if diff.strip() != "" and diff_contains_dups_or_removals(diff, new_file):
                        new_file = code_repairer.repair_code(diff=diff, user_code=new_file,
                                                             feature=file_change_request.instructions)
                new_file = format_contents(new_file, file_markdown)
                new_file = new_file.rstrip()
                if contents.endswith("\n"):
                    new_file += "\n"
                return new_file
            except Exception as e:
                tb = traceback.format_exc()
                logger.warning(f"Failed to parse. Retrying for the {count}th time. Received error {e}\n{tb}")
                self.delete_messages_from_chat(key)
                continue
        raise Exception("Failed to parse response after 5 attempts.")
    def change_files_in_github(
        self,
        file_change_requests: list[FileChangeRequest],
        branch: str,
    ):
        # should check if branch exists, if not, create it
        logger.debug(file_change_requests)
        num_fcr = len(file_change_requests)
        completed = 0
        for file_change_request in file_change_requests:
            try:
                if file_change_request.change_type == "create":
                    self.handle_create_file(file_change_request, branch)
                elif file_change_request.change_type == "modify":
                    self.handle_modify_file(file_change_request, branch)
            except MaxTokensExceeded as e:
                raise e
            except Exception as e:
                logger.error(f"Error in change_files_in_github {e}")
            completed += 1
        return completed, num_fcr

    def handle_create_file(self, file_change_request: FileChangeRequest, branch: str):
        try:
            file_change = self.create_file(file_change_request)
            file_markdown = is_markdown(file_change_request.filename)
            file_change.code = format_contents(file_change.code, file_markdown)
            logger.debug(
                f"{file_change_request.filename}, {f'Create {file_change_request.filename}'}, {file_change.code}, {branch}"
            )
            self.repo.create_file(
                file_change_request.filename,
                file_change.commit_message,
                file_change.code,
                branch=branch,
            )
        except Exception as e:
            logger.info(f"Error in handle_create_file: {e}")

    def handle_modify_file(self, file_change_request: FileChangeRequest, branch: str):
        CHUNK_SIZE = 400  # Number of lines to process at a time
        try:
            file = self.get_file(file_change_request.filename, branch=branch)
            file_contents = file.decoded_content.decode("utf-8")
            lines = file_contents.split("\n")
            new_file_contents = ""  # Initialize an empty string to hold the new file contents
            all_lines_numbered = [f"{i + 1}:{line}" for i, line in enumerate(lines)]
            chunking = len(lines) > CHUNK_SIZE * 1.5  # Only chunk if the file is large enough
            file_name = file_change_request.filename
            if not chunking:
                new_file_contents = self.modify_file(
                    file_change_request,
                    contents="\n".join(lines),
                    branch=branch,
                    contents_line_numbers=file_contents if USING_DIFF else "\n".join(all_lines_numbered),
                    chunking=chunking,

sweep/sweepai/app/ui.py

Lines 93 to 190 in b07e4cc

path_to_contents = {}

def get_files(repo_full_name):
    global path_to_contents
    global repo
    if repo_full_name is None:
        all_files = []
    else:
        # Make sure repo is added to Sweep before checking all recursive files
        try:
            installation_id = get_installation_id(repo_full_name)
            assert installation_id
        except Exception:
            return []
        repo = github_client.get_repo(repo_full_name)
        branch_name = SweepConfig.get_branch(repo)
        repo_url = f"https://x-access-token:{config.github_pat}@github.com/{repo_full_name}.git"
        try:
            repo_dir = os.path.join(tempfile.gettempdir(), repo_full_name)
            if os.path.exists(repo_dir):
                git_repo = Repo(repo_dir)
            else:
                git_repo = Repo.clone_from(repo_url, repo_dir)
            git_repo.git.checkout(branch_name)
            git_repo.remotes.origin.pull()
        except Exception as e:
            logger.warning(f"Git pull failed with error {e}, deleting cache and recloning...")
            shutil.rmtree(repo_dir)
            git_repo = Repo.clone_from(repo_url, repo_dir)
            git_repo.git.checkout(branch_name)
            git_repo.remotes.origin.pull()
        all_files, path_to_contents = get_files_recursively(repo_dir)
    return all_files

def get_files_update(*args):
    global repo
    if len(args) > 0:
        repo = args[0]
    else:
        repo = config.repo_full_name
    return gr.Dropdown.update(choices=get_files(repo))

def parse_response(raw_response: str) -> tuple[str, list[tuple[str, str]]]:
    if "Plan:" not in raw_response:
        response, raw_plan = raw_response, ""
    else:
        response, raw_plan = raw_response.split("Plan:", 1)
    if response.startswith("Response:"):
        response = response[len("Response:"):]
    plan = [(line[:line.find(":")].strip(), line[line.find(":") + 1:].strip())
            for line in raw_plan.split("\n*") if line]
    return response, plan

try:
    user_info = api_client.get_user_info()
except Exception as e:
    logger.warning(e)
    user_info = {"is_paying_user": False, "remaining_tickets": 0}
global_state = config.state
with gr.Blocks(theme=gr.themes.Soft(), title="Sweep Chat", css=css) as demo:
    print("Launching gradio!")
    with gr.Row():
        with gr.Column(scale=2):
            repo_full_name = gr.Dropdown(choices=[repo.full_name for repo in repos], label="Repo full name",
                                         value=lambda: config.repo_full_name or "")
        print("Indexing files...")
        with gr.Column(scale=4):
            file_names = gr.Dropdown(choices=get_files(config.repo_full_name), multiselect=True, label="Files",
                                     value=lambda: global_state.file_paths)
        print("Indexed files!")
        repo_full_name.change(fn=get_files_update, inputs=repo_full_name, outputs=file_names)
        with gr.Column(scale=1):
            restart_button = gr.Button("Restart")
    with gr.Row():
        with gr.Column(scale=2):
            chatbot = gr.Chatbot(height=600, value=lambda: global_state.chat_history)
        with gr.Column():
            with gr.Row():
                snippets_text = gr.Markdown(value=lambda: global_state.snippets_text, elem_id="snippets")
            with gr.Row():
                plan = gr.List(
                    value=[[filename + ": " + instructions] for filename, instructions in global_state.plan] or [[""]],
                    headers=["Proposed Plan"],
                    interactive=True,
                    col_count=(1, "static"),
                    wrap=True,
                    visible=global_state.plan_toggle,
                )
    with gr.Row():
        with gr.Column(scale=8):
            msg = gr.Textbox(placeholder="Send a message to Sweep", label=None, elem_id="message_box")

    chunking_prompt,
)
from sweepai.utils.config.client import SweepConfig
from sweepai.utils.config.server import DB_MODAL_INST_NAME, SECONDARY_MODEL
from sweepai.utils.diff import diff_contains_dups_or_removals, format_contents, generate_diff, generate_new_file, generate_new_file_from_patch, is_markdown

USING_DIFF = True

class MaxTokensExceeded(Exception):
    def __init__(self, filename):
        self.filename = filename

class CodeGenBot(ChatGPT):
    def summarize_snippets(self, create_thoughts, modify_thoughts):
        snippet_summarization = self.chat(
            snippet_replacement.format(
                thoughts=create_thoughts + "\n" + modify_thoughts
            ),
            message_key="snippet_summarization",
        )
        # Delete excessive tokens
        self.delete_messages_from_chat("relevant_snippets")
        self.delete_messages_from_chat("relevant_directories")
        self.delete_messages_from_chat("relevant_tree")
        # Delete past instructions
        self.delete_messages_from_chat("files_to_change", delete_assistant=False)
        # Delete summarization instructions
        self.delete_messages_from_chat("snippet_summarization")
        msg = Message(content=snippet_summarization, role="assistant", key="bot_analysis_summary")
        self.messages.insert(-2, msg)

    def get_files_to_change(self, retries=2):
        file_change_requests: list[FileChangeRequest] = []
        # Todo: put retries into a constants file
        # also, this retries multiple times as the calls for this function are in a for loop
        for count in range(retries):
            try:
                logger.info(f"Generating for the {count}th time...")
                abstract_plan = self.chat(files_to_change_abstract_prompt, message_key="files_to_change")
                files_to_change_response = self.chat(files_to_change_prompt,
                                                     message_key="files_to_change")  # Dedup files to change here
                files_to_change = FilesToChange.from_string(files_to_change_response)
                create_thoughts = files_to_change.files_to_create.strip()
                modify_thoughts = files_to_change.files_to_modify.strip()
                files_to_create: list[str] = files_to_change.files_to_create.split("\n*")
                files_to_modify: list[str] = files_to_change.files_to_modify.split("\n*")
                for file_change_request, change_type in zip(
                    files_to_create + files_to_modify,
                    ["create"] * len(files_to_create)
                    + ["modify"] * len(files_to_modify),
                ):
                    file_change_request = file_change_request.strip()
                    if not file_change_request or file_change_request == "* None":
                        continue
                    logger.debug(file_change_request)
                    logger.debug(change_type)
                    file_change_requests.append(
                        FileChangeRequest.from_string(
                            file_change_request, change_type=change_type
                        )
                    )
                # Create a dictionary to hold file names and their corresponding instructions
                file_instructions_dict = {}
                for file_change_request in file_change_requests:
                    # If the file name is already in the dictionary, append the new instructions
                    if file_change_request.filename in file_instructions_dict:
                        instructions, change_type = file_instructions_dict[file_change_request.filename]
                        file_instructions_dict[file_change_request.filename] = (
                            instructions + " " + file_change_request.instructions, change_type)
                    else:
                        file_instructions_dict[file_change_request.filename] = (
                            file_change_request.instructions, file_change_request.change_type)
                file_change_requests = [
                    FileChangeRequest(filename=file_name, instructions=instructions, change_type=change_type) for
                    file_name, (instructions, change_type) in file_instructions_dict.items()]
                if file_change_requests:
                    return file_change_requests, create_thoughts, modify_thoughts
            except RegexMatchError:
                logger.warning("Failed to parse! Retrying...")
                self.delete_messages_from_chat("files_to_change")
                continue
        raise NoFilesException()

    def generate_pull_request(self, retries=5) -> PullRequest:
        for count in range(retries):
            too_long = False
            try:
                logger.info(f"Generating for the {count}th time...")
                if too_long or count == retries - 2:  # if on last try, use gpt4-32k (improved context window)
                    pr_text_response = self.chat(pull_request_prompt, message_key="pull_request")
                else:
                    pr_text_response = self.chat(pull_request_prompt, message_key="pull_request", model=SECONDARY_MODEL)
                # Add triple quotes if not present
                if not pr_text_response.strip().endswith('"""'):
                    pr_text_response += '"""'
                self.delete_messages_from_chat("pull_request")
            except Exception as e:
                e_str = str(e)
                if "too long" in e_str:
                    too_long = True
                logger.warning(f"Exception {e_str}. Failed to parse! Retrying...")
                self.delete_messages_from_chat("pull_request")
                continue
            pull_request = PullRequest.from_string(pr_text_response)
            pull_request.branch_name = "sweep/" + pull_request.branch_name[:250]
            return pull_request
        raise Exception("Could not generate PR text")

Step 2: 🧐 Snippet Analysis

From looking through the relevant snippets, I decided to make the following modifications:

sweepai/core/vector_db.py: Define a new scoring function for filenames. This function should take a filename as input and return a score based on a specific criterion or algorithm. Replace the old scoring function calls with the new one in the relevant parts of the code.
sweepai/core/sweep_bot.py: Modify the parts of the code that call the old scoring function to use the new one. Ensure that the new function is properly imported and used.
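The ticket does not show the proposed scoring function itself; as a rough sketch of what a dedicated filename scorer in vector_db.py might look like (the name `score_filename` and the heuristics below are illustrative assumptions, not Sweep's actual algorithm):

```python
def score_filename(filename: str) -> float:
    """Return a relevance score in [0, 1] for a repository file path.

    Hypothetical heuristics: penalize deep nesting, test files, and
    vendored or generated directories.
    """
    score = 1.0
    score -= 0.05 * filename.count("/")          # deeply nested files rank lower
    basename = filename.rsplit("/", 1)[-1].lower()
    if basename.startswith("test_") or "/tests/" in filename:
        score -= 0.2                             # tests are usually less relevant
    if filename.startswith(("vendor/", "node_modules/", "dist/")):
        score -= 0.5                             # vendored/generated code
    return max(score, 0.0)
```

Scores like these could then be combined with the commit-based `compute_score` output before `convert_to_percentiles` is applied in the indexing loop shown above.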

Step 3: 📝 Planning

I have created a plan for writing the pull request. I am now working on my plan and coding the required changes to address this issue. Here is the planned pull request:

Add separate scoring function for filenames in vector_db
sweep/add-filename-scoring-function

Description

This PR adds a separate scoring function for filenames in the vector_db module. The current implementation lacks a dedicated mechanism to score filenames, which can lead to suboptimal search results and inefficient operations. By introducing a new scoring function, we can improve the accuracy and efficiency of operations involving filenames.

Changes Made

  • Added a new scoring function score_filename in vector_db.py to evaluate the relevance of filenames based on a specific criterion or algorithm.
  • Replaced the old scoring function calls with the new score_filename function in relevant parts of the code.
  • Modified the sweep_bot.py file to use the new scoring function for filenames.

Testing

  • Created unit tests to verify the correctness of the new scoring function.
  • Ran existing tests to ensure that the changes did not introduce any regressions.
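The unit tests themselves are not included in the comment; as an illustration, property-style tests for a filename scorer might assert the basic invariants any such scorer should have (the `score_filename` below is a minimal hypothetical stand-in so the test runs standalone, not Sweep's implementation):

```python
def score_filename(filename: str) -> float:
    # Hypothetical stand-in scorer: penalize nesting and test files.
    score = 1.0 - 0.05 * filename.count("/")
    if filename.rsplit("/", 1)[-1].startswith("test_"):
        score -= 0.2
    return max(score, 0.0)

def test_scores_are_normalized():
    # Scores must stay within [0, 1] regardless of input.
    for name in ["a.py", "pkg/mod.py", "x/" * 40 + "deep.py"]:
        assert 0.0 <= score_filename(name) <= 1.0

def test_shallow_source_beats_nested_test():
    # A top-level source file should outrank a nested test file.
    assert score_filename("core/db.py") > score_filename("core/sub/test_db.py")

test_scores_are_normalized()
test_shallow_source_beats_nested_test()
```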

Related Issue

Add a separate scoring function for filenames in vector_db


Step 4: ⌨️ Coding

I have finished coding the issue. I am now reviewing it for completeness.


Step 5: 🔁 Code Review

Success! 🚀


I'm a bot that handles simple bugs and feature requests but I might make mistakes. Please be kind!
Join Our Discord

@wwzeng1 wwzeng1 added sweep Assigns Sweep to an issue or pull request. and removed sweep Assigns Sweep to an issue or pull request. labels Jul 25, 2023