Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libcdb: improve the search speed of search_by_symbol_offsets #2413

Open
wants to merge 15 commits into
base: dev
Choose a base branch
from

Conversation

the-soloist
Copy link
Contributor

@the-soloist the-soloist commented May 26, 2024

While using search_by_symbol_offsets, I found that the search speed for build_id was significantly slower compared to other hash types.

# https://github.com/Gallopsled/pwntools/blob/dev/pwnlib/libcdb.py#L26-L30
HASHES = {
    'build_id': lambda path: enhex(ELF(path, checksec=False).buildid or b''),
    'sha1': sha1filehex,
    'sha256': sha256filehex,
    'md5': md5filehex,
}

The reason for this is that ELF loads too many things. I attempted to replace it with ELFFile, which noticeably improved the speed, but it introduced redundant functionality. I couldn't think of a simple way to implement it, so I added a hash_type parameter to search_by_symbol_offsets, with a default setting of md5 to speed up search_by_symbol_offsets, and provide users with a controllable option.

I'm testing on the following code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
from elftools.elf.elffile import ELFFile
from pwn import *


context.log_level = "info"
context.local_libcdb = "/path/to/libc-database"


def _buildid(path):
    elf = ELFFile(open(path, "rb"))
    section = elf.get_section_by_name('.note.gnu.build-id')
    if section:
        return enhex(section.data()[16:])
    return b""


log.waitfor("searching build_id")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_build_id("70a4c953a01ddc232969c27031e7f948338ca137", offline_only=True, unstrip=False)
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)


log.waitfor("searching symbol offsets (build_id)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_symbol_offsets({'puts': 0xa30, 'printf': 0x8f0}, offline_only=True, unstrip=False, hash_type="build_id")
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)


log.waitfor("searching symbol offsets (md5)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_symbol_offsets({'puts': 0xa30, 'printf': 0x8f0}, offline_only=True, unstrip=False, hash_type="md5")
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)


log.success("patch libcdb.HASHES")
libcdb.HASHES["build_id"] = _buildid


log.waitfor("searching build_id (with ELFFile)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_build_id("70a4c953a01ddc232969c27031e7f948338ca137", offline_only=True, unstrip=False)
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)


log.waitfor("searching symbol offsets (build_id with ELFFile)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_symbol_offsets({'puts': 0xa30, 'printf': 0x8f0}, offline_only=True, unstrip=False, hash_type="build_id")
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)

image

and found another question #2414

@peace-maker
Copy link
Member

I think we can avoid walking the local database directory again here in the first place instead. When finding a match in the local libc-database, we know the id and thus the filename of the libc we want to return. Maybe allow the id to be searched in search_by_hash and special case it in the local_database provider.

@the-soloist
Copy link
Contributor Author

I agree that handling id separately within the providers is a good approach, it allows the use of libcdb's caching feature. However, this will cause some variable name to lose its original meaning (it's not hash type). I've tried writing some code, could you give me some suggestions?

@Arusekk
Copy link
Member

Arusekk commented Jun 15, 2024

I'm not sure I like hash_type="id" (maybe hash_type="filename" would be better?). I think the build ID should be the default, it should just be parsed quicker, maybe we can have a separate function for extracting build id (at C speed ideally), but come on, reading only the first page of a file should be quicker than reading all of it, especially on HDDs; also, build-id does not change if you strip/unstrip or move the file around. If our ELF implementation is a bottleneck, we can resort to implementing separate functionality just for turbofast build-id extraction.

@the-soloist
Copy link
Contributor Author

Sorry too late. I added a _turbofast_extract_build_id function and made some adjustments to the variable names.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
from pwn import *

context.log_level = "info"
context.local_libcdb = "/root/S3cur1ty/libc-database"


log.waitfor("searching build_id")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_build_id("6ee9454b96efa9e343f9e8105f2fa4529265ea05", offline_only=True, unstrip=False)
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)

image

pwnlib/libcdb.py Outdated Show resolved Hide resolved
CHANGELOG.md Outdated
@@ -83,6 +83,7 @@ The table below shows which release corresponds to each branch, and what date th
- [#2376][2376] Return buffered data on first EOF in tube.readline()
- [#2387][2387] Convert apport_corefile() output from bytes-like object to string
- [#2388][2388] libcdb: add `offline_only` to `search_by_symbol_offsets`
- [#2413][2413] libcdb: improve the search speed of `search_by_symbol_offsets`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebase on latest dev please and move this to the 4.15.0 changelog

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to rebase just the CHANGELOG.md. Do I need open a new PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants