libcdb: improve the search speed of `search_by_symbol_offsets` #2413

the-soloist · 2024-05-26T13:10:48Z

While using `search_by_symbol_offsets`, I found that searching by `build_id` was significantly slower than searching by the other hash types.

# https://github.com/Gallopsled/pwntools/blob/dev/pwnlib/libcdb.py#L26-L30
HASHES = {
    'build_id': lambda path: enhex(ELF(path, checksec=False).buildid or b''),
    'sha1': sha1filehex,
    'sha256': sha256filehex,
    'md5': md5filehex,
}

The cause is that `ELF` loads far more than is needed to read the build ID. I tried replacing it with `ELFFile`, which noticeably improved the speed, but it duplicates functionality that `ELF` already provides. I couldn't think of a simple way to implement this cleanly, so I added a `hash_type` parameter to `search_by_symbol_offsets`. It defaults to `md5`, which speeds up the search, and it lets users choose another hash type.

I tested with the following script:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
from elftools.elf.elffile import ELFFile
from pwn import *


context.log_level = "info"
context.local_libcdb = "/path/to/libc-database"


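def _buildid(path):
    # Parse only the ELF structure with pyelftools instead of loading a full pwnlib ELF.
    with open(path, "rb") as f:
        elf = ELFFile(f)
        section = elf.get_section_by_name('.note.gnu.build-id')
        if section:
            # Skip the 16-byte note header (namesz, descsz, type, "GNU\0").
            return enhex(section.data()[16:])
    # Return an empty str (not bytes) so the type matches enhex() above.
    return ""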


log.waitfor("searching build_id")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_build_id("70a4c953a01ddc232969c27031e7f948338ca137", offline_only=True, unstrip=False)
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)


log.waitfor("searching symbol offsets (build_id)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_symbol_offsets({'puts': 0xa30, 'printf': 0x8f0}, offline_only=True, unstrip=False, hash_type="build_id")
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)


log.waitfor("searching symbol offsets (md5)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_symbol_offsets({'puts': 0xa30, 'printf': 0x8f0}, offline_only=True, unstrip=False, hash_type="md5")
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)


log.success("patch libcdb.HASHES")
libcdb.HASHES["build_id"] = _buildid


log.waitfor("searching build_id (with ELFFile)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_build_id("70a4c953a01ddc232969c27031e7f948338ca137", offline_only=True, unstrip=False)
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)


log.waitfor("searching symbol offsets (build_id with ELFFile)")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_symbol_offsets({'puts': 0xa30, 'printf': 0x8f0}, offline_only=True, unstrip=False, hash_type="build_id")
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)

While testing, I also ran into another problem: #2414

peace-maker · 2024-06-03T15:54:08Z

I think we can instead avoid walking the local database directory a second time in the first place. When we find a match in the local libc-database, we already know the id, and thus the filename, of the libc we want to return. Maybe allow the id to be searched in `search_by_hash` and special-case it in the local_database provider.
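To illustrate the idea of returning the id directly from the symbol match, here is a minimal sketch. It is not the code from this PR. It assumes the local libc-database stores one `<id>.symbols` file per libc next to `<id>.so`, with one `name hex_address` pair per line, and that offsets are compared on their low 12 bits. The function name `find_local_libc` is hypothetical.

```python
import os


def find_local_libc(db_dir, symbols):
    """Yield (libs_id, so_path) for every libc in db_dir matching all symbols.

    Assumed layout (not confirmed by this PR): db_dir/<id>.symbols holds
    "name hex_address" lines and db_dir/<id>.so is the library itself.
    """
    if not symbols:
        return

    for filename in sorted(os.listdir(db_dir)):
        if not filename.endswith(".symbols"):
            continue
        libs_id = filename[:-len(".symbols")]

        # Build a name -> address table for this libc.
        table = {}
        with open(os.path.join(db_dir, filename)) as f:
            for line in f:
                parts = line.split()
                if len(parts) != 2:
                    continue
                try:
                    table[parts[0]] = int(parts[1], 16)
                except ValueError:
                    continue

        # Compare only the page offset (low 12 bits) of each address.
        matches = all(
            name in table and (table[name] & 0xfff) == (offset & 0xfff)
            for name, offset in symbols.items()
        )
        so_path = os.path.join(db_dir, libs_id + ".so")
        if matches and os.path.exists(so_path):
            # The id already names the file, so no second directory walk
            # and no hashing of every .so is needed.
            yield libs_id, so_path
```

Because the match already yields the id, a caller could build the path to the `.so` directly instead of hashing every file in the database to locate it again.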

the-soloist · 2024-06-07T03:07:45Z

I agree that handling the id separately within the providers is a good approach, since it allows the use of libcdb's caching feature. However, it will cause some variable names to lose their original meaning (the id is not a hash type). I've tried writing some code; could you give me some suggestions?

Arusekk · 2024-06-15T15:35:34Z

I'm not sure I like `hash_type="id"` (maybe `hash_type="filename"` would be better?). I think the build ID should be the default; it should just be parsed more quickly. Maybe we can have a separate function for extracting the build ID (ideally at C speed). But come on, reading only the first page of a file should be quicker than reading all of it, especially on HDDs. Also, the build ID does not change if you strip/unstrip the file or move it around. If our ELF implementation is the bottleneck, we can resort to implementing separate functionality just for turbofast build-ID extraction.
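A rough sketch of what reading only the first page could look like is shown below. It is not the implementation that was later added in this PR. It assumes the program headers and the `PT_NOTE` segment holding the GNU build-ID note all fit in the first 4096 bytes, and that notes use 4-byte alignment. When either assumption fails, it returns an empty string and a caller would need to fall back to a full parse. The names `extract_build_id` and `_scan_notes` are hypothetical.

```python
import binascii
import struct

PT_NOTE = 4
NT_GNU_BUILD_ID = 3


def _scan_notes(notes, end):
    """Walk a PT_NOTE segment and return the hex build ID, or ''."""
    pos = 0
    while pos + 12 <= len(notes):
        namesz, descsz, ntype = struct.unpack_from(end + "III", notes, pos)
        name_start = pos + 12
        desc_start = name_start + ((namesz + 3) & ~3)  # name padded to 4 bytes
        desc_end = desc_start + descsz
        if desc_end > len(notes):
            break
        name = notes[name_start:name_start + namesz]
        if ntype == NT_GNU_BUILD_ID and name == b"GNU\x00":
            return binascii.hexlify(notes[desc_start:desc_end]).decode()
        pos = desc_start + ((descsz + 3) & ~3)  # desc padded to 4 bytes
    return ""


def extract_build_id(path, max_read=0x1000):
    """Return the hex GNU build ID found in the first max_read bytes, or ''."""
    with open(path, "rb") as f:
        data = f.read(max_read)
    if len(data) < 0x34 or data[:4] != b"\x7fELF":
        return ""

    is_64 = data[4:5] == b"\x02"                # EI_CLASS
    end = "<" if data[5:6] == b"\x01" else ">"  # EI_DATA
    if is_64:
        if len(data) < 0x40:
            return ""
        e_phoff, = struct.unpack_from(end + "Q", data, 0x20)
        e_phentsize, e_phnum = struct.unpack_from(end + "HH", data, 0x36)
        phdr_size, off_field, size_field, fmt = 56, 8, 32, "Q"
    else:
        e_phoff, = struct.unpack_from(end + "I", data, 0x1c)
        e_phentsize, e_phnum = struct.unpack_from(end + "HH", data, 0x2a)
        phdr_size, off_field, size_field, fmt = 32, 4, 16, "I"

    for i in range(e_phnum):
        off = e_phoff + i * e_phentsize
        if off + phdr_size > len(data):
            break  # program header lies outside the bytes we read
        p_type, = struct.unpack_from(end + "I", data, off)
        if p_type != PT_NOTE:
            continue
        p_offset, = struct.unpack_from(end + fmt, data, off + off_field)
        p_filesz, = struct.unpack_from(end + fmt, data, off + size_field)
        build_id = _scan_notes(data[p_offset:p_offset + p_filesz], end)
        if build_id:
            return build_id
    return ""
```

This approach uses program headers rather than section headers because section headers usually sit at the end of the file, which would defeat the point of reading only the first page.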

the-soloist · 2024-08-15T14:00:25Z

Sorry for the late reply. I added a `_turbofast_extract_build_id` function and made some adjustments to the variable names.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
from pwn import *

context.log_level = "info"
context.local_libcdb = "/root/S3cur1ty/libc-database"


log.waitfor("searching build_id")
os.system("rm -rf ~/.cache/.pwntools-cache-*")
time_start = time.time()
path = libcdb.search_by_build_id("6ee9454b96efa9e343f9e8105f2fa4529265ea05", offline_only=True, unstrip=False)
libc = ELF(path, checksec=False)
print(f"cost {time.time() - time_start}s", libc)

peace-maker · 2024-08-15T15:49:41Z

CHANGELOG.md

@@ -83,6 +83,7 @@ The table below shows which release corresponds to each branch, and what date th
 - [#2376][2376] Return buffered data on first EOF in tube.readline()
 - [#2387][2387] Convert apport_corefile() output from bytes-like object to string
 - [#2388][2388] libcdb: add `offline_only` to `search_by_symbol_offsets`
+- [#2413][2413] libcdb: improve the search speed of `search_by_symbol_offsets`


Please rebase on the latest dev and move this entry to the 4.15.0 changelog.

the-soloist · 2024-08-16

I don't know how to rebase just CHANGELOG.md. Do I need to open a new PR?

the-soloist added 3 commits May 26, 2024 20:53

Add hash_type for search_by_symbol_offsets

f40a84e

Add docs

c9efa2d

Update CHANGELOG

62cdf0e

Arusekk approved these changes Jun 3, 2024

the-soloist added 2 commits June 7, 2024 11:00

Allow search id in search_by_hash

f7db717

Fix py2.7 test

cf52dc9

the-soloist added 4 commits August 15, 2024 17:54

Rename hash_type to search_type

3984b33

Rename TYPES['id'] to TYPES['libs_id']

34bbcc8

Rename part hex_encoded_id to search_target

541a831

Turbofast extract build id

0f7d4ae

Fix docs

788d82d

the-soloist force-pushed the dev branch from b8c412c to 788d82d Compare August 15, 2024 14:19

peace-maker reviewed Aug 15, 2024

pwnlib/libcdb.py (resolved)

peace-maker reviewed Aug 15, 2024

pwnlib/libcdb.py (outdated, resolved)

peace-maker reviewed Aug 15, 2024

the-soloist and others added 4 commits August 16, 2024 17:10

Add a map for types key

335b3b9

Extract proper buildid

13f2753

Fix docs

afe9bf4

Merge branch 'dev' into dev

1b6b2f6

the-soloist force-pushed the dev branch from 070fd13 to 1b6b2f6 Compare September 19, 2024 03:15

Fix E0606

263f67e
