
Optimization


Memory Profiling

Feluda uses memray for memory profiling. To profile a specific operator, install Feluda's core requirements.txt and then run:

# for a specific operator
$ python3 -m memray run -o vid_vec_rep_resnet.bin vid_vec_rep_resnet.py
# Analyze using flamegraph
$ python3 -m memray flamegraph vid_vec_rep_resnet.bin
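
The flamegraph is usually the most useful view, but memray also ships other reporters. The commands below are a sketch, assuming the same capture file produced above:

# Quick textual overview of peak memory usage and top allocating functions
$ python3 -m memray summary vid_vec_rep_resnet.bin
# High-level allocation statistics for the whole run
$ python3 -m memray stats vid_vec_rep_resnet.bin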

CPU Profiling

Feluda uses pyinstrument for CPU profiling (note the known issues with profiling code running inside Docker). To profile a specific operator, install Feluda's core requirements.txt and then run:

# for a specific operator
$ pyinstrument -r speedscope -o speedscope_vid_vec_rep_resnet.json vid_vec_rep_resnet.py
# Load the JSON file at https://www.speedscope.app/ to view the flamegraph
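
If speedscope is not convenient, pyinstrument can also render a self-contained HTML report. A sketch, assuming the same operator script as above:

# Render an interactive HTML report instead of speedscope JSON
$ pyinstrument -r html -o pyinstrument_vid_vec_rep_resnet.html vid_vec_rep_resnet.py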

cProfile

To profile code using cProfile, make the following changes in the code: in the video vector operator (vid_vec_rep_resnet.py), add an if __name__ == "__main__": block.

# import required libraries
import cProfile
import pstats
from io import StringIO

# if block at the end of the operator
if __name__ == "__main__":
    file_path = {"path": r"/path/to/video/file"}
    initialize(param=None)
    profiler = cProfile.Profile()
    profiler.enable()
    run(file_path)
    profiler.disable()
    result_stream = StringIO()
    stats = pstats.Stats(profiler, stream=result_stream).sort_stats('cumulative')
    stats.print_stats()
    print(result_stream.getvalue())

The output will be the profiling results. We can also save the output to a text file like this:

python vid_vec_rep_resnet.py > output.txt
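
As an alternative sketch that avoids modifying the operator code, cProfile can also be driven from the command line and the saved stats browsed interactively with pstats:

# Profile the whole script and dump binary stats to a file
$ python3 -m cProfile -o vid_vec_rep_resnet.prof vid_vec_rep_resnet.py
# Browse the stats interactively (e.g. type "sort cumulative", then "stats 20")
$ python3 -m pstats vid_vec_rep_resnet.prof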

Time

To find out how long the run function takes, we can also try a simple approach:

import time

# measure wall-clock time around a single run() call
start_time = time.time()
run(file_path)
end_time = time.time()
duration = end_time - start_time
print(f"run() took {duration:.2f} seconds")
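
For an end-to-end wall-clock number without touching the code, the shell's timing utilities also work. A sketch:

# Wall-clock time for the whole operator run
$ time python3 vid_vec_rep_resnet.py
# GNU time (Linux) additionally reports peak resident memory
$ /usr/bin/time -v python3 vid_vec_rep_resnet.py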

Dockerization

The Dockerfiles in src/api-server and src/indexer implement multistage builds to reduce image size (from approximately 5 GB each to 1.6 GB each). Since both the server and indexer have the same dependencies, it could be useful to push the first stage of their Docker builds to Docker Hub as a separate image, and then pull that image in the Dockerfiles.

Building the first stage and pushing it to a Docker Hub repository -

cd src/api-server
docker build --target builder -t username/repository:tag .
docker push username/repository:tag
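
To confirm the pushed builder image is usable and to check its size, it can be pulled back and listed (a quick sanity check, using the same placeholder tag as above) -

docker pull username/repository:tag
docker images username/repository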

And then replacing the following code in both the Dockerfiles -

FROM python:3.7-slim as builder

RUN apt-get update \
    && apt-get -y upgrade \
    && apt-get install -y \
        --no-install-recommends gcc build-essential \
        --no-install-recommends libgl1-mesa-glx libglib2.0-0 \
        # Vim is only for debugging in dev mode. Uncomment in production
        vim \
    && apt-get purge -y --auto-remove \
        gcc build-essential \
        libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip

COPY requirements.txt /app/requirements.txt
WORKDIR /app
RUN pip install --user -r requirements.txt

with -

FROM username/repository:tag AS builder

Note that this builder image would need to be rebuilt if there is any change in the dependencies.

Docker multi-arch builds

  1. Install docker
  2. Install docker-buildx from your package manager
  3. On every reboot, register Arm executables to run on x64 machines. The official binfmt project is now part of linuxkit.

OLD - https://github.com/docker/binfmt

CURRENT - use the latest version of the linuxkit/binfmt image from Docker Hub

# Register Arm executables to run on x64 machines
$ docker run --rm --privileged linuxkit/binfmt:68604c81876812ca1c9e2d9f098c28f463713e61-amd64

# To verify the qemu handlers are registered properly, run the following and make sure the first line of the output is “enabled”.  Note that the handler registration doesn’t survive a reboot, but could be added to the system start-up scripts.
$ cat /proc/sys/fs/binfmt_misc/qemu-aarch64
  4. Optimized PyTorch on AWS Graviton (arm64)
  • https://www.youtube.com/watch?v=c1Rl-vCmnT0 (see ~14:40 for AWS Graviton optimization)

  • https://pytorch.org/blog/optimized-pytorch-w-graviton/

  • https://aws.amazon.com/blogs/machine-learning/optimized-pytorch-2-0-inference-with-aws-graviton-processors/

  • https://github.com/aws/aws-graviton-getting-started/blob/main/machinelearning/pytorch.md

  • In Dockerfile

    # Graviton3(E) (e.g. c7g, c7gn and Hpc7g instances) supports BF16 format for ML acceleration. This can be enabled in oneDNN by setting the below environment variable
    grep -q bf16 /proc/cpuinfo && export DNNL_DEFAULT_FPMATH_MODE=BF16
    
    # Enable primitive caching to avoid the redundant primitive allocation
    # latency overhead. Please note this caching feature increases the
    # memory footprint. Tune this cache capacity to a lower value to
    # reduce the additional memory requirement.
    export LRU_CACHE_CAPACITY=1024
    
    # Enable Transparent huge page allocations from PyTorch C10 allocator
    export THP_MEM_ALLOC_ENABLE=1
    
    # Make sure the openmp threads are distributed across all the processes for multi process applications to avoid over subscription for the vcpus. For example if there is a single application process, then num_processes should be set to '1' so that all the vcpus are assigned to it with one-to-one mapping to omp threads
    
    num_vcpus=$(getconf _NPROCESSORS_ONLN)
    num_processes=<number of processes>
    export OMP_NUM_THREADS=$((1 > ($num_vcpus/$num_processes) ? 1 : ($num_vcpus/$num_processes)))
    export OMP_PROC_BIND=false
    export OMP_PLACES=cores
    
  • In requirements.txt

    This is required because pip-compile with the above settings still downloads the GPU version of PyTorch.
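
    One possible way to do this (an assumption, not verified against Feluda's pinned versions) is to point pip at PyTorch's CPU-only wheel index so the GPU build is never resolved:

    # Hypothetical requirements.txt entries; the version pin is a placeholder
    # and must match the torch version the project actually pins.
    --extra-index-url https://download.pytorch.org/whl/cpu
    torch==<pinned-version>+cpu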

  5. Build
# Build
$ sudo docker buildx build --platform linux/arm64 -t image-operator -f Dockerfile.image_vec_rep_resnet .

# Verify
$ sudo docker inspect image-operator | grep 'Architecture'
# sample output
  "Architecture": "arm64",

# Verify env vars have been set
$ sudo docker inspect image-operator --format "{{.Config.Env}}"
  6. Running multi-arch docker images on x86_64
$ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
$ sudo docker run --rm image-operator uname -m
$ sudo docker run --platform linux/arm64 -v /usr/bin/qemu-aarch64-static:/usr/bin/qemu-aarch64-static -it image-operator
  • Use the following settings for running the image via a docker compose file
<service-name>:
  image: <built-arm-image>
  platform: linux/arm64
  volumes:
    - /usr/bin/qemu-aarch64-static:/usr/bin/qemu-aarch64-static

Postgres benchmarking

Testing insert performance

$ sudo -iu postgres

# create user as superuser with db creation privilege
[postgres]$ createuser -s -d postgres_benchmark

# create a database owned by that user
[postgres]$ createdb insert_benchmark -O postgres_benchmark

[postgres]$ exit

# initialize the database with pgbench tables (scale factor 50)
$ pgbench -i -s 50 insert_benchmark -U postgres_benchmark

# run insert command
$ pgbench -c 89 -j 1 -t 1000 -P 5 -f <(echo 'INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (1, 2, 3, 4, current_timestamp)') insert_benchmark
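
For context, a read-only baseline can be run against the same database with pgbench's built-in select-only script. A sketch, using the same client settings as the insert test:

# read-only baseline using the built-in select-only script
$ pgbench -c 89 -j 1 -t 1000 -P 5 -S insert_benchmark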