
Optimization


Memory Profiling

Feluda uses memray for memory profiling. To profile a specific operator, install Feluda's core requirements.txt and then run:

# for a specific operator
$ python3 -m memray run -o vid_vec_rep_resnet.bin vid_vec_rep_resnet.py
# Analyze using flamegraph
$ python3 -m memray flamegraph vid_vec_rep_resnet.bin
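
The flamegraph is usually the most useful view, but memray also ships other reporters. The commands below are a sketch, assuming the same capture file produced above:

# Quick textual overview of peak memory usage and top allocating functions
$ python3 -m memray summary vid_vec_rep_resnet.bin
# High-level allocation statistics for the whole run
$ python3 -m memray stats vid_vec_rep_resnet.bin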

CPU Profiling

Feluda uses pyinstrument for CPU profiling (note the known issues with profiling code running inside Docker). To profile a specific operator, install Feluda's core requirements.txt and then run:

# for a specific operator
$ pyinstrument -r speedscope -o speedscope_vid_vec_rep_resnet.json vid_vec_rep_resnet.py
# Load the JSON file at https://www.speedscope.app/ to view the flamegraph
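
If speedscope is not convenient, pyinstrument can also render a self-contained HTML report. A sketch, assuming the same operator script as above:

# Render an interactive HTML report instead of speedscope JSON
$ pyinstrument -r html -o pyinstrument_vid_vec_rep_resnet.html vid_vec_rep_resnet.py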

cProfile

To profile code using cProfile, make the following changes in the code: in the video vector operator (vid_vec_rep_resnet.py), add an if __name__ == "__main__": block.

# import required libraries
import cProfile
import pstats
from io import StringIO

# if block at the end of the operator
if __name__ == "__main__":
    file_path = {"path": r"/path/to/video/file"}
    initialize(param=None)
    profiler = cProfile.Profile()
    profiler.enable()
    run(file_path)
    profiler.disable()
    result_stream = StringIO()
    stats = pstats.Stats(profiler, stream=result_stream).sort_stats('cumulative')
    stats.print_stats()
    print(result_stream.getvalue())

The output will be the profiling results. We can also save the output to a text file like this:

python vid_vec_rep_resnet.py > output.txt
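
As an alternative sketch that avoids modifying the operator code, cProfile can also be driven from the command line and the saved stats browsed interactively with pstats:

# Profile the whole script and dump binary stats to a file
$ python3 -m cProfile -o vid_vec_rep_resnet.prof vid_vec_rep_resnet.py
# Browse the stats interactively (e.g. type "sort cumulative", then "stats 20")
$ python3 -m pstats vid_vec_rep_resnet.prof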

Time

To find out how long the run function takes, we can also try a simple approach:

import time

# measure wall-clock time around a single run() call
start_time = time.time()
run(file_path)
end_time = time.time()
duration = end_time - start_time
print(f"run() took {duration:.2f} seconds")
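
For an end-to-end wall-clock number without touching the code, the shell's timing utilities also work. A sketch:

# Wall-clock time for the whole operator run
$ time python3 vid_vec_rep_resnet.py
# GNU time (Linux) additionally reports peak resident memory
$ /usr/bin/time -v python3 vid_vec_rep_resnet.py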

Dockerization

The Dockerfiles in src/api-server and src/indexer implement multistage builds to reduce image size (from approximately 5 GB each to 1.6 GB each). Since both the server and indexer have the same dependencies, it could be useful to push the first stage of their Docker builds to Docker Hub as a separate image, and then pull that image in the Dockerfiles.

Building the first stage and pushing it to a Docker Hub repository -

cd src/api-server
docker build --target builder -t username/repository:tag .
docker push username/repository:tag
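
To confirm the pushed builder image is usable and to check its size, it can be pulled back and listed (a quick sanity check, using the same placeholder tag as above) -

docker pull username/repository:tag
docker images username/repository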

And then replacing the following code in both the Dockerfiles -

FROM python:3.7-slim as builder

RUN apt-get update \
    && apt-get -y upgrade \
    && apt-get install -y \
        --no-install-recommends gcc build-essential \
        --no-install-recommends libgl1-mesa-glx libglib2.0-0 \
        # Vim is only for debugging in dev mode. Uncomment in production
        vim \
    && apt-get purge -y --auto-remove \
        gcc build-essential \
        libgl1-mesa-glx libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

RUN pip install --upgrade pip

COPY requirements.txt /app/requirements.txt
WORKDIR /app
RUN pip install --user -r requirements.txt

with -

FROM username/repository:tag AS builder

Note that this builder image would need to be rebuilt if there is any change in the dependencies.

Docker multi-arch builds

  1. Install docker
  2. Install docker-buildx from your package manager
  3. On every reboot, register Arm executables to run on x64 machines. The official binfmt project is now part of linuxkit.

OLD - https://github.com/docker/binfmt

CURRENT - use the latest version of the linuxkit/binfmt image from Docker Hub

# Register Arm executables to run on x64 machines
$ docker run --rm --privileged linuxkit/binfmt:68604c81876812ca1c9e2d9f098c28f463713e61-amd64

# To verify the qemu handlers are registered properly, run the following and make sure the first line of the output is “enabled”.  Note that the handler registration doesn’t survive a reboot, but could be added to the system start-up scripts.
$ cat /proc/sys/fs/binfmt_misc/qemu-aarch64
  4. Optimized PyTorch on AWS Graviton (arm64)
  • https://www.youtube.com/watch?v=c1Rl-vCmnT0 (see ~14:40 for AWS Graviton optimization)

  • https://pytorch.org/blog/optimized-pytorch-w-graviton/

  • https://aws.amazon.com/blogs/machine-learning/optimized-pytorch-2-0-inference-with-aws-graviton-processors/

  • https://github.com/aws/aws-graviton-getting-started/blob/main/machinelearning/pytorch.md

  • In Dockerfile

    # Graviton3(E) (e.g. c7g, c7gn and Hpc7g instances) supports BF16 format for ML acceleration. This can be enabled in oneDNN by setting the below environment variable
    grep -q bf16 /proc/cpuinfo && export DNNL_DEFAULT_FPMATH_MODE=BF16
    
    # Enable primitive caching to avoid the redundant primitive allocation
    # latency overhead. Please note this caching feature increases the
    # memory footprint. Tune this cache capacity to a lower value to
    # reduce the additional memory requirement.
    export LRU_CACHE_CAPACITY=1024
    
    # Enable Transparent huge page allocations from PyTorch C10 allocator
    export THP_MEM_ALLOC_ENABLE=1
    
    # Make sure the openmp threads are distributed across all the processes for multi process applications to avoid over subscription for the vcpus. For example if there is a single application process, then num_processes should be set to '1' so that all the vcpus are assigned to it with one-to-one mapping to omp threads
    
    num_vcpus=$(getconf _NPROCESSORS_ONLN)
    num_processes=<number of processes>
    export OMP_NUM_THREADS=$((1 > ($num_vcpus/$num_processes) ? 1 : ($num_vcpus/$num_processes)))
    export OMP_PROC_BIND=false
    export OMP_PLACES=cores
    
  • In requirements.txt

    This is required because pip-compile with the above settings still downloads the GPU version of PyTorch.
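
    One possible way to do this (an assumption, not verified against Feluda's pinned versions) is to point pip at PyTorch's CPU-only wheel index so the GPU build is never resolved:

    # Hypothetical requirements.txt entries; the version pin is a placeholder
    # and must match the torch version the project actually pins.
    --extra-index-url https://download.pytorch.org/whl/cpu
    torch==<pinned-version>+cpu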

  5. Build
# Build
$ sudo docker buildx build --platform linux/arm64 -t image-operator -f Dockerfile.image_vec_rep_resnet .

# Verify
$ sudo docker inspect image-operator | grep 'Architecture'
# sample output
  "Architecture": "arm64",

# Verify env vars have been set
$ sudo docker inspect image-operator --format "{{.Config.Env}}"
  6. Running multi-arch docker images on x86_64
$ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
$ sudo docker run --rm image-operator uname -m
$ sudo docker run --platform linux/arm64 -v /usr/bin/qemu-aarch64-static:/usr/bin/qemu-aarch64-static -it image-operator
  • Use the following settings for running the image via a docker compose file
<service-name>:
  image: <built-arm-image>
  platform: linux/arm64
  volumes:
    - /usr/bin/qemu-aarch64-static:/usr/bin/qemu-aarch64-static

Postgres benchmarking

Testing insert performance

$ sudo -iu postgres

# create user as superuser with db creation privilege
[postgres]$ createuser -s -d postgres_benchmark

# create a database owned by that user
[postgres]$ createdb insert_benchmark -O postgres_benchmark

[postgres]$ exit

# initialize the database with pgbench tables (scale factor 50)
$ pgbench -i -s 50 insert_benchmark -U postgres_benchmark

# run insert command
$ pgbench -c 89 -j 1 -t 1000 -P 5 -f <(echo 'INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (1, 2, 3, 4, current_timestamp)') insert_benchmark
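
For context, a read-only baseline can be run against the same database with pgbench's built-in select-only script. A sketch, using the same client settings as the insert test:

# read-only baseline using the built-in select-only script
$ pgbench -c 89 -j 1 -t 1000 -P 5 -S insert_benchmark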