VM Size reduction tips for OVA distribution

In my many years of exploring Linux systems and building Android Tamer, OVA export has always been a fun challenge. Everyone distributes VMs in OVA format because it is easy to import into both VMware and VirtualBox. Over the years I have relied on various small tweaks that eventually add up to a large difference in the final OVA size. I think it’s time I documented the entire set here:

A. Localization filters 

*nix machines support multiple locales and ship with a built-in translation system and its translations. In a VM, I prefer to clear out every language other than English. This reduces disk usage and lets me focus on just one language. How do you achieve it?

Note: perform this step as early in the system’s lifespan as possible, so that less cleanup is needed later.

sudo apt-get update
sudo apt-get install localepurge

Once installed, select en_US, en_GB, or whichever one or two locales you want to keep (the selection dialog normally appears during installation and can be reopened with sudo dpkg-reconfigure localepurge), then run sudo localepurge to remove the rest. From then on, new installations keep only the selected locales, which is why it is essential to install this tool as early as possible.
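Before purging, it can help to preview which locale directories are candidates for removal. The sketch below is only an illustration: the function name list_non_english_locales is hypothetical, and /usr/share/locale is assumed to be where translations live, which is the usual location on Debian-style systems. It only lists directory names and deletes nothing.

```shell
# Print locale directory names under DIR that do not belong to English.
# DIR defaults to /usr/share/locale (an assumption; adjust for your distro).
list_non_english_locales() {
  dir="${1:-/usr/share/locale}"
  for d in "$dir"/*/; do
    [ -d "$d" ] || continue          # skip when the glob matched nothing
    name=$(basename "$d")
    case "$name" in
      en|en_*|en@*) ;;               # keep English variants
      *) printf '%s\n' "$name" ;;    # everything else is a purge candidate
    esac
  done
}
```

Running list_non_english_locales | wc -l gives a rough count of the locales that localepurge would be asked to drop.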

B. Package Cache

This is a straightforward decision, but many OVAs I have come across still contain this cache. When you install software, the package manager downloads a local copy of the .deb, .apk, or .rpm file and then performs the installation. These cached files take up extra space. Steps to clean that up:

a. For Debian Systems

I. Clear the apt cache

sudo apt-get clean

II. Clear full package list:

Note: Only clean this if in dire need of space. This will save about 100 MB, but apt has to rebuild these files before it can be used again; running sudo apt-get update takes care of that.

sudo rm -rf /var/cache/apt/pkgcache.bin
sudo rm -rf /var/cache/apt/srcpkgcache.bin

b. For RedHat based Systems

sudo yum clean packages
sudo yum clean headers
sudo yum clean metadata

or

sudo yum clean all

c. For DNF based systems

sudo dnf clean all

d. For Alpine system

If the Alpine cache is set up, we can control it via the apk cache commands; however, the cleanup process for Alpine systems includes:

sudo rm -rf /var/cache/apk/*

Alternatively, you can install packages with the --no-cache option.

sudo apk add --no-cache <package>

This command installs the package without creating a cache for it.
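The per-distro commands above can be tied together in a small helper. This is a minimal sketch, not a tested build script: the function names detect_pkg_manager and clean_cmd_for are hypothetical, and the mapping simply repeats the commands listed in this section.

```shell
# Print the first package manager found on this system, if any.
detect_pkg_manager() {
  for pm in apt-get dnf yum apk; do
    if command -v "$pm" >/dev/null 2>&1; then
      printf '%s\n' "$pm"
      return 0
    fi
  done
  return 1
}

# Print the cache cleanup command for a given package manager name.
clean_cmd_for() {
  case "$1" in
    apt-get) echo "apt-get clean" ;;
    dnf)     echo "dnf clean all" ;;
    yum)     echo "yum clean all" ;;
    apk)     echo "rm -rf /var/cache/apk/*" ;;
    *)       return 1 ;;             # unknown package manager
  esac
}

# Example (commented out so nothing is deleted by accident):
# sudo sh -c "$(clean_cmd_for "$(detect_pkg_manager)")"
```

Printing the command instead of running it keeps the helper safe to source; the final sudo line is left for you to run once you are happy with what it prints.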

C. Language specific package cache

In a perfect world, the above step would have been sufficient. Alas, we don’t live in a perfect world; we live in a world full of oddities. One such oddity is that distros can’t keep up with the pace of development, so the most recent versions of language modules are often available only via language-specific package managers such as gem, pip, npm, and more.

a. Ruby Gems

Ruby gems come with built-in documentation (rdoc and ri). This documentation makes sense for developers but is of little use to end users, especially inside a VM. We can disable its installation either globally or per command.

For Ruby before 2.6, we use the following command:

gem install --no-rdoc --no-ri <gem_name>

From Ruby 2.6 onwards, this has changed to:

gem install --no-document <gem_name>

If you want to disable this at a global level, place the following in your ~/.gemrc file.

For Ruby 2.6 onwards:

gem: --no-document

For older versions (2.5 or below):

gem: --no-rdoc --no-ri

This prevents installation of documentation for every gem installed. A couple of pointers to keep in mind:

I. The bundler command doesn’t install rdoc and ri by default.

II. If you ever need a local copy of a gem’s documentation, you can generate it with the following command:

gem rdoc <gem_name>

or if you want to get the full set of rdoc for all gems:

gem rdoc --all --overwrite
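The global ~/.gemrc setting described above can be applied from a provisioning script. The following is a sketch under assumptions: the function name ensure_gemrc_no_document is hypothetical, and it writes the Ruby 2.6+ form of the setting. It leaves the file alone if a gem: line already exists, so it is safe to run more than once.

```shell
# Append "gem: --no-document" to a gemrc file unless a "gem:" line exists.
# FILE defaults to ~/.gemrc.
ensure_gemrc_no_document() {
  file="${1:-$HOME/.gemrc}"
  if [ -f "$file" ] && grep -q '^gem:' "$file"; then
    return 0                         # leave an existing gem: setting alone
  fi
  printf 'gem: --no-document\n' >> "$file"
}
```

For Ruby 2.5 or below, the printf line would need to write gem: --no-rdoc --no-ri instead.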

b. Python Packages

Pip maintains its own cache to avoid duplicate HTTP requests. You can disable it using --no-cache-dir:

pip install --no-cache-dir <package_name>

c. Most *nix systems need Python and Ruby packages, but people might use additional package managers, and each may have its own cache cleanup command. People have covered such cleanup commands before. I list some of them here:

  • NPM Packages
npm cache clean --force
  • Yarn Package Manager
yarn cache clean
  • PHP Composer
composer clear-cache
  • Go
go clean -cache -modcache

Note: These are cache cleanup commands, i.e. they are only useful once a cache has already built up. Also, I have not tested these commands, so use them with caution.
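To decide which of these caches is worth cleaning, it helps to see how large each one is first. The sketch below assumes common default cache locations (~/.cache/pip, ~/.npm, ~/.cache/yarn, ~/.cache/composer, ~/.cache/go-build, ~/go/pkg/mod); these paths are assumptions and each tool can be configured to store its cache elsewhere. The function name report_cache_sizes is hypothetical.

```shell
# Print the size (in KB) of well-known per-user cache directories.
# HOME_DIR defaults to $HOME; directories that do not exist are skipped.
report_cache_sizes() {
  home_dir="${1:-$HOME}"
  for rel in .cache/pip .npm .cache/yarn .cache/composer .cache/go-build go/pkg/mod; do
    if [ -d "$home_dir/$rel" ]; then
      size_kb=$(du -sk "$home_dir/$rel" | cut -f1)
      printf '%s\t%s KB\n' "$rel" "$size_kb"
    fi
  done
}
```

Calling report_cache_sizes with no argument inspects the current user's home directory; pass another home directory (for example /root) to inspect a different account.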

D. Optimize package setup

This brings us to the next option, which is to optimize your package setup. Some straightforward choices are:

a. If possible, stick to a single desktop environment and try to pick a lightweight option (RAM- and disk-wise). My own ordering, from heaviest to lightest, has been:

KDE > GNOME > MATE > XFCE > LXDE > Openbox > i3

This can vary, but you need to pick one and stick to it. Mixing GTK and Qt packages pulls in a large number of dependencies from both sides.

b. Remove any unnecessary packages: ensure we install only the required packages. A user can install a music player or a game if they need it. When distributing a VM, keep its primary aim in mind. During cleanup, once we remove the main package, remove the stale dependencies with:

sudo apt-get autoremove

c. Remove unneeded kernel versions: For our use case, a single kernel version should be fine unless you specifically need two different versions. During upgrades, multiple kernel packages can get installed, and autoremove can clean up some of them. However, autoremove will not remove your running kernel, so if you upgraded packages and didn’t reboot, autoremove will leave two kernels behind. Removing a single kernel package can free about 200 MB or more of disk space.
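To spot leftover kernels on a Debian-style system, the installed kernel image packages can be compared against the running kernel. This is a rough sketch: the function name removable_kernel_packages is hypothetical, and the pattern linux-image-[0-9]* is an assumption about how versioned kernel packages are named on your distro. It only prints candidates and removes nothing.

```shell
# Read kernel image package names on stdin and print those that do not
# contain the running kernel version given as the first argument.
removable_kernel_packages() {
  running="$1"
  grep -v -F -- "$running" || true   # grep exits 1 when nothing is left; not an error here
}

# Example (Debian/Ubuntu):
# dpkg-query -W -f='${Package}\n' 'linux-image-[0-9]*' \
#   | removable_kernel_packages "$(uname -r)"
```

Review the printed list before purging anything with apt-get: if the running kernel is not the newest installed one, the list will include the newer kernel, which you most likely want to keep.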

E. Docker cleanup

A lot of VMs now come with Docker installed and Docker containers in them. We need to clear out the Docker caches and remove any image that is not required. This is achieved with the docker system prune command:

docker system prune -a --volumes

This will clear the following:

  • all stopped containers
  • all networks not used by at least one container
  • all volumes not used by at least one container
  • all images without at least one container associated with them
  • all build cache

Once cleanup is performed, pull any missing image again via:

docker pull <image>
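Since the prune removes every image that has no container, it can be handy to record the image names beforehand so the required ones can be pulled again. The sketch below is one possible way to do it: the function name pullable_images and the file name images.txt are hypothetical, and the docker commands are shown only as comments.

```shell
# Read "repository:tag" lines on stdin, drop untagged (<none>) entries,
# and print a sorted list without duplicates.
pullable_images() {
  grep -v '<none>' | sort -u
}

# Example workflow (commented out; requires Docker):
# docker images --format '{{.Repository}}:{{.Tag}}' | pullable_images > images.txt
# docker system prune -a --volumes
# xargs -n 1 docker pull < images.txt
```

Edit images.txt before the final step so that only the images the VM really needs come back.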

F. Version Control repositories

Many times we end up using projects from git-based repositories. Some space-saving tips for them:

  1. Avoid git clone if possible; download the zip file and unzip the content instead.
  2. The git details are stored in the .git folder. Unless it is required, remove the .git folder; this can save a lot of space, as it is often many times bigger than the code itself.
  3. If git clone is unavoidable, use the --depth parameter:
git clone --depth=1 <URL>

This ensures only the latest commit is fetched and not the entire repository history.
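For projects that are already cloned, the .git folders can be located and removed in bulk. This is a minimal sketch with hypothetical function names (list_git_dirs, strip_git_dirs); it assumes you no longer need any history, since removing .git is irreversible.

```shell
# Print every .git directory found under ROOT (default: current directory).
list_git_dirs() {
  find "${1:-.}" -type d -name .git -prune -print
}

# Delete every .git directory found under ROOT. Run list_git_dirs first.
strip_git_dirs() {
  find "${1:-.}" -type d -name .git -prune -exec rm -rf {} +
}
```

Running list_git_dirs /opt first shows what would go; strip_git_dirs /opt then removes those folders while leaving the checked-out code in place.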

G. Vagrant specific tip

For anyone building VMs, I highly recommend leveraging Vagrant and automated build tools like Ansible, Chef, or Puppet. I have done the same for AndroidTamer. If you take my advice and use Vagrant for your VM building exercise, I highly recommend an ancient and unmaintained plugin, vagrant-cachier. I have looked everywhere and have found nothing else similar to this plugin. The beauty of this plugin is in the caching setup it does. When you run a VM with vagrant-cachier configured, you create a box-level cache of all the packages and other buckets.

H. Slack space cleanup

This is the funniest thing you need to do to reclaim storage space. Let’s first understand what slack space is, and then talk about how it helps. And for the uninitiated: no, it’s not a new service by slack.com.

a. In Unix everything is a file.

b. Folder is a file containing the list of files inside the folder.

c. Every file has a record (inode) pointing to its disk location. When a file is deleted, by default its content is not erased; rather, the inode and its blocks are marked as available to the file system.

d. The file system may choose to overwrite that space or leave it as is.

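e. In forensics, this leftover data is exactly what is leveraged to recover deleted files. (Strictly speaking, forensic tooling calls this unallocated space and reserves “slack space” for the unused tail of a file’s last block; this post uses “slack space” loosely for both.)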

For our use case: all the cache cleanup and other activities we performed have left a large amount of slack space on the disk. When VirtualBox or VMware is tasked with creating an OVA from this virtual machine, it can’t differentiate between slack space and actual files, and hence copies all the data into the OVA. Even though compression is used, the leftover data looks random and compresses poorly, which results in a net increase in the size of the OVA. However, if we overwrite the slack space so that it contains only zeros, that entire space compresses to almost nothing, reducing the overall OVA size.
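To see why zero-filled space matters, here is a small experiment that is safe to run anywhere. It compresses 1 MiB of zeros and 1 MiB of random bytes with gzip and prints the resulting sizes. The random bytes merely stand in for leftover data from deleted files, and gzip stands in for whatever compression the export uses; both are assumptions made for illustration.

```shell
# Compress 1 MiB of zeros and 1 MiB of random bytes and compare the results.
zero_size=$(head -c 1048576 /dev/zero | gzip -c | wc -c | tr -d ' ')
random_size=$(head -c 1048576 /dev/urandom | gzip -c | wc -c | tr -d ' ')

echo "zeros:  $zero_size bytes after gzip"
echo "random: $random_size bytes after gzip"
```

The zeros shrink to a tiny fraction of their original size, while the random data stays at roughly 1 MiB. The same effect applies to the free space inside the virtual disk.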

We will achieve this with the following steps.

I. We create a file containing only zeros and let it fill all the free space.

sudo dd if=/dev/zero of=zerofill bs=512K

This file fills up all the free space, forcing the system to overwrite it with zeros. dd is expected to stop with a “No space left on device” error once the disk is full.

II. We then remove the file

rm -rf zerofill

III. Now we shut down the VM. I prefer cloning the VM and then taking an OVA export of the cloned VM. However, your mileage may vary, and you might get better results by exporting the OVA directly.

Hope these help in creating a slimmer OVA and in reducing the bandwidth and storage needed to download those files.
