In my many years of exploring Linux systems and building Android Tamer, OVA export has always been a fun challenge. Everyone distributes VMs in OVA format because it is easy to import into both VMware and VirtualBox. Over time I have relied on various small tweaks that together make a large difference in the final OVA size. I think it’s time I documented the entire set here:
A. Localization filters
*nix machines support multiple locales and come with a built-in translation system and translations. In a VM, I prefer to clear out every language other than English. This saves space and lets me focus on just one language. How do you achieve it?
Note: we should perform this step as early in the system’s lifespan as possible. This would ensure it requires less cleanup.
sudo apt-get update
sudo apt-get install localepurge
Once installed, we can execute it by running sudo localepurge and selecting en_US, en_GB, or whichever one or two languages you want to keep. From then on, new installations keep only the selected locales, which is why it is essential to install this software as early as possible.
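To get a feel for how much there is to remove before running localepurge, a minimal sketch like the following lists the translation directories that fall outside the language you keep. The helper name list_foreign_locales and the default path /usr/share/locale are assumptions made for illustration; this is not how localepurge itself decides what to delete.

```shell
#!/bin/sh
# Hypothetical helper (not part of localepurge): list locale directories
# under a base directory whose names do not belong to the kept language.
list_foreign_locales() {
    base="${1:-/usr/share/locale}"   # assumed default location of translations
    keep="${2:-en}"                  # language prefix to keep, e.g. "en"
    for dir in "$base"/*/; do
        [ -d "$dir" ] || continue    # skip when the glob matched nothing
        name="$(basename "$dir")"
        case "$name" in
            "$keep"|"$keep"_*|"$keep"@*) ;;     # kept language, print nothing
            *) printf '%s\n' "$name" ;;
        esac
    done
}
```

Calling list_foreign_locales with no arguments prints one directory name per line, so piping it to wc -l gives a rough count of the locales that would go away.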
B. Package Cache
This is a straightforward decision, but many OVAs I have found still contain this cache. When you install software, the package manager downloads a local copy of your .deb, .apk, or .rpm file and then performs the installation. These cached files take up extra space. Steps to clean that up:
a. For Debian Systems
I. Clear the apt cache
sudo apt-get clean
II. Clear full package list:
Note: Only clean this if in dire need of space. This will save about 100 MB of space but will require the user to run sudo apt-get update before they can do anything with apt.
sudo rm -rf /var/cache/apt/pkgcache.bin
sudo rm -rf /var/cache/apt/srcpkgcache.bin
b. For RedHat based Systems
sudo yum clean packages
sudo yum clean headers
sudo yum clean metadata
sudo yum clean all
c. For DNF based systems
sudo dnf clean all
d. For Alpine systems
If the Alpine cache is set up, we can control it via the apk cache commands; however, the cleanup process for Alpine systems includes
sudo rm -rf /var/cache/apk/*
Alternatively, you can install packages with
sudo apk add --no-cache <package>
This command installs the package without creating a cache for it.
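The per-distribution commands above can be folded into one provisioning step. The following is a minimal sketch: the function names cache_clean_cmd and detect_cache_clean are hypothetical, and the script only prints the command it would pick, so nothing is deleted until you choose to run it.

```shell
#!/bin/sh
# Map a package manager name to the cache cleanup command from this section.
cache_clean_cmd() {
    case "$1" in
        apt-get) echo "apt-get clean" ;;
        yum)     echo "yum clean all" ;;
        dnf)     echo "dnf clean all" ;;
        apk)     echo "rm -rf /var/cache/apk/*" ;;
        *)       return 1 ;;          # unknown manager: print nothing
    esac
}

# Print (do not run) the cleanup command for the first manager found on PATH.
detect_cache_clean() {
    for pm in apt-get dnf yum apk; do
        if command -v "$pm" >/dev/null 2>&1; then
            cache_clean_cmd "$pm"
            return 0
        fi
    done
    echo "no known package manager found" >&2
    return 1
}
```

Once the printed command looks right for your system, run it as root by hand or from your build script.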
C. Language specific package cache
In a perfect world, the above step would have been sufficient. Alas, we live in a world full of oddities. One of them is that distros can’t keep up with the pace of development, so the most recent versions of language modules are available only via language-specific package managers, such as npm and more.
a. Ruby Gems
Ruby gems come with built-in documentation called rdoc. This documentation makes sense for developers but is of little use to users, especially inside a VM. We can disable its installation either at the global level or per command.
For Ruby before 2.6 we use the following command
gem install --no-rdoc --no-ri <gem_name>
For Ruby 2.6 onwards this has changed to
gem install --no-document <gem_name>
If you want to disable this at a global level, you need to place the following in your gem configuration file.
For Ruby 2.6 onwards:
gem: --no-document
For older versions, 2.5 or below:
gem: --no-rdoc --no-ri
This prevents the installation of documentation for every gem installed. A couple of pointers to keep in mind:
I. The bundler command doesn’t install rdoc and ri by default.
II. If you ever need a local copy of gem documentation, you can get it with the following command
gem rdoc <gem_name>
or if you want to get the full set of rdoc for all gems:
gem rdoc --all --overwrite
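Coming back to the global setting, it can be applied from a build script so that it is in place before the first gem is installed. This is a minimal sketch that assumes the target is the per-user ~/.gemrc file; the function name ensure_gem_no_document is hypothetical, and the line it writes is the Ruby 2.6+ variant.

```shell
#!/bin/sh
# Append the "gem:" default-options line to a gemrc file unless one exists.
# Assumption: the target is the per-user ~/.gemrc; pass another path to override.
ensure_gem_no_document() {
    rc="${1:-$HOME/.gemrc}"
    line="gem: --no-document"   # for Ruby 2.5 or below use: gem: --no-rdoc --no-ri
    touch "$rc"
    if ! grep -q '^gem:' "$rc"; then
        printf '%s\n' "$line" >> "$rc"
    fi
}
```

The grep check keeps the function safe to call more than once, which matters when a provisioning script is re-run against the same VM.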
b. Python Packages
Pip maintains its own cache to avoid duplicate HTTP requests. You can bypass it using
pip install --no-cache-dir <package_name>
c. Most *nix systems need Python and Ruby packages; however, people might use additional package managers, and each might have its own cache cleanup command. People have talked about such cleanup commands earlier, such as here. I list some of them here:
- NPM Packages
npm cache clean --force
- Yarn Package Manager
yarn cache clean
- PHP Composer
- Go (Golang) cache
go clean -cache -modcache
Note: These are cache cleanup commands, i.e. they are only useful once the cache has already built up. Also, I have not tested these commands, so use them with caution.
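Since not every VM has every one of these tools, a small wrapper can skip the ones that are missing. The sketch below is only an illustration: clean_if_present, clean_language_caches and the DRY_RUN variable are hypothetical names, and the cleanup commands are copied from the list above, so the same caution applies.

```shell
#!/bin/sh
# Run a cleanup command only when its tool is installed; otherwise report a skip.
# With DRY_RUN=1 the command is printed instead of executed.
clean_if_present() {
    tool="$1"; shift
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "skip: $tool not installed"
        return 0
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Commands taken from the list above; adjust to the tools you actually use.
clean_language_caches() {
    clean_if_present npm  npm cache clean --force
    clean_if_present yarn yarn cache clean
    clean_if_present go   go clean -cache -modcache
}
```

Setting DRY_RUN=1 before calling clean_language_caches shows what would happen on a given machine without touching any cache.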
D. Optimize package setup
This brings us to the next option, which is to optimize your package setup. Some straightforward choices are:
a. If possible, stick to a single desktop environment and try to pick a lightweight option (in terms of both RAM and disk). My ordering, from heaviest to lightest, has been:
KDE > GNOME > MATE > XFCE > LXDE > Openbox > i3
Your ordering can vary; however, you need to pick one and stick to it. Mixing GTK and Qt packages pulls in a large number of dependencies from both sides.
b. Remove any unnecessary packages: Ensure we install only the required packages. A user can install a music player or a game if they need it. When distributing a VM, keep its primary aim in mind. During cleanup, once we remove the main package, remove the stale dependencies with
sudo apt-get autoremove
c. Remove unneeded kernel versions: For our use case, a single kernel version should be fine unless you specifically need two different versions. Upgrades can leave multiple kernel packages installed, and autoremove can clean up some of them. However, autoremove will not remove your running kernel, so if you upgraded packages and didn’t reboot, autoremove will leave two kernels behind. Removing a single kernel package can free about 200 MB or more of disk space.
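To see which kernel packages are left over, the list of installed packages can be filtered against the running kernel. This is a minimal sketch for Debian-based systems: the function name old_kernel_packages is hypothetical, and it only matches versioned linux-image packages, so related packages such as headers or modules would need the same treatment.

```shell
#!/bin/sh
# Read package names on stdin and print the versioned kernel image packages
# that do not belong to the given kernel release (default: the running kernel).
old_kernel_packages() {
    running="${1:-$(uname -r)}"
    grep '^linux-image-[0-9]' | grep -v -F -- "$running"
}

# Example listing on a Debian-based system (prints only, removes nothing):
#   dpkg -l 'linux-image-*' | awk '/^ii/ {print $2}' | old_kernel_packages
```

Reboot into the newest kernel first, review the printed names, and then remove them with sudo apt-get purge followed by the package names.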
E. Docker cleanup
A lot of VMs now come with Docker installed and Docker containers in them. We need to clear out the Docker caches and remove any image that is not required. This is achieved with the docker system prune command
docker system prune -a --volumes
This will clear the following:
- all stopped containers
- all networks not used by at least one container
- all volumes not used by at least one container
- all images without at least one container associated to them
- all build cache
Once cleanup is performed, pull any missing image again via:
docker pull <image>
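Because the prune removes every image that has no container, it helps to note the image names beforehand so the needed ones can be pulled again. The following is a sketch under assumptions: pullable_images and the file name images.txt are hypothetical, and the Docker commands are left as comments because they need a running Docker daemon.

```shell
#!/bin/sh
# Read "repository:tag" lines on stdin and drop untagged (dangling) entries,
# which cannot be pulled again by name. Duplicates are removed as well.
pullable_images() {
    grep -v '<none>' | sort -u
}

# Example flow (commented out because it needs a running Docker daemon):
#   docker image ls --format '{{.Repository}}:{{.Tag}}' | pullable_images > images.txt
#   docker system prune -a --volumes
#   while read -r image; do docker pull "$image"; done < images.txt
```

Edit images.txt before the last step so that only the images the VM really needs come back.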
F. Version Control repositories
Many times we end up using projects from git-based repositories. Some space-saving tips for them:
- Avoid git clone if possible; download the zip file and unzip the content instead.
- The git details are stored in the .git folder. Unless it is required, remove the .git folder; that will save lots of space, as it is generally many times bigger than the code itself.
- If git clone is unavoidable, use the --depth parameter
git clone --depth=1 <URL>
This ensures only the latest commit is cloned and not the entire history of the repository.
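For repositories that are already cloned, the .git folders can be removed in one pass. This is a minimal sketch with a hypothetical function name, strip_git_dirs; it is destructive, since the history and the ability to git pull are gone afterwards.

```shell
#!/bin/sh
# Remove every .git directory below the given directory.
# Destructive: use it only on copies that are meant to ship inside the image.
strip_git_dirs() {
    root="${1:?usage: strip_git_dirs DIR}"
    # -prune stops find from descending into a .git directory it is about to delete
    find "$root" -type d -name .git -prune -exec rm -rf {} +
}
```

Running du -sh on a .git folder before removing it shows how much space that repository's history takes.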
G. Vagrant specific tip
For anyone building VMs, I highly recommend leveraging Vagrant and automated build tools like Ansible, Chef, or Puppet. I have done the same for AndroidTamer. If you take my advice and use Vagrant for your VM building exercise, I highly recommend an ancient and unmaintained plugin, vagrant-cachier. I have looked everywhere and have found nothing else similar to this plugin. The beauty of this plugin is in the caching setup it does. When you run a VM with vagrant-cachier configured (examples here and here), you create a box-level cache of all the packages and other buckets.
H. Slack space cleanup
This is the most fun part of reclaiming storage space. Let’s first understand what slack space is, and then talk about how it helps. And for the uninitiated, it’s not a new service by slack.com.
a. In Unix everything is a file.
b. A folder is a file containing the list of files inside the folder.
c. Every file has a record pointing to its disk location. When a file is deleted, its content is by default not erased; rather, the inode is marked as available to the file system.
d. The file system may choose to overwrite it or leave it as is.
e. In forensics, this is exactly the slack space that is leveraged to recover deleted files.
For our use case: all the cache cleanup and other activities we performed have left a large amount of slack space on the disk. When VirtualBox or VMware is tasked with creating an OVA from this virtual machine, it can’t differentiate between slack space and actual files, and hence it copies all the data into the OVA. Even though compression is used, the leftover data is effectively random and hence takes up more space. This results in a net increase in the size of the OVA. However, if we change the slack space so that it contains only zeros, the entire space compresses well, reducing the overall OVA size.
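The difference is easy to see on a small scale. The sketch below compresses 1 MiB of zeros and 1 MiB of random bytes and prints both compressed sizes; gzip is used here only as a stand-in for whatever compression the OVA export applies, and the function name compressed_size is hypothetical.

```shell
#!/bin/sh
# Compress the first N bytes read from a source and print the compressed size in bytes.
compressed_size() {
    src="$1"; bytes="$2"
    head -c "$bytes" "$src" | gzip -c | wc -c | tr -d ' '
}

zero_size="$(compressed_size /dev/zero 1048576)"
random_size="$(compressed_size /dev/urandom 1048576)"
echo "1 MiB of zeros compresses to $zero_size bytes"
echo "1 MiB of random data compresses to $random_size bytes"
```

The zeroed megabyte shrinks to a tiny fraction of its size, while the random one barely shrinks at all, which is the effect we want to exploit for the free space inside the VM.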
We will achieve this with the following steps.
I. We will create a file with all zeros inside it that fills the entire free space.
sudo dd if=/dev/zero of=zerofill bs=512K
This file will fill up all the free space, forcing the system to overwrite all of it with zeros.
II. We then remove the file
rm -rf zerofill
III. Now we shut down the VM. I prefer cloning the VM and then taking an OVA export of the cloned VM. However, your mileage may vary, and you might get better results by exporting the OVA directly.
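Steps I and II can be wrapped in one function to run inside the guest just before shutdown. This is a minimal sketch, not a definitive implementation: the function name zero_fill and the optional block count (useful for a harmless trial run) are assumptions added for illustration, and the sync call is an extra precaution that is not part of the steps above.

```shell
#!/bin/sh
# Write zeros to a scratch file, flush them to disk, then delete the file.
# With no block count, dd runs until the filesystem is full and exits non-zero,
# which is expected here, so that exit status is ignored.
zero_fill() {
    target="${1:-zerofill}"     # scratch file name from the steps above
    blocks="${2:-}"             # optional limit in 512K blocks for a trial run
    if [ -n "$blocks" ]; then
        dd if=/dev/zero of="$target" bs=512K count="$blocks" 2>/dev/null
    else
        dd if=/dev/zero of="$target" bs=512K 2>/dev/null || true
    fi
    sync                        # make sure the zeros actually reach the disk
    rm -f "$target"
}
```

A trial call such as zero_fill /tmp/zerofill 4 writes only 2 MiB; the real run, without a count and as root, should be repeated once per mounted filesystem that you want zeroed.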
Hope these tips help you create a slimmer OVA and reduce the overall capacity used for downloading those files.