[...] If anyone can recommend some PDF wrangling software that might deliver a smaller file size, I'd be interested.
Please excuse me if I'm missing something obvious here, but couldn't you just use a compression tool on the file?
Compression background
To answer both of these questions, some background about compression is needed.
There are lossless compression techniques (eg .zip, .rar, .gz, .lzma, .7z, png) and lossy compression techniques (jpeg, most video codecs [h264, mpeg*], most audio codecs [opus, mp3, aac]). Lossy is a lot better at saving space, but it sacrifices some details.
Lossy is pretty much a default expectation of modern computing and society, with lossless only used where any loss would be inappropriate (eg text files, written documents).
Lossy compressors are very situation specific so I won't go into their details here. Most have variable control: more compression (more detail loss) or less compression (less detail loss). You may have encountered this when saving as JPEG.
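To make the "more compression = more detail loss" knob a bit more concrete, here's a toy sketch in Python. It is not how JPEG works internally; it just rounds some made-up sample values to coarser and coarser steps before handing them to a lossless compressor (zlib). The signal, the step sizes and the function name are all my own assumptions for illustration.

```python
import math
import random
import zlib

def lossy_size_and_error(samples, step):
    """Round samples down to multiples of `step`, then deflate them.

    Returns (compressed size in bytes, worst-case error introduced).
    """
    quantised = [(v // step) * step for v in samples]
    worst_error = max(v - q for v, q in zip(samples, quantised))
    return len(zlib.compress(bytes(quantised), 9)), worst_error

# A made-up noisy signal: a slow sine wave plus small random jitter (values stay in 0..255).
rng = random.Random(0)
samples = [int(128 + 100 * math.sin(i / 50)) + rng.randint(-5, 5) for i in range(10_000)]

# step=1 keeps every detail; bigger steps throw more detail away.
results = {step: lossy_size_and_error(samples, step) for step in (1, 16, 64)}
for step, (size, err) in results.items():
    print(f"step={step:2d}  compressed={size:5d} bytes  worst error={err}")
```

The bigger the step, the smaller the output and the larger the worst-case error. That's the same trade-off the JPEG quality slider exposes, just done far more crudely.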
Lossless compressors work by looking for repeating patterns in your files and replacing these repeating patterns with just one copy (plus some references). This is how archive formats like .zip, .rar and .7z work.
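A quick way to see this for yourself, assuming you have Python handy: zlib implements deflate, the same family of algorithm used by .zip and .gz. The repeated text is made up purely for the demo.

```python
import zlib

# Very repetitive, made-up input: the same 28-byte phrase 500 times over.
page = b"Service manual page header. " * 500

packed = zlib.compress(page, 9)
unpacked = zlib.decompress(packed)

print(len(page), "bytes before,", len(packed), "bytes after")
print("identical after round trip:", unpacked == page)
```

Nothing is lost on the round trip; the compressor has only stored the phrase once plus instructions to repeat it.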
Regardless of which compression technique you use, the bits and bytes in the compressed output should look mostly indistinguishable from random data. This means there are no obvious patterns left to exploit for further compression. ie compressing a file multiple times will not make it smaller (with some technical exceptions); only one layer of compression should be applied or needed (again with some weird exceptions, not relevant here).
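Here's a small sketch of that "compress once only" point, again using Python's zlib. The word list is invented filler and the exact sizes will depend on your zlib build, so treat the printed numbers as indicative only.

```python
import random
import zlib

rng = random.Random(42)

# Made-up filler text: 20,000 words picked at random from a tiny vocabulary.
words = [b"scope", b"trigger", b"sweep", b"volts", b"probe",
         b"channel", b"adjust", b"the", b"and", b"to"]
text = b" ".join(rng.choice(words) for _ in range(20_000))

once = zlib.compress(text, 9)    # first layer: finds plenty of patterns
twice = zlib.compress(once, 9)   # second layer: almost nothing left to find

# Random bytes for comparison: no patterns in there to begin with.
noise = bytes(rng.getrandbits(8) for _ in range(len(once)))
noise_packed = zlib.compress(noise, 9)

print("original text:   ", len(text))
print("compressed once: ", len(once))
print("compressed twice:", len(twice))
print("random bytes:    ", len(noise), "->", len(noise_packed))
```

The second pass behaves about the same as feeding the compressor random bytes: no meaningful gain. This is why zipping a PDF full of jpegs gets you nowhere.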
Initial inspection
Now let's look at the specific PDF file: Goldstar_OS-7020_Service_Manual.pdf (19MiB). I'll start my inspection by extracting all of the images using pdfimages (part of poppler-utils):
$ pdfimages ../GoldStar.pdf -all ex
$ ls
ex-000.jp2 ex-039.jpg ex-078.jpg ex-117.jpg ex-156.jpg ex-195.jpg ex-234.jpg ex-273.jpg
ex-001.jpg ex-040.jpg ex-079.jpg ex-118.png ex-157.jpg ex-196.jpg ex-235.jpg ex-274.jpg
ex-002.jpg ex-041.jpg ex-080.jpg ex-119.jpg ex-158.jpg ex-197.jpg ex-236.jpg ex-275.jpg
ex-003.jpg ex-042.jpg ex-081.jpg ex-120.jpg ex-159.jpg ex-198.jpg ex-237.jpg ex-276.jpg
ex-004.jpg ex-043.jpg ex-082.jpg ex-121.jpg ex-160.jpg ex-199.jpg ex-238.jpg ex-277.jpg
ex-005.jpg ex-044.jpg ex-083.jpg ex-122.jpg ex-161.jpg ex-200.jpg ex-239.jpg ex-278.jpg
ex-006.jpg ex-045.jpg ex-084.jpg ex-123.jpg ex-162.jpg ex-201.jpg ex-240.jpg ex-279.jpg
ex-007.jpg ex-046.jpg ex-085.jpg ex-124.png ex-163.jpg ex-202.jpg ex-241.jpg ex-280.jpg
ex-008.jpg ex-047.jpg ex-086.jpg ex-125.jpg ex-164.png ex-203.jpg ex-242.jpg ex-281.jpg
ex-009.jpg ex-048.jpg ex-087.jpg ex-126.jpg ex-165.jpg ex-204.jpg ex-243.jpg ex-282.jpg
ex-010.jpg ex-049.jpg ex-088.jpg ex-127.jpg ex-166.jpg ex-205.jpg ex-244.jpg ex-283.jpg
ex-011.jpg ex-050.jpg ex-089.jpg ex-128.jpg ex-167.jpg ex-206.jpg ex-245.jpg ex-284.jpg
ex-012.jpg ex-051.jpg ex-090.jpg ex-129.jpg ex-168.jpg ex-207.jpg ex-246.jpg ex-285.jpg
ex-013.jpg ex-052.jpg ex-091.jpg ex-130.jpg ex-169.jpg ex-208.jpg ex-247.jpg ex-286.jpg
ex-014.jpg ex-053.jpg ex-092.jpg ex-131.jpg ex-170.jpg ex-209.jpg ex-248.jpg ex-287.jpg
ex-015.jpg ex-054.jpg ex-093.jpg ex-132.jpg ex-171.jpg ex-210.jpg ex-249.jpg ex-288.jpg
ex-016.jpg ex-055.jpg ex-094.png ex-133.jpg ex-172.jpg ex-211.jpg ex-250.jpg ex-289.jpg
ex-017.jpg ex-056.jpg ex-095.jpg ex-134.jpg ex-173.jpg ex-212.jpg ex-251.jpg ex-290.jpg
ex-018.jpg ex-057.jpg ex-096.jpg ex-135.jpg ex-174.jpg ex-213.jpg ex-252.jpg ex-291.jpg
ex-019.jpg ex-058.jpg ex-097.jpg ex-136.jpg ex-175.jpg ex-214.jpg ex-253.jpg ex-292.jpg
ex-020.jpg ex-059.jpg ex-098.jpg ex-137.jpg ex-176.jpg ex-215.jpg ex-254.jpg ex-293.jpg
ex-021.jpg ex-060.jpg ex-099.jpg ex-138.jpg ex-177.jpg ex-216.jpg ex-255.jpg ex-294.jpg
ex-022.jpg ex-061.jpg ex-100.jpg ex-139.png ex-178.jpg ex-217.jpg ex-256.jpg ex-295.jpg
ex-023.jpg ex-062.jpg ex-101.jpg ex-140.jpg ex-179.jpg ex-218.jpg ex-257.jpg ex-296.jpg
ex-024.jpg ex-063.jpg ex-102.jpg ex-141.jpg ex-180.jpg ex-219.jpg ex-258.jpg ex-297.jpg
ex-025.jpg ex-064.jpg ex-103.jpg ex-142.jpg ex-181.jpg ex-220.jpg ex-259.jpg ex-298.jpg
ex-026.jpg ex-065.jpg ex-104.jpg ex-143.jpg ex-182.jpg ex-221.jpg ex-260.jpg ex-299.jpg
ex-027.jpg ex-066.jpg ex-105.jpg ex-144.jpg ex-183.jpg ex-222.jpg ex-261.jpg ex-300.jpg
ex-028.jpg ex-067.jpg ex-106.jpg ex-145.jpg ex-184.jpg ex-223.jpg ex-262.jpg ex-301.jpg
ex-029.jpg ex-068.jpg ex-107.jpg ex-146.jpg ex-185.jpg ex-224.jpg ex-263.jpg ex-302.jpg
ex-030.jpg ex-069.jpg ex-108.jpg ex-147.jpg ex-186.jpg ex-225.jpg ex-264.jpg ex-303.jpg
ex-031.jpg ex-070.jpg ex-109.png ex-148.jpg ex-187.jpg ex-226.jpg ex-265.jpg ex-304.jpg
ex-032.jpg ex-071.jpg ex-110.jpg ex-149.jpg ex-188.jpg ex-227.jpg ex-266.jpg ex-305.jpg
ex-033.jpg ex-072.jpg ex-111.jpg ex-150.jpg ex-189.jpg ex-228.jpg ex-267.jpg ex-306.jp2
ex-034.jpg ex-073.jpg ex-112.jpg ex-151.jpg ex-190.jpg ex-229.jpg ex-268.jpg ex-307.jp2
ex-035.jpg ex-074.jpg ex-113.jpg ex-152.png ex-191.jpg ex-230.jpg ex-269.jpg ex-308.jp2
ex-036.jpg ex-075.jpg ex-114.jpg ex-153.jpg ex-192.jpg ex-231.jpg ex-270.jpg
ex-037.jpg ex-076.jpg ex-115.jpg ex-154.jpg ex-193.jpg ex-232.jpg ex-271.jpg
ex-038.jpg ex-077.jpg ex-116.jpg ex-155.jpg ex-194.jpg ex-233.jpg ex-272.jpg
It looks like this file is almost nothing but jpegs, with the occasional png and jp2. That means that to further (re)compress this file, what we really need to do is focus on the jpegs. The PDF is nothing but a wrapper around them. This is basically always the case for scanned PDFs.
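If you'd rather count than eyeball a listing like that, a few lines of Python will tally the file extensions. The function name is my own, and the sample list is just a handful of names copied from the listing; in practice you'd feed it something like os.listdir(".").

```python
from collections import Counter
from pathlib import PurePath

def count_extensions(filenames):
    """Tally files by (lower-cased) extension, eg '.jpg' -> 2."""
    return Counter(PurePath(name).suffix.lower() for name in filenames)

# A few names taken from the pdfimages output above.
sample = ["ex-000.jp2", "ex-001.jpg", "ex-002.jpg", "ex-094.png"]
print(count_extensions(sample))
```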
We have several options to reduce the file sizes of images:
- Reduce their size (resolution)
- Reduce their bitdepth (number of colours)
- Increase their compression level (throw out more data during lossy/jpeg compression)
- Change the lossy codec to something else (eg from jpeg to jp2)
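To get a feel for how much the first two options matter before any codec gets involved, here's the arithmetic for raw (uncompressed) image size as a small Python sketch. The 850x1100 pixel page (a letter-sized page at 100 dpi) is my own assumption for illustration, not a measurement of this PDF.

```python
def raw_image_bytes(width_px, height_px, bits_per_pixel):
    """Uncompressed size of an image in bytes, rounded up to a whole byte."""
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8

# Hypothetical page: 850 x 1100 pixels.
for label, bits in (("24-bit colour", 24), ("8-bit greyscale", 8), ("1-bit black and white", 1)):
    print(f"{label:22s} {raw_image_bytes(850, 1100, bits):>9,d} bytes")

# Doubling the resolution quadruples the pixel count.
print(f"{'8-bit grey, 2x res':22s} {raw_image_bytes(1700, 2200, 8):>9,d} bytes")
```

Bit depth scales the raw size linearly and resolution scales it quadratically. The codec then works on top of whatever is left.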
Sidenote: file quirks & better software
Unfortunately for us, each page of this PDF is not a single jpeg. Each page has been split into multiple jpegs that are aligned together in a grid.

Some software does this; it's quite annoying.

To work around this, we're going to use the imagemagick tools from now on. These are smart enough to treat each page as a single image. This particular software is very common and well known in the *nix and web-development worlds (many websites use it behind the scenes for image processing), but not well known in the Windows world, even though there is a Windows version available. The world is weird. This software is the duck's guts for anything that requires mass-editing of multiple files simultaneously.
My method
(1) Remove the very first and last pages. It's pretty to see the paper texture from the original manual cover, but it's all undulating and hard to compress. The second page of the PDF is the same as the cover anyway, just in black and white.
(2) Convert everything to greyscale. There is no point having colour anywhere in this document, so let's not make the compressor (eg jpeg) think it has to preserve it. Strip it all out.
(3) Remove ghosted text from other pages by fiddling with brightness/levels. I'm talking about this stuff (emphasized via editing):

You can't (normally) see this and it's useless, so let's get rid of it. The fact it's still there in the image means the last person's compressor was wasting space trying to keep it.
(4) Avoid reducing the resolution. Resolution is nice, esp in technical documents. There's nothing worse than a blurry technical diagram. The whole point of these documents is to make people happy, not grumpy.
(5) Try several different compressors (jpeg, jpeg2000, png, etc). PDFs can happily house quite a few different image formats.
First attempt: adjust levels, greyscale, jpeg
$ mkdir temp1
$ magick convert -density 100 Goldstar_OS-7020_Service_Manual.pdf -set colorspace Gray -level '25%,75%,0.3' -quality 70 'temp1/page%03d.jpeg'
$ rm temp1/page000.jpeg temp1/page099.jpeg
$ du -sh temp1/
9.1M
9.1MiB is not bad, but we can be squeezier if we're clever.
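For anyone wondering what the -level '25%,75%,0.3' part is doing to the pixels, here is a simplified model in Python. This is my understanding of a black point / white point / gamma adjustment, not ImageMagick's actual code. The function name, the 0..1 brightness scale and the 1/gamma exponent convention are assumptions, so check the ImageMagick docs before relying on the exact numbers.

```python
def level(value, black=0.25, white=0.75, gamma=0.3):
    """Simplified levels adjustment on a brightness in 0..1 (0 = black, 1 = white).

    Everything at or below `black` becomes 0, everything at or above `white`
    becomes 1, and the range in between is stretched and then gamma-adjusted.
    """
    stretched = (value - black) / (white - black)
    clamped = min(1.0, max(0.0, stretched))
    return clamped ** (1.0 / gamma)

for v in (0.10, 0.50, 0.70, 0.80, 0.95):
    print(f"in={v:.2f} -> out={level(v):.3f}")
```

In this model, faint ghosted text that is only slightly darker than the paper (anything brighter than 75%) ends up pure white, while real ink gets pushed towards black. Less subtle variation means less for the jpeg compressor to preserve.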
Second attempt: adjust levels, greyscale, png
$ mkdir temp2
$ convert -density 100 Goldstar_OS-7020_Service_Manual.pdf -set colorspace Gray -level '25%,75%,0.4' 'temp2/page%03d.png'
$ rm temp2/page000.png temp2/page099.png
$ du -sh temp2/
12M
Eep, wrong way. This codec doesn't seem useful with this sort of image data.
... but what if we massage the image into something that png would actually like? Eg black and white (no greyscale)?
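The hunch here is that png's compressor (deflate, the same thing zlib implements) hates scanner noise but loves long flat runs of identical pixels. A rough Python sketch of that idea, using a completely made-up "scan" of dark bars on noisy paper; the image, the sizes and the fixed threshold of 128 are all assumptions for the demo, and a real png also applies filters and can store 1 bit per pixel, which this ignores.

```python
import random
import zlib

rng = random.Random(1)
WIDTH, HEIGHT = 200, 200

def fake_scan_pixel(x, y):
    """Dark vertical bars on light paper, with scanner-style noise on both."""
    is_ink = (x // 4) % 5 == 0 and (y // 10) % 2 == 0
    base = 30 if is_ink else 225
    return base + rng.randint(-20, 20)

# One byte per pixel, row by row.
grey = bytes(fake_scan_pixel(x, y) for y in range(HEIGHT) for x in range(WIDTH))

# Threshold: every pixel becomes pure black (0) or pure white (255).
black_white = bytes(255 if p > 128 else 0 for p in grey)

grey_size = len(zlib.compress(grey, 9))
bw_size = len(zlib.compress(black_white, 9))
print("noisy greyscale:", grey_size, "bytes")
print("black and white:", bw_size, "bytes")
```

The picture is the same to a human eye, but once the noise is gone the lossless compressor finally has patterns to work with.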
Third attempt: threshold into black and white, png
$ mkdir temp3
$ convert -density 100 Goldstar_OS-7020_Service_Manual.pdf -set colorspace Gray -alpha off -auto-threshold OTSU 'temp3/page%03d.png'
$ rm temp3/page000.png temp3/page099.png
$ du -sh temp3/
1.7M
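For the curious, here's roughly what Otsu's method does when it picks the black/white cut-off: it tries every possible threshold and keeps the one that best separates the pixels into a dark group and a light group. This is a plain-Python sketch of the textbook algorithm on 8-bit values, not ImageMagick's implementation; the function names are mine.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold t that maximises between-class variance.

    Pixels <= t are treated as 'dark', pixels > t as 'light'.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    dark_count, dark_sum = 0, 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0:
            continue          # nothing in the dark group yet
        if light_count == 0:
            break             # everything is dark from here on
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        # Between-class variance (up to a constant factor).
        score = dark_count * light_count * (dark_mean - light_mean) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t

def binarize(pixels, threshold):
    """Map each pixel to 0 (dark) or 1 (light)."""
    return [1 if p > threshold else 0 for p in pixels]

# Tiny made-up example: half ink-ish pixels, half paper-ish pixels.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
print("threshold:", t)
print("light pixels:", sum(binarize(pixels, t)))
```

The nice part is that there's no magic number to tune per page: the threshold is derived from each image's own histogram.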

I think the resolution is a bit low however, so I up it to -density 200 and try again:
$ du -sh temp4
3.8M
Not as good filesize wise, but the diagrams are a lot more readable.