Compressing Microsoft Office Word documents (docx)

Structure of a docx document

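A docx file is essentially a ZIP archive. It contains the content types defined in [Content_Types].xml, the relationships maintained by the .rels files, the document content in document.xml, the style definitions in styles.xml, and the list styles in numbering.xml. These components work together to build and render the document's structure and styling. After changing the file extension from docx to zip, the file can be extracted into the following directory structure: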


│  [Content_Types].xml
├─docProps
│      app.xml
│      core.xml
│      custom.xml
│
├─word
│  │  document.xml
│  │  endnotes.xml
│  │  fontTable.xml
│  │  footer1.xml
│  │  footnotes.xml
│  │  settings.xml
│  │  styles.xml
│  │  webSettings.xml
│  │
│  ├─media
│  │      image1.jpg
│  │      image10.emf
│  │      image11.emf
│  │      image12.emf
│  │      image13.png
│  │      image14.emf
│  │      image15.emf
│  │      image16.emf
│  │      image17.emf
│  │      image2.jpg
│  │      image3.jpg
│  │      image4.jpg
│  │      image5.png
│  │      image6.png
│  │      image7.emf
│  │      image8.png
│  │      image9.png
│  │
│  ├─theme
│  │      theme1.xml
│  │
│  └─_rels
│          document.xml.rels
│
└─_rels
        .rels
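
To illustrate how the .rels file ties the document to its media, here is a minimal sketch that reads word/_rels/document.xml.rels straight from the archive, without renaming it to zip. The function name `image_relationships` is made up for this example, and it assumes the image relationship type URI ends in "/image", as it does in files written by Word.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by OPC relationship parts (.rels files).
RELS_NS = "{http://schemas.openxmlformats.org/package/2006/relationships}"


def image_relationships(docx):
    """Return {relationship id: target path} for the image parts of a docx.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/_rels/document.xml.rels"))
    return {
        rel.get("Id"): rel.get("Target")
        for rel in root.iter(RELS_NS + "Relationship")
        # Assumption: image relationships have a Type URI ending in "/image".
        if rel.get("Type", "").endswith("/image")
    }
```

The targets are relative to the word folder, so a value such as media/image1.jpg refers to word/media/image1.jpg in the tree above.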

Compressing the docx file

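The files that take up the most space are usually those under the word/media directory, so compression mainly targets them. docx does not currently support the newest image formats such as JPEG XL, AVIF, or WebP 2, so common formats such as jpg and png still have to be used.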
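
Before compressing anything, it can help to check which image formats actually dominate a given file. The following is a rough sketch, not part of the original workflow; `media_bytes_by_extension` is a hypothetical helper that sums the space each extension occupies inside the archive.

```python
import os
import zipfile
from collections import Counter


def media_bytes_by_extension(docx):
    """Sum the bytes that word/media entries occupy in the archive, per extension."""
    totals = Counter()
    with zipfile.ZipFile(docx) as zf:
        for info in zf.infolist():
            if info.filename.startswith("word/media/") and not info.is_dir():
                ext = os.path.splitext(info.filename)[1].lower()
                # compress_size is the size stored in the archive, which is
                # what contributes to the size of the docx file itself.
                totals[ext] += info.compress_size
    return dict(totals)
```

If most of the bytes belong to .emf entries, the conversion step for emf files is likely to matter more than recompressing the jpg and png files.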

1. Unzip

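Change the extension to zip and the file can be extracted. In Python, the archive can be opened directly without renaming it.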


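import os
import shutil

import pyzipper


def unzip(file):
    # Extract next to the archive, into a folder named after it (without ".docx").
    docname = os.path.splitext(file)[0]
    if os.path.exists(docname):
        print('Target folder already exists, removing it first')
        shutil.rmtree(docname)

    with pyzipper.PyZipFile(file, "r") as zf:
        zf.extractall(docname)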

2. Compress jpg, png, and similar files

These files can be compressed directly. Caesium gives good compression results; its command-line tool is used here.


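import logging
import subprocess


def compress_image(
    input_path: str,
    quality: int = 80
):
    # Passing the arguments as a list keeps paths containing spaces intact.
    command = ['caesiumclt.exe', '--same-folder-as-input',
               '--quality', str(quality), input_path]
    print(' '.join(command))

    try:
        # check=True raises if caesiumclt exits with a non-zero status.
        subprocess.run(command, check=True)
    except (OSError, subprocess.CalledProcessError) as e:
        logging.error(f"An error occurred: {str(e)}")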

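A quality of 50 is usually a good choice: the compressed images still look good, and even 20 can give quite decent results. Even at a relatively high quality of 80, the files become much smaller.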

3. Compress emf files

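emf files are usually fairly large; after converting them to jpg or png they are often much smaller. ImageMagick can be used for the format conversion, followed by compression as above. Because the file extension changes from emf to jpg or png, the word/_rels/document.xml.rels file must be updated to point to the new file names.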
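
The post does not show code for this step, so the following is only a sketch of the bookkeeping after the conversion itself has been done. Both function names are hypothetical. It assumes the relationships use double-quoted Target="media/..." attributes, as Word normally writes them, and it also adds a Default entry to [Content_Types].xml in case the new extension is not declared there yet, since the package format expects every part to have a declared content type.

```python
import re


def retarget_media(rels_xml, renames):
    """Rewrite Target="media/<old>" attributes in document.xml.rels text.

    `renames` maps old file names to new ones,
    for example {"image7.emf": "image7.png"}.
    """
    for old, new in renames.items():
        # The closing quote is part of the match, so renaming image1.emf
        # does not touch image10.emf.
        rels_xml = rels_xml.replace(
            f'Target="media/{old}"', f'Target="media/{new}"'
        )
    return rels_xml


def ensure_default_type(content_types_xml, extension, mime):
    """Add a <Default> entry for `extension` to [Content_Types].xml text if missing."""
    pattern = rf'<Default\b[^>]*\bExtension="{re.escape(extension)}"'
    if re.search(pattern, content_types_xml, re.IGNORECASE):
        return content_types_xml
    entry = f'<Default Extension="{extension}" ContentType="{mime}"/>'
    # Insert the new entry right after the opening <Types ...> tag.
    return re.sub(
        r"(<Types\b[^>]*>)",
        lambda m: m.group(1) + entry,
        content_types_xml,
        count=1,
    )
```

Plain string edits are used here instead of an XML parser so that the rest of each file, including its namespace prefixes, is written back unchanged.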

4. Repack

Pack the modified files back into an archive with the same layout, choosing ZIP_DEFLATED as the compression method.


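def zip(folder, zipfile):
    # Note: the names `zip` and `zipfile` shadow the built-in and the stdlib module.
    print('zip:', folder, ' -> ', zipfile)
    with pyzipper.PyZipFile(zipfile, "w", compression=pyzipper.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(folder):
            for file in files:
                abs_path = os.path.join(root, file)
                # Store paths relative to the folder so [Content_Types].xml sits at the archive root.
                rel_path = os.path.relpath(abs_path, folder)
                zf.write(abs_path, rel_path)

    # Remove the temporary extraction folder once the archive is written.
    shutil.rmtree(folder)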
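
A common packing mistake is to include the extraction folder itself in the archive, so that entries become something like mydoc/word/document.xml and Word refuses to open the file. As a sanity check, a sketch along these lines could be run on the result; `check_docx` and `REQUIRED_PARTS` are names invented for this example, and the list of required parts is a minimal assumption rather than a full validation.

```python
import zipfile

# Parts that are expected at these exact paths in a Word document package.
REQUIRED_PARTS = ("[Content_Types].xml", "_rels/.rels", "word/document.xml")


def check_docx(path):
    """Return a list of problems found in a repacked docx (empty if none)."""
    problems = []
    with zipfile.ZipFile(path) as zf:
        names = set(zf.namelist())
        for part in REQUIRED_PARTS:
            if part not in names:
                problems.append(f"missing part: {part}")
        # testzip() returns the first entry with a bad CRC, or None.
        bad = zf.testzip()
        if bad is not None:
            problems.append(f"corrupt entry: {bad}")
    return problems
```

An empty list only means the archive layout looks plausible; opening the file in Word remains the real test.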

5. Code for the whole process


def compress_docx(indir, outdir):
    for root, dirs, files in os.walk(indir):
        for file in files:
            if file.endswith('.docx'):
                docfile = os.path.join(root, file)
                unzip(docfile)
                docname = file[0:-5]  # strip the ".docx" suffix
                imgpath = os.path.join(root, docname, 'word/media/')
                compress_image(imgpath, 50)

                # Mirror the input directory layout under outdir.
                outfolder = os.path.join(outdir, os.path.relpath(root, indir))
                # makedirs also creates missing parent folders for nested inputs.
                os.makedirs(outfolder, exist_ok=True)
                zip(os.path.join(root, docname), os.path.join(outfolder, file))

Summary

Processed this way with quality=50, a docx file can typically shrink to about 1/3 of its original size.
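
To see what ratio a particular batch of documents reaches, the input and output directories can be compared file by file. This is a small sketch with a hypothetical helper name, `size_ratios`; it assumes the output directory mirrors the layout of the input directory.

```python
import os


def size_ratios(indir, outdir):
    """Map each .docx (path relative to indir) to compressed size / original size."""
    ratios = {}
    for root, _dirs, files in os.walk(indir):
        for name in files:
            if not name.lower().endswith(".docx"):
                continue
            src = os.path.join(root, name)
            rel = os.path.relpath(src, indir)
            dst = os.path.join(outdir, rel)
            # Skip files that have no counterpart in outdir or are empty.
            if os.path.exists(dst) and os.path.getsize(src) > 0:
                ratios[rel] = os.path.getsize(dst) / os.path.getsize(src)
    return ratios
```

A value near 0.33 for a file corresponds to the 1/3 figure mentioned above.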