Compressing Microsoft Office Word documents (docx)

Structure of a docx document

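A docx file is essentially a ZIP archive. It contains [Content_Types].xml, which declares the content types; the .rels files, which maintain the relationships between parts; document.xml, which holds the document body; styles.xml, which holds the style definitions; and numbering.xml, which holds the list styles. Together these parts define the structure and appearance of the document. After changing the file extension from docx to zip, the archive can be extracted into a directory structure like the following: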

│  [Content_Types].xml
├─docProps
│      app.xml
│      core.xml
│      custom.xml
│
├─word
│  │  document.xml
│  │  endnotes.xml
│  │  fontTable.xml
│  │  footer1.xml
│  │  footnotes.xml
│  │  settings.xml
│  │  styles.xml
│  │  webSettings.xml
│  │
│  ├─media
│  │      image1.jpg
│  │      image10.emf
│  │      image11.emf
│  │      image12.emf
│  │      image13.png
│  │      image14.emf
│  │      image15.emf
│  │      image16.emf
│  │      image17.emf
│  │      image2.jpg
│  │      image3.jpg
│  │      image4.jpg
│  │      image5.png
│  │      image6.png
│  │      image7.emf
│  │      image8.png
│  │      image9.png
│  │
│  ├─theme
│  │      theme1.xml
│  │
│  └─_rels
│          document.xml.rels
│
└─_rels
        .rels

Compressing a docx file

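The files under word/media usually take up most of the space, so compression mainly targets them. docx does not currently support newer image formats such as JPEG XL, AVIF, or WebP 2, so the images have to stay in common formats such as jpg and png.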

1. Unzip

Change the extension to zip and the file can be extracted. The same can be done in code:

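import os
import shutil

import pyzipper


def unzip(file):
    # Extract next to the source file, into a folder named after it (minus ".docx").
    docname = file[0:-5]
    if os.path.exists(docname):
        print('os.path.exists! remove!')
        shutil.rmtree(docname)

    with pyzipper.PyZipFile(file, "r") as zf:
        zf.extractall(docname)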

2. Compress jpg, png, and similar files

These files can be compressed directly. Caesium gives good compression results; the code here calls its command-line tool.

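import logging
import subprocess


def compress_image(
    input_path: str,
    quality: int = 80
):
    # Pass the arguments as a list so paths containing spaces survive intact.
    command = ['caesiumclt.exe', '--same-folder-as-input',
               '--quality', str(quality), input_path]
    print(command)

    try:
        # check=True raises on a non-zero exit code; os.system would not.
        subprocess.run(command, check=True)
    except (OSError, subprocess.CalledProcessError) as e:
        logging.error(f"An error occurred: {str(e)}")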

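A quality of 50 is usually a good choice: the compressed images still look very good, and even 20 can give quite acceptable results. Even at a relatively high quality of 80, the files come out much smaller.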

3. Compress emf files

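emf files are usually large, and converting them to jpg or png typically makes them much smaller. ImageMagick can do the format conversion, after which the result can be compressed like any other image. Because the file extension changes from emf to jpg or png, the references in word/_rels/document.xml.rels must be updated to match.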
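The relationship update can be sketched as a plain text substitution on the rels file. This is a minimal sketch, not a full conversion pipeline: the function name retarget_media is hypothetical, it assumes the image targets are written as relative paths of the form media/<name>.emf, and it does not perform the image conversion itself. Note also that every extension used in the package needs a matching Default entry in [Content_Types].xml, so if the document contained no png or jpg before, that entry has to be added as well.

```python
import re


def retarget_media(rels_xml, old_ext="emf", new_ext="png"):
    """Rewrite Target="media/<name>.<old_ext>" attributes to use new_ext."""
    pattern = re.compile(
        r'(Target="media/[^"]+\.)' + re.escape(old_ext) + r'"',
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: m.group(1) + new_ext + '"', rels_xml)


# Shortened sample content; a real file carries full Type URIs and more entries.
SAMPLE_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId7" Type="image" Target="media/image7.emf"/>'
    '<Relationship Id="rId8" Type="image" Target="media/image8.png"/>'
    '</Relationships>'
)

if __name__ == "__main__":
    print(retarget_media(SAMPLE_RELS))
```

In practice the function would be applied to the text of word/_rels/document.xml.rels after the converted images have been written to word/media and the original emf files removed.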

4. Repack

Repack the modified files in their original layout, using ZIP_DEFLATED as the compression method.

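import os
import shutil

import pyzipper


# Note: this name shadows Python's built-in zip() within the script.
def zip(folder, zip_path):
    print('zip:', folder, ' -> ', zip_path)
    with pyzipper.PyZipFile(zip_path, "w", compression=pyzipper.ZIP_DEFLATED) as zf:
        for root, dirs, files in os.walk(folder):
            for file in files:
                abs_path = os.path.join(root, file)
                # Store paths relative to the folder so the archive layout matches the original.
                rel_path = os.path.relpath(abs_path, folder)
                zf.write(abs_path, rel_path)

    # Clean up the extracted working folder once the archive is written.
    shutil.rmtree(folder)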

5. Code for the whole process

import os


def compress_docx(indir, outdir):
    for root, dirs, files in os.walk(indir):
        for file in files:
            if file.endswith('.docx'):
                docfile = os.path.join(root, file)
                unzip(docfile)  # extracts into <root>/<docname>
                docname = file[0:-5]
                imgpath = os.path.join(root, docname, 'word/media/')
                compress_image(imgpath, 50)

                # Mirror the input directory tree under outdir.
                outfolder = os.path.join(outdir, os.path.relpath(root, indir))
                # makedirs also creates missing parent folders, which os.mkdir does not.
                os.makedirs(outfolder, exist_ok=True)
                zip(os.path.join(root, docname), os.path.join(outfolder, file))

Summary

Processed this way with quality=50, a docx file typically shrinks to roughly one third of its original size.