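Exploring Sumatra (sumatra-master/vs2019)

Introduction

PDF is a container file format that stores vector line drawings and JPEG images across multiple pages. In BERV, the table of contents is read from a PDF file produced by a scanner, and the image data of the required pages is extracted.
By contrast, a PDF file exported from CAD software consists of line drawings, and this part is Flate-compressed. Building the decompression step from scratch using only the specification as a reference would take considerable effort.
The SumatraPDF project publishes its source code in a form that builds under VS2019. I therefore set up VS2019, built the project, and traced execution in the debugger to work out the processing steps.

Background

In the BERV project, a primitive PDF input program was written as part of W2 (creating texture images from wall inspection results). Its purpose is to take a PDF file made up of multiple JPEG images, holding drawings read by a scanner, extract the individual images, and write them out to separate files.
By contrast, a PDF file exported from CAD software has line-drawing data packed into it. One conceivable implementation is to decode the line-drawing data and convert it into, for example, LINE commands in the LSSG format.
The line-drawing data is written, for example, as follows: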
/Fo 24 Tf
1 0 0 1 70 770 Tm
(Line) Tj
50 650 m 150 750 l S
ET
Text data of this kind is stored as compressed binary data.
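As a rough illustration of the conversion idea, the sketch below walks over an already decompressed content stream and collects straight segments from the m (moveto) and l (lineto) operators. The names segment and extract_segments are hypothetical, the LSSG output format itself is not modeled, and string operands containing spaces are not handled; this is a minimal sketch under those assumptions, not the method used by mupdf.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One straight segment recovered from a content stream (hypothetical type). */
typedef struct { double x1, y1, x2, y2; } segment;

/*
 * Scan whitespace-separated tokens. Numbers are pushed onto a small operand
 * stack; "m" sets the current point and "l" emits a segment from the current
 * point. Every other operator simply discards its operands.
 * Returns the number of segments written to out (at most max_out).
 */
size_t extract_segments(const char *content, segment *out, size_t max_out)
{
    double stack[8];
    size_t depth = 0, count = 0;
    double cur_x = 0.0, cur_y = 0.0;
    int have_point = 0;
    char *copy, *tok;

    assert(content != NULL && out != NULL);
    copy = malloc(strlen(content) + 1);   /* strtok modifies its input */
    if (copy == NULL)
        return 0;
    strcpy(copy, content);

    for (tok = strtok(copy, " \t\r\n"); tok != NULL; tok = strtok(NULL, " \t\r\n")) {
        char *end;
        double v = strtod(tok, &end);
        if (end != tok && *end == '\0') {          /* numeric operand */
            if (depth < 8)
                stack[depth++] = v;
            continue;
        }
        if (depth >= 2 && strcmp(tok, "m") == 0) { /* moveto */
            cur_x = stack[depth - 2];
            cur_y = stack[depth - 1];
            have_point = 1;
        } else if (depth >= 2 && have_point && strcmp(tok, "l") == 0) { /* lineto */
            if (count < max_out) {
                out[count].x1 = cur_x;
                out[count].y1 = cur_y;
                out[count].x2 = stack[depth - 2];
                out[count].y2 = stack[depth - 1];
                count++;
            }
            cur_x = stack[depth - 2];
            cur_y = stack[depth - 1];
        }
        depth = 0;                                  /* operator consumed its operands */
    }
    free(copy);
    return count;
}
```

Applied to the sample above, the only segment found is the one from (50, 650) to (150, 750); each such segment would then map to one LINE command.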

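Course of the Work

① First, reviewed the PDF file processing program in the BERV project (pdf.cpp, 161002)
② Created a new project, PDF, as an external function that converts a PDF file into an LSSG file
③ Used kdb/BERV/mito/R階平面図.pdf as sample data
④ Wrote the compressed vector data portion out as a binary file
⑤ Set up VS2008 (disc edition) on the W8 machine and built the downloaded zlib
⑥ Set up VS2019 (free edition) on the W8 machine and built the downloaded sumatra-pdf project
⑦ Downloaded the mupdf executable and confirmed that it displays the test data
The functions of the mupdf library are named pdf_xxxxx.
It is built as mupdf.lib and referenced as a static library from the Sumatra-PDF project.
mupdf alone builds a simple executable that can display PDF files.
⑧ Investigated the call stack leading to fopen
⑨ Investigated the call stack from the start of rendering down to inflate
⑩ Investigated the processing that follows fopen and analyzes the overall structure of the PDF file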

File-Opening Process (2-13, 2-41)

The Sumatra project does not use MFC. In the main source file, SumatraPDF.cpp, the path runs through:
LRESULT CALLBACK WinProcFrame(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp)
static UINT_PTR CALLBACK FileOpenHook()
From case CmdOpen, OnMenuOpen(win)
→LoadDocument(LoadArgs &args)
→LoadDocIntoCurrentTab(args, ctrl, nullptr) →Synchronizer::Create() [PdfSync.cpp l=124]
...
engine->ExtractPage
...
[EngineFzUtil.cpp] memo 2-41
fz_open_file2(fz_context* ctx, const WCHAR* filepath)
[FileUtil.cpp]
file::ReadFileWithAllocator (1-4)
→FILE* OpenFILE(const WCHAR* path) [FileUtil.cpp]
→_wfopen(path, L"rb")

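Properties of the Sample PDF File

Opened the R-floor plan (R階平面図) in SumatraPDF-3.2-64.exe and displayed its properties (2-10)
Author: mu112593
Created: 161125 18:15:47
Modified: same as above
PDF producer: PDF Printer/www.bullzip.com/CP/Freeware Edition
PDF version: 1.5  File size: 46.87KB (47,998B)
Number of pages: 1
Page size: 21.0 * 29.7 cm (A4)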

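Analyzing the Overall File Structure (2-61)

After the file is opened, its overall structure is analyzed and the list of pages is obtained.
The function doc = pdf_new_document(ctx, file) [pdf-xref.c] analyzes the overall structure of the file.
The function pdf_read_start_xref(ctx, doc) [same file] locates the startxref keyword near the end of the file, which points to the cross-reference table.

Reference chain (0-12), from the top:
Trailer dictionary (at the end of the file)
Document catalog (/Root)
Page tree (/Pages)
Page object (/Kids)
Font (/Resources); stream (/Contents)
R floor: trailer section (0-11)
xref (keyword indicating that a cross-reference table follows)
0 6 (objects 0 through 5)
0000000000 65535 f (free entry)
0000000015 00000 n (in-use entry: byte offset 15, generation 0)
0000000066 00000 n
0000000223 00000 n
0000000125 00000 n
0000000329 00000 n
Each entry is always 20 bytes long, including the line break.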
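Because every entry has the same fixed layout (a 10-digit offset, a space, a 5-digit generation number, a space, the letter n or f, and a two-character line ending), one entry can be decoded with a single sscanf call. The following is a minimal sketch; xref_entry and parse_xref_entry are hypothetical names and are not taken from mupdf.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Decoded cross-reference entry (hypothetical type for this sketch). */
typedef struct {
    long offset;     /* byte offset of the object; for free entries, the next free object number */
    int  generation; /* generation number */
    char type;       /* 'n' = in use, 'f' = free */
} xref_entry;

/*
 * Parse one fixed-width entry such as "0000000015 00000 n \n".
 * Returns 1 on success, 0 if the line does not look like an entry.
 */
int parse_xref_entry(const char *line, xref_entry *out)
{
    long offset;
    int gen;
    char type;

    assert(line != NULL && out != NULL);
    if (strlen(line) < 18)                 /* 10 + 1 + 5 + 1 + 1 characters at minimum */
        return 0;
    if (sscanf(line, "%10ld %5d %c", &offset, &gen, &type) != 3)
        return 0;
    if (type != 'n' && type != 'f')
        return 0;

    out->offset = offset;
    out->generation = gen;
    out->type = type;
    return 1;
}
```

With the table above, the second line decodes to offset 15, generation 0, type n. Since entries are 20 bytes each, entry k of a subsection starts 20*k bytes after the first one, which allows random access without scanning.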

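Rendering

The compressed content of each page is decompressed at the stage where the page is displayed (rendered) (2-23)
RenderCacheThread
→EnginePdf::ExtractPageText
→pdf_run_page_contents_with_usage [pdf-run.c]
→pdf_process_contents [pdf-interpret.c]
→pdf_process_stream
→pdf_process_keyword
...
→pdf_load_embedded_font (for example)

Three Patterns That Call inflate

① The call chain from open_contents_stream
② The call chain from load_image_stream
③ The call chain from open_stream_number

How inflate [ext\zlib\inflate.c] Is Called (2-19, 2-48)

The LOAD() macro copies the stream state into register variables to speed up processing.
inflate is a state machine: a single call keeps processing until the input is exhausted or the output is full, and the function is called repeatedly.
When it is called again and input and output are possible, it advances to the next state.
The NEEDBITS(n) macro makes sure the accumulator holds at least n bits, and decides whether to continue or to return from inflate.
The DROPBITS(n) macro discards the low n bits from the accumulator.
The INITBITS() macro clears the accumulator to zero.
BITS(n) returns the low n bits of the accumulator, so BITS(8) yields the next byte of the stream.
PULLBYTE() reads one byte of input if any is available.
And so on.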
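The bit-accumulator idea behind these macros can be shown with ordinary functions. The sketch below is a simplified stand-in, not zlib's code: bit_reader, need_bits, peek_bits and drop_bits are hypothetical names, and where zlib's macros return from inflate() when input runs out, need_bits here just reports failure to its caller.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified bit reader (hypothetical; zlib keeps these in local variables). */
typedef struct {
    const unsigned char *next_in; /* next input byte */
    size_t avail_in;              /* input bytes remaining */
    unsigned long hold;           /* bit accumulator, filled from the low end */
    unsigned bits;                /* number of valid bits in hold */
} bit_reader;

/* Like NEEDBITS(n): pull bytes until n bits are buffered; 0 means "out of input". */
int need_bits(bit_reader *br, unsigned n)
{
    assert(br != NULL && n <= 16);
    while (br->bits < n) {
        if (br->avail_in == 0)
            return 0;                       /* caller must come back with more input */
        br->hold += (unsigned long)(*br->next_in++) << br->bits; /* like PULLBYTE() */
        br->avail_in--;
        br->bits += 8;
    }
    return 1;
}

/* Like BITS(n): return the low n bits without consuming them. */
unsigned peek_bits(const bit_reader *br, unsigned n)
{
    return (unsigned)(br->hold & ((1UL << n) - 1));
}

/* Like DROPBITS(n): discard the low n bits. */
void drop_bits(bit_reader *br, unsigned n)
{
    assert(n <= br->bits);
    br->hold >>= n;
    br->bits -= n;
}
```

For the common zlib header bytes 0x78 0x9C, reading 16 bits and peeking at the low 4 gives 8 (the deflate method), and the next 4 bits give 7, which corresponds to the 15-bit window.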

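In pdf_lex, on lex_white, lex_byte is called.
lex_byte is a macro that calls fz_read_byte, a function defined in stream.h.

pdf_process_stream

By the time this function is called, the stream has already been expanded.
stm->rp holds the decompressed, expanded code, which pdf_lex interprets repeatedly: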
0 12 0 0 0.12 0 0 cm\n
/R7 gs \n
4 w\n
0 0 0 RG \n
2365 5445 67\n
Tokens identified by pdf_lex (2-24)
PDF_TOK_CLOSE_ARRAY
PDF_TOK_REAL
PDF_TOK_INT
PDF_TOK_STRING
PDF_TOK_EOF
PDF_TOK_KEYWORD

PDF_TOK_ENDSTREAM
PDF_TOK_EOF
PDF_TOK_OPEN_ARRAY
PDF_TOK_OPEN_DICT
PDF_TOK_NAME
PDF_TOK_INT
PDF_TOK_REAL
PDF_TOK_STRING
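Internal processing of pdf_lex(ctx, f, buf)
→ stores the recognized string in buf and returns
_ (whitespace) is white
% is a comment
/ is PDF_TOK_NAME
<< opens a dictionary, PDF_TOK_OPEN_DICT
< is a hex_string
>> closes a dictionary
[ opens an array
] closes an array
{ is PDF_TOK_OPEN_BRACE
} is PDF_TOK_CLOSE_BRACE
[0-9] is a number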
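The dispatch on the leading character can be sketched as a single switch. The enum and the classify function below are hypothetical and cover only the cases listed above; the real pdf_lex goes on to read the whole token into buf, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>

/* Token classes for this sketch (hypothetical; not mupdf's pdf_token enum). */
typedef enum {
    TOK_EOF, TOK_WHITE, TOK_COMMENT, TOK_NAME,
    TOK_OPEN_DICT, TOK_CLOSE_DICT, TOK_HEX_STRING,
    TOK_OPEN_ARRAY, TOK_CLOSE_ARRAY,
    TOK_OPEN_BRACE, TOK_CLOSE_BRACE,
    TOK_NUMBER, TOK_KEYWORD, TOK_ERROR
} tok_kind;

/* Decide what kind of token starts at s by looking at one or two characters. */
tok_kind classify(const char *s)
{
    assert(s != NULL);
    switch (s[0]) {
    case '\0':
        return TOK_EOF;
    case ' ': case '\t': case '\r': case '\n': case '\f':
        return TOK_WHITE;
    case '%':
        return TOK_COMMENT;
    case '/':
        return TOK_NAME;
    case '<':   /* "<<" opens a dictionary, a single "<" starts a hex string */
        return s[1] == '<' ? TOK_OPEN_DICT : TOK_HEX_STRING;
    case '>':   /* only ">>" is meaningful at token start */
        return s[1] == '>' ? TOK_CLOSE_DICT : TOK_ERROR;
    case '[':
        return TOK_OPEN_ARRAY;
    case ']':
        return TOK_CLOSE_ARRAY;
    case '{':
        return TOK_OPEN_BRACE;
    case '}':
        return TOK_CLOSE_BRACE;
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
    case '+': case '-': case '.':
        return TOK_NUMBER;
    default:    /* operators such as cm, gs, w, RG */
        return TOK_KEYWORD;
    }
}
```

For the decompressed sample shown earlier, "/R7 gs" classifies as a name, "2365 5445 67" as a number, and "cm" as a keyword.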

After stream

When "stream" is detected, pdf_lex calls the pdf_token_from_keyword function [pdf-lex.c].
Then pdf_process_contents→pdf_open_contents_stream→pdf_open_image_stream follows.
Inside pdf_open_image_stream, the stream is opened through the call chain
→pdf_open_filter
→(fz_stream*)fz_open_image_decomp_stream
→fz_open_flated(ctx,tail,15)
At this stage, stm is still NULL.
stm is set inside the pdf_process_contents function by
stm=pdf_open_contents_stream(). If the object is a stream, the call is passed on to pdf_open_image_stream() [pdf-stream.c].
In addition,
pdf_open_filter() [pdf-stream.c] →fz_keep_stream(ctx,rstm) [stream-open.c]
is chained on. From here it connects to fz_keep_imp [context.h] and prepares to read what follows in the uncompressed stream.
Next,
pdf_process_stream [pdf-interpret.c]
fz_read_all(ctx,stm,len) [stream-read.c]
creates a buffer and reads in the leading portion.
After that, stm->next(ctx,stm,max) reads the rest.
The next member is a pointer to a function; the implementation that gets called is
next_flated(ctx,stm,required) [mupdf/source/fitz/filter-flate.c]
Inside it,
z_stream *zp = &state->z;
code = inflate(zp,Z_SYNC_FLUSH);
is executed and decompression takes place (inflate is the zlib function [inflate.c]).
From then on, analysis proceeds while decompressing incrementally.

pdf_lex() reads one character at a time using the lex_byte(ctx,f) function.
lex_byte is a macro that calls fz_read_byte(C,S). fz_read_byte is defined as a function in stream.h.
c=stm->next(ctx,stm,1);
This next member is a function pointer that has been set in advance.
At the point of the call, it is set to the next_flated function [mupdf/source/fitz/filter-flate.c].
This function calls
code = inflate(zp, Z_SYNC_FLUSH).

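What inflate(strm, flush) [ext\zlib\inflate.c] Does

strm->next_in points to the binary (compressed) input
strm->avail_in is the number of input bytes available
After processing,
strm->next_out points into the output buffer, which receives the text
strm->avail_out is the free space remaining in the output buffer
After this function is called, the first character of the output buffer is taken and returned to pdf_lex.
Inside the inflate function, an infinite for(;;) loop runs.
Inside the loop, processing switches on strm->state->mode.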
/* Possible inflate modes between inflate() calls */
typedef enum {
HEAD = 16180, /* i: waiting for magic header */
FLAGS, /* i: waiting for method and flags (gzip) */
TIME, /* i: waiting for modification time (gzip) */
OS, /* i: waiting for extra flags and operating system (gzip) */
EXLEN, /* i: waiting for extra length (gzip) */
EXTRA, /* i: waiting for extra bytes (gzip) */
NAME, /* i: waiting for end of file name (gzip) */
COMMENT, /* i: waiting for end of comment (gzip) */
HCRC, /* i: waiting for header crc (gzip) */
DICTID, /* i: waiting for dictionary check value */
DICT, /* waiting for inflateSetDictionary() call */
TYPE, /* i: waiting for type bits, including last-flag bit */
TYPEDO, /* i: same, but skip check to exit inflate on new block */
STORED, /* i: waiting for stored size (length and complement) */
COPY_, /* i/o: same as COPY below, but only first time in */
COPY, /* i/o: waiting for input or output to copy stored block */
TABLE, /* i: waiting for dynamic block table lengths */
LENLENS, /* i: waiting for code length code lengths */
CODELENS, /* i: waiting for length/lit and distance code lengths */
LEN_, /* i: same as LEN below, but only first time in */
LEN, /* i: waiting for length/lit/eob code */
LENEXT, /* i: waiting for length extra bits */
DIST, /* i: waiting for distance code */
DISTEXT, /* i: waiting for distance extra bits */
MATCH, /* o: waiting for output space to copy string */
LIT, /* o: waiting for output space to write literal */
CHECK, /* i: waiting for 32-bit check value */
LENGTH, /* i: waiting for 32-bit length (gzip) */
DONE, /* finished check, done -- remain here until reset */
BAD, /* got a data error -- remain here until reset */
MEM, /* got an inflate() memory error -- remain here until reset */
SYNC /* looking for synchronization bytes to restart inflate() */
} inflate_mode;
In the R-floor sample data, the switch goes through
HEAD
TYPE
TABLE
LEN (repeated many times)
in that order.
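The for(;;) loop with a switch on the saved mode is what lets inflate stop when input or output runs out and resume later in the same state. The toy decoder below imitates only that control structure: its format (one magic byte 0x78, one length byte, then that many payload bytes) and all names are invented for illustration and have nothing to do with the real deflate format.

```c
#include <assert.h>
#include <stddef.h>

/* Modes of the toy decoder (hypothetical; far fewer than zlib's inflate_mode). */
typedef enum { ST_HEAD, ST_LEN, ST_COPY, ST_DONE, ST_BAD } toy_mode;

typedef struct {
    const unsigned char *next_in;  size_t avail_in;   /* input cursor */
    unsigned char       *next_out; size_t avail_out;  /* output cursor */
    toy_mode mode;                                     /* survives between calls */
    size_t remaining;                                  /* payload bytes still to copy */
} toy_stream;

/*
 * Returns 1 when the stream is finished, 0 when more input or output space
 * is needed (call again after refilling), -1 on bad data.
 */
int toy_inflate(toy_stream *s)
{
    assert(s != NULL);
    for (;;) {
        switch (s->mode) {
        case ST_HEAD:                       /* waiting for the magic byte */
            if (s->avail_in == 0) return 0;
            if (*s->next_in != 0x78) { s->mode = ST_BAD; break; }
            s->next_in++; s->avail_in--;
            s->mode = ST_LEN;
            break;
        case ST_LEN:                        /* waiting for the length byte */
            if (s->avail_in == 0) return 0;
            s->remaining = *s->next_in++;
            s->avail_in--;
            s->mode = ST_COPY;
            break;
        case ST_COPY:                       /* waiting for input or output to copy */
            if (s->remaining == 0) { s->mode = ST_DONE; break; }
            if (s->avail_in == 0 || s->avail_out == 0) return 0;
            *s->next_out++ = *s->next_in++;
            s->avail_in--; s->avail_out--; s->remaining--;
            break;
        case ST_DONE:
            return 1;
        default:                            /* ST_BAD: stay here until reset */
            return -1;
        }
    }
}
```

If the caller supplies only part of the input, the function returns 0 with mode left at ST_COPY; supplying the rest and calling again continues the copy where it stopped, which mirrors how next_flated can call inflate repeatedly.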
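
The stm Structure

pdf_lex(fz_context *ctx, fz_stream *f, pdf_lexbuf *buf); [pdf-lex.c]
typedef struct fz_stream fz_stream; [stream.h l.22]
The forward declaration of fz_stream in stream.h, which is what the VS2019 editor finds through "Go to Declaration", is only a self-referential typedef.
The actual body is the following, declared in the same stream.h (l.236).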
struct fz_stream
{
int refs;
int error;
int eof;
int progressive;
int64_t pos;
int avail;
int bits;
unsigned char *rp, *wp;
void *state;
fz_stream_next_fn *next;
fz_stream_drop_fn *drop;
fz_stream_seek_fn *seek;
};

The declaration of z_stream is in zlib.h (below).
typedef struct z_stream_s {
z_const Bytef *next_in; /* next input byte */
uInt avail_in; /* number of bytes available at next_in */
uLong total_in; /* total number of input bytes read so far */

Bytef *next_out; /* next output byte will go here */
uInt avail_out; /* remaining free space at next_out */
uLong total_out; /* total number of bytes output so far */

z_const char *msg; /* last error message, NULL if no error */
struct internal_state FAR *state; /* not visible by applications */

alloc_func zalloc; /* used to allocate the internal state */
free_func zfree; /* used to free the internal state */
voidpf opaque; /* private data object passed to zalloc and zfree */

int data_type; /* best guess about the data type: binary or text
for deflate, or the decoding state for inflate */
uLong adler; /* Adler-32 or CRC-32 value of the uncompressed data */
uLong reserved; /* reserved for future use */
} z_stream;
The stm structure is built in pdf_process_contents [pdf-interpret.c].
Inside this function,
stm = pdf_open_contents_stream(ctx, doc, stmobj);
is called. The pdf_open_contents_stream function [pdf-stream.c] examines ctx and obj and calls
pdf_open_object_array if it is an array,
pdf_open_image_stream [pdf-stream.c] if it is a stream,
and pdf_open_memory otherwise.

pdf_open_image_stream calls the static function build_filter [pdf-stream.c].
It determines the type of the stream and, when decompression is needed, calls fz_open_image_decomp_stream(ctx, chain, params, NULL); [compressed-buffer.c].
If the compression type is FILTER_FLATE,
fz_open_flated(ctx, tail, 15); [filter-flate.c] is executed.
This in turn delegates to
fz_new_stream(ctx, state, next_flated, close_flated); [stream-open.c].
That function allocates the fz_stream memory block and then assigns the next function it received as an argument.
(End)

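Build Configuration and Dependencies

This section explains the relationship among zlib, which performs compression and decompression; mupdf, which handles the PDF data structures; and SumatraPDF, which makes up the application.
zlib is not referenced directly from the SumatraPDF project. It is referenced indirectly through mupdf.
The zlib build produces a static library named zlib.
The mupdf build references zlib and also references zlib.h and zconf.h as external dependencies.
zlib.h includes zconf.h.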
Copyright (C) 1995-2017 Jean-loup Gailly and Mark Adler
The other libraries that mupdf depends on are freetype, gumbo, harfbuzz, jbig2dec, lcms2, libjpeg-turbo, mujs, and openjpeg.
The function prototype declarations in zlib.h use a macro named OF.
In ZEXTERN const char * ZEXPORT zlibVersion OF((void)); OF(args) is replaced by either args or ().
The declaration thus becomes zlibVersion(void) or zlibVersion().
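The effect of the macro is easy to reproduce in isolation. The sketch below defines OF in the same spirit as zconf.h (keep the argument list on compilers that understand prototypes, drop it otherwise) and uses it on two made-up functions; demo_version and demo_add are hypothetical and the exact conditions zconf.h tests are simplified here.

```c
#include <assert.h>   /* only needed for assert-based checks by the caller */
#include <string.h>

/*
 * Same idea as zconf.h: with prototype support, OF((int a)) expands to
 * (int a); on a pre-ANSI compiler it would expand to ().
 * The double parentheses pass the whole argument list as ONE macro argument.
 */
#ifndef OF
#  if defined(__STDC__) || defined(__cplusplus) || defined(_MSC_VER)
#    define OF(args) args
#  else
#    define OF(args) ()
#  endif
#endif

/* Hypothetical declarations written in the zlib.h style. */
const char *demo_version OF((void));
int demo_add OF((int a, int b));

/* Definitions matching the expanded prototypes. */
const char *demo_version(void)
{
    return "1.0-demo";
}

int demo_add(int a, int b)
{
    return a + b;
}
```

On any current compiler the declarations expand to full prototypes, so a call such as demo_add(2, 3) is type-checked as usual.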
The following functions are declared.
zlibVersion deflate deflateEnd inflate inflateEnd deflateSetDictionary deflateGetDictionary deflateCopy deflateReset deflateParams deflateTune deflateBound deflatePending deflatePrime deflateSetHeader inflateSetDictionary inflateGetDictionary inflateSync inflateCopy inflateReset inflateReset2 inflatePrime inflateMark inflateGetHeader inflateBack inflateBackEnd zlibCompileFlags compress compress2 compressBound uncompress uncompress2 gzdopen gzbuffer gzsetparams gzread gzwrite gzfwrite gzprintf gzputs gzgets gzputc gzgetc gzungetc gzflush gzrewind gzeof gzdirect gzclose gzclose_r gzclose_w gzerror gzclearerr adler32 adler32_z crc32 crc32_z deflateInit_ inflateInit_ deflateInit2_ inflateInit2_ inflateBackInit_ gzgetc_ gzopen gzseek gztell gzoffset adler32_combine crc32_combine zError inflateSyncPoint get_crc_table inflateUndermine inflateValidate inflateCodesUsed inflateResetKeep deflateResetKeep gzopen_w gzvprintf

Tracing inflateInit2 and inflateEnd

①-1 pdf_process_contents
There is nesting inside this.
②-1 pdf_load_image_stream
③-1 pdf_load_embedded_cmap
②-2
③-2
②-3
③-3
(the ①-1 cycle ends here)
①-2
①-3
In nested decompression, the outer decompression starts in pdf_process_contents [pdf-interpret.c], where the calls under pdf_open_contents_stream execute inflateInit2.
Control returns once to the level of pdf_process_contents, steps forward from there, and calls from pdf_process_stream down to the level of pdf_load_simple_font.
From there: inflateInit2, the inflate loop, inflateEnd.

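Rendering

The string of commands obtained by decompression is rendered into a bitmap.
Inside the pdf_process_stream function [pdf-interpret.c], the pdf_lex loop runs.
Numbers (integers and floating-point values), commands, and so on are accumulated on a stack:
cst->stack
For example, c (curve to) takes the XY coordinates of three points as arguments;
the coordinates of the three points are pushed onto the stack first, and then the command follows.

Tracing how objects are actually created shows that inside
static void pdf_run_c(fz_context *ctx, pdf_processor *proc, float x1, float y1, float x2, float y2, float x3, float y3) [pdf-op-run.c],
the fz_curveto function [path.c] is called. This function executes: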
push_cmd(ctx, path, FZ_CURVETO);
push_coord(ctx, path, x1, y1);
push_coord(ctx, path, x2, y2);
push_coord(ctx, path, x3, y3);
The second argument, path, is pdf_processor *proc ->path.
A bitmap is presumably generated and drawn onto (unconfirmed).
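The push_cmd / push_coord pattern amounts to keeping two parallel arrays, one of command codes and one of coordinates. The sketch below reproduces that idea with fixed-size arrays; path_buf, path_init and path_curveto are hypothetical names, and mupdf's real fz_path grows dynamically and handles more cases than shown here.

```c
#include <assert.h>
#include <stddef.h>

enum { CMD_CURVETO = 'C' };                  /* command code (arbitrary value for this sketch) */
enum { MAX_CMDS = 64, MAX_COORDS = 256 };    /* fixed capacity, an assumption of the sketch */

/* A path as two parallel arrays: commands and their coordinates. */
typedef struct {
    unsigned char cmds[MAX_CMDS];
    float coords[MAX_COORDS];
    size_t cmd_len;
    size_t coord_len;
} path_buf;

void path_init(path_buf *p)
{
    assert(p != NULL);
    p->cmd_len = 0;
    p->coord_len = 0;
}

static void push_cmd(path_buf *p, unsigned char cmd)
{
    p->cmds[p->cmd_len++] = cmd;
}

static void push_coord(path_buf *p, float x, float y)
{
    p->coords[p->coord_len++] = x;
    p->coords[p->coord_len++] = y;
}

/* Append a cubic Bezier segment: one command followed by three points. Returns 0 if full. */
int path_curveto(path_buf *p, float x1, float y1, float x2, float y2, float x3, float y3)
{
    assert(p != NULL);
    if (p->cmd_len + 1 > MAX_CMDS || p->coord_len + 6 > MAX_COORDS)
        return 0;
    push_cmd(p, CMD_CURVETO);
    push_coord(p, x1, y1);
    push_coord(p, x2, y2);
    push_coord(p, x3, y3);
    return 1;
}
```

A later pass can walk cmds in order and consume the matching number of coordinates for each command, which is the point at which line or curve output (for example LSSG commands) could be generated instead of pixels.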

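Integrating into the PDF External Function

First, a lex() function was written, pdf-lex.c was added to the build, and the related functions were added one by one.
All of the lcms2-related functions (color handling, under ext/lcms2/src) were added.
The arguments passed to pdf_lex,
fz_context *ctx,
fz_stream *f,
pdf_lexbuf *buf
must be correctly initialized memory blocks; otherwise the program crashes.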

fz_context *ctx

ctx is a member of EnginePDF. See the section on the stm structure.
The call chain that builds ctx starts from
static Controller* CreateControllerForFile(const WCHAR* path, PasswordUI* pwdUI, WindowInfo* win) {
[SumatraPDF.cpp], inside which
EngineBase* engine = CreateEngine(path, pwdUI, chmInFixedUI, ebookInFixedUI);
is executed.
EnginePdf* engine = new EnginePdf();
if (!engine->Load(path, pwdUI)) {
... [EnginePDF.cpp l.1910]
creates the engine from the file (described later).

The ctx member of the EnginePDF class is initialized to nullptr at construction.
Inside the LoadFromStream function, it is initialized along the call chain of
pdf_document* doc = pdf_open_document_with_stream(ctx, stm);
and within that chain, at EnginePdf.cpp (l.1909),
(1) engine = new EnginePdf();
(2) engine->Load(path, pwdUI);
are executed. The engine->Error member is initialized inside (1).
Specifically, at mupdf/fitz/context.c (l.159),
ctx = fz_new_context(nullptr, &fz_locks_ctx, FZ_STORE_DEFAULT);
This is a macro, and the actual fz_new_context_imp [context.c (l.159)] is called.
In the same source file, the fz_init_error_context function initializes the error member.

fz_stream *f

struct fz_stream
{
int refs;
int error;
int eof;
int progressive;
int64_t pos;
int avail;
int bits;
unsigned char *rp, *wp;
void *state;
fz_stream_next_fn *next;
fz_stream_drop_fn *drop;
fz_stream_seek_fn *seek;
};
Hooking the next member of this structure up to fread makes it possible to parse a file.
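One way to picture this is a cut-down stream whose next callback refills a buffer with fread. The sketch below is an assumption-laden stand-in rather than mupdf code: file_stream keeps only the rp, wp, state and next members, drops the fz_context argument, and embeds its own buffer; file_stream_open and file_stream_read_byte are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

#define STREAM_EOF (-1)
#define FILE_BUF_SIZE 4096

/* Cut-down stream (hypothetical): unread bytes are those in [rp, wp). */
typedef struct file_stream file_stream;
struct file_stream {
    unsigned char *rp, *wp;
    void *state;                                      /* here: the FILE* */
    int (*next)(file_stream *stm, size_t required);   /* refill callback */
    unsigned char buf[FILE_BUF_SIZE];
};

/* Refill the buffer from the file and return the first new byte, or STREAM_EOF. */
static int next_file(file_stream *stm, size_t required)
{
    FILE *fp = stm->state;
    size_t n = fread(stm->buf, 1, sizeof stm->buf, fp);

    (void)required;                 /* this sketch always reads a full buffer */
    if (n == 0)
        return STREAM_EOF;
    stm->rp = stm->buf;
    stm->wp = stm->buf + n;
    return *stm->rp++;
}

/* Attach an already opened file (binary mode) to the stream. */
void file_stream_open(file_stream *stm, FILE *fp)
{
    assert(stm != NULL && fp != NULL);
    stm->rp = stm->wp = stm->buf;   /* empty buffer forces a refill on first read */
    stm->state = fp;
    stm->next = next_file;
}

/* Same shape as a read-byte helper: serve from the buffer, else call next. */
int file_stream_read_byte(file_stream *stm)
{
    assert(stm != NULL && stm->next != NULL);
    if (stm->rp != stm->wp)
        return *stm->rp++;
    return stm->next(stm, 1);
}
```

A lexer that only ever calls the read-byte helper does not need to know whether the bytes come straight from a file, as here, or from a decompressing filter placed in front of it.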

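Aside: Public and Private Keys

When a public key and a private key are used together,
anyone can encrypt with the public key and only I can decrypt with the private key (https communication);
I encrypt with the private key and anyone can decrypt with the public key (digital signature).

Encoding Method

Let the public key be (E, N) and the private key be (D, N).
C = P^E mod N produces the ciphertext C.
P = C^D mod N recovers the plaintext P.

Key Generation

Prepare two primes p and q (for example, 17 and 23).
N = p*q = 391
Find integers k1 and k2 such that k1 * k2 = n(p-1)(q-1)+1, and use them as the private key and the public key (for example, n=2 gives k1=15, k2=47, since 2*16*22+1 = 705 = 15*47).

Explanation

For any integer M, M^(n(p-1)(q-1)+1) ≡ M (mod N) holds.
This follows from Fermat's little theorem.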
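The example numbers can be checked with a few lines of modular exponentiation. The sketch below takes E = 15 and D = 47 with N = 391 (which of the pair is public is an arbitrary choice here), and the function names are hypothetical. Keys this small are for illustration only and offer no security.

```c
#include <assert.h>

/*
 * Modular exponentiation by repeated squaring: returns base^exp mod n.
 * n is limited to 16 bits so that every product fits in unsigned long.
 */
unsigned long mod_pow(unsigned long base, unsigned long exp, unsigned long n)
{
    unsigned long result = 1;

    assert(n > 1 && n <= 65535UL);
    base %= n;
    while (exp > 0) {
        if (exp & 1UL)
            result = (result * base) % n;   /* multiply in this bit's power */
        base = (base * base) % n;           /* square for the next bit */
        exp >>= 1;
    }
    return result;
}

/* Key values from the worked example: 15 * 47 = 705 = 2 * 16 * 22 + 1. */
enum { KEY_N = 391, KEY_E = 15, KEY_D = 47 };

/* C = P^E mod N */
unsigned long rsa_encrypt(unsigned long p)
{
    return mod_pow(p, KEY_E, KEY_N);
}

/* P = C^D mod N */
unsigned long rsa_decrypt(unsigned long c)
{
    return mod_pow(c, KEY_D, KEY_N);
}
```

For instance, the plaintext 2 encrypts to 2^15 mod 391 = 315, and decrypting 315 gives 2 again; the same round trip holds for every value from 0 to 390.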

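Supplement

The scheme based on this method is called RSA encryption.
There are also encryption methods based on elliptic curves.

Japanese Input

When ATOK is used in VS2019, the conversion candidates are either not displayed or displayed at a distant location.
The same applies when the IME is used.
The position of the pop-up for Japanese conversion candidates cannot be changed. A short while after romaji is typed, a selection menu marked with あ in a box pops up. The space key selects a candidate, and associative conversions are displayed.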
