readthedocs.org: PDF not showing some traditional Chinese characters

Details

Sorry to open this issue, but I have read a lot of the related issues and have googled a lot but still cannot get it fixed.

I am using the method in #5453 to build the LaTeX PDF for zh_TW. The local build is fine and everything is as expected, but the remote build has some characters missing, e.g., “換”, “佈” (basically a funny PDF).

Here is the setting in conf.py.

latex_engine = 'xelatex'
latex_use_xindy = False

latex_elements = {

    'papersize': 'a4paper',

    'pointsize': '10pt',

    'preamble': r'''

    \usepackage[UTF8]{ctex}

    \usepackage{float}

    \usepackage{graphicx}

    \usepackage{indentfirst}
    \setlength{\parindent}{2em}

    ''',

    'figure_align': 'H',
}

I have also tried dropping ctex and using just xeCJK with a few different fonts, but it still does not work.

By the way, the simplified Chinese translation is all correct (I use just xeCJK for it). Also, the HTML is fine with either language.

Expected Result

All traditional Chinese characters display correctly.

Actual Result

Some traditional Chinese characters, e.g., “換”, “佈”, are not displayed (the font used on the server does not have them?)

About this issue

  • State: closed
  • Created 5 years ago
  • Comments: 23 (9 by maintainers)

Most upvoted comments

Hi @humitos, thanks for replying. I dug into the ctex manual and think I may have located the cause (the default Fandol font).

Here, I will:

  • show the comparisons of local PDF and remote PDF with images
  • attach MWEs to reproduce the issue locally
  • quote possible ctex solution from the manual
  • report the progress on a fix to date (ongoing…)
  • summarise the findings so far and reference seemingly related issues from the ctex repo

Detailed Comparisons

Some examples of missing characters are shown in the screenshots below. The missing characters are rendered as “F” in squares.

  • Missing characters “換” and “佈” (title page)

Expected:

  • Missing character “說” (page 6)

Expected:

  • Missing characters “註” and “儘”

Expected:

I uploaded a local PDF build for your reference: latex2img/remote_build_pdf_issues/latex2img_local_build.pdf

MWEs for reproducing this issue locally

Basically, you need to force the Fandol font, otherwise ctex would select fonts based on the OS (ctex manual page 6, table 3, section 4.3).

MWE using ctex:

“錄”, “換” and “註” are missing.

\documentclass[fontset = none]{book}
\usepackage[UTF8]{ctex}
\ctexset{fontset = fandol}
\begin{document}

    Using \textbf{ctex} with \textbf{fontset = fandol} to reproduce the issue locally.\newline

    The 3 lines below are forced FandolHei (characters missing).\newline

    \textbf{目錄}\\

    \textbf{轉換}\\

    \textbf{備註}\\

\end{document}

You can use xelatex with fandol to reproduce it as well:

\documentclass[a4paper, 10pt]{article}

\usepackage{xeCJK}
% force FandolHei, which is causing problem
\setCJKmainfont[BoldFont=Source Han Serif TC]{FandolHei}

\begin{document}

Using \textbf{xelatex} and \textbf{xeCJK} with forced \textbf{FandolHei} to reproduce the issue locally.\newline

The 3 lines below are forced FandolHei (characters missing).\newline

目錄\\

轉換\\

備註\\

The 3 lines below are forced Source Han Serif TC (characters correct).\newline

\textbf{目錄}\\

\textbf{轉換}\\

\textbf{備註}\\

\end{document}
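
To try other fonts without hand-editing the file, the xeCJK MWE above could be generated from a small script. This is only a sketch: build_mwe and MWE_TEMPLATE are hypothetical names, and it assumes the font passed in is installed and visible to XeLaTeX.

```python
# Hypothetical helper: renders an xeCJK MWE for any font and word list.
MWE_TEMPLATE = r"""\documentclass[a4paper, 10pt]{article}

\usepackage{xeCJK}
\setCJKmainfont{%(font)s}

\begin{document}

%(body)s

\end{document}
"""


def build_mwe(font, words):
    """Return LaTeX source that typesets each word on its own line in `font`."""
    # "\\\\" in a normal Python string is the two-character LaTeX line break.
    body = "\n\n".join(word + "\\\\" for word in words)
    return MWE_TEMPLATE % {"font": font, "body": body}


if __name__ == "__main__":
    # Redirect the output to a .tex file and compile it with xelatex.
    print(build_mwe("FandolHei", ["目錄", "轉換", "備註"]))
```

Swapping "FandolHei" for another installed font name gives a quick side-by-side check of which fonts drop the test characters.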

Suspected Cause

I think this is because of the Fandol fonts, which ctex defaults to when it detects that the OS is neither Mac nor Windows (manual page 6, table 3).

This table basically says that, when using xelatex, ctex defaults to the HuaWen font family on Mac OS X; to the ZhongYi family plus Microsoft YaHei on Windows Vista and later; to the ZhongYi family on Windows XP and earlier; and to the Fandol family on everything else.
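
The defaults described in that table can be written down as a small lookup, which makes it easy to see why a Linux build server ends up with Fandol. This is a sketch only: ctex_default_family is a hypothetical name, the return values are the family names from the paragraph above rather than real ctex option names, and the OLD_WINDOWS release strings are an assumption about what platform.release() reports.

```python
import platform

# Assumption: release strings reported for "Windows XP and earlier".
OLD_WINDOWS = {"2000", "XP"}


def ctex_default_family(system, release=""):
    """Return the CJK font family ctex defaults to under XeLaTeX.

    `system` is a platform.system()-style string ("Darwin", "Windows", ...).
    `release` is a platform.release()-style string; only used on Windows.
    """
    if system == "Darwin":
        return "HuaWen"
    if system == "Windows":
        if release in OLD_WINDOWS:
            return "ZhongYi"
        return "ZhongYi + Microsoft YaHei"
    # Everything else, including Linux build servers, gets Fandol.
    return "Fandol"


if __name__ == "__main__":
    print(ctex_default_family(platform.system(), platform.release()))
```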

The two MWEs above try to force the Fandol fonts to reproduce the problem and have reproduced successfully.

I think this may have something to do with the implementation of Fandol, especially how its bold fonts are implemented (real bold or not).

Possible Solution

I think there may be two solutions.

  • The native ctex solution. ctex allows users to set fonts explicitly, as shown in Example 5 on page 7 of the manual. You set [fontset = none] in the \documentclass options and then select the font set with \ctexset in the preamble.
\documentclass[fontset = none]{ctexart}
\ctexset{fontset = founder}
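\begin{document}
% English: "Declare fontset = none in the document class options, then specify the fonts with \ctexset in the preamble."
在文档类选项中声明\verb|fontset = none|,随后在导言区用\verb|\ctexset|指定字体。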
\end{document}
  • Use the xeCJK package with \setCJKmainfont{<available font>} to select a font that has the required glyphs.

Progress To Date

I tried method 1 by using 'extraclassoptions': r'fontset = none' and \ctexset{fontset = ubuntu}, but it still does not work (it got worse, actually). The settings in conf.py are:

latex_engine = 'xelatex'
latex_use_xindy = False

latex_docclass = {
   'manual': 'ctexbook'
}

latex_elements = {

    'papersize': 'a4paper',

    'pointsize': '10pt',

    'extraclassoptions': r'fontset = none',

    'preamble': r'''

    \usepackage{ctex}
    \ctexset{fontset = ubuntu}

    \usepackage{float}

    \usepackage{graphicx}

    \usepackage{indentfirst}
    \setlength{\parindent}{2em}

    ''',

    'figure_align': 'H',
}

Summary So Far

  • the cause seems to be ctex defaulting to the Fandol font family (especially FandolHei for bold) on the remote server
  • this issue can be reproduced locally on a Windows machine by forcing ctex or xeCJK to use FandolHei (ctex on Windows defaults to Microsoft YaHei and will not have this problem)
  • forcing ctex to use fontset = ubuntu made things worse (ubuntu CJK fonts not installed?)
  • we should be able to solve this problem either by making the default Ubuntu CJK fonts used by ctex available on the server (slim chance?) or by using xeCJK to force readily available CJK fonts
  • this issue is specific to traditional Chinese, simplified Chinese seems ok (Fandol seems to have quite a bit of them)
  • traditional Chinese can be really difficult to get right since TW has its own set while HK also has its own set, and mainland China also has a new set (believe me…), and I am not counting other places like Japan, Singapore (usually some partly simplified traditionals, e.g. “麵” -> “麺”, note that the latter is the partly simplified version)
  • this issue and this one from the ctex repo confirm that Fandol is missing quite a few TC glyphs

Oops, I forgot to change the image name after copying.

Now everything works and the PDF output looks much better with the new set of fonts. Thank you for the effort!

Thanks for the prompt reply!

I still want to know if there is something that Read the Docs can do to help here and have a fully working PDF with all the Chinese characters (HK, TW and simplified). I understand that you can build the PDF perfectly on your local computer, so why can’t we on RTD?

I found that there is an Ubuntu package fonts-noto-cjk (1:20170601+repack1-2) https://packages.ubuntu.com/source/bionic/fonts-noto-cjk which seems to be a repackaging of Google’s Noto CJK family. Since this is just Adobe’s Source CJK under a different name (which is the one I use locally), it may very well be able to display all Chinese characters (well, all those normal people use; the full Unicode set is a bit over the top), regardless of the regional differences.

if the user can accept a sans only PDF, then use \setCJKmainfont{Droid Sans Fallback} with either ctex or xeCJK.

I understand that this seems the preferred way to suggest to our users, is that correct? At least it will have all the characters on their places and the PDF will build completely.

I think for most users, substance comes before style, and therefore this may be the preferred solution. Some users may still need serif and italic for their own reasons, though. Also, going fully sans kind of violates the typesetting customs of the Chinese language, but that should be much less of an issue.

to get rid of all the “tofu”, we need a “Noto” font family

Does this exist? If so, we can install it in our server and make your PDF happy 😄

Google’s distribution of Adobe’s Source CJK series is named “Noto CJK” for that reason 😄 (for the overwhelming majority anyway).

Since I ran into this trap (specific to remote readthedocs.org PDF build), I propose to expand #5453 a bit. My proposal would use xeCJK instead of ctex.

Would you feel comfortable making these changes yourself and opening a Pull Request? It seems that you have a ton of experience here and I’m sure you will update it way better than I would.

Although, if this setup is very complex or does not cover most of the cases, we may want to keep the “if the user can accept a sans only PDF” solution by default, but expand the guide with this more specific solution for these particular cases.

Agreed. I hope the user would at least understand a bit about the difficulty of typesetting CJK. Perhaps, it should cover not just Chinese but also Japanese and Korean, which may make it quite complicated.

I think Japanese may even be more difficult to get right due to the mixture of Kanji (Chinese characters), Hiragana (consider them as lowercase phonetic syllabary) and Katakana (consider them as uppercase phonetic syllabary).

Korean should be consistent, since they have made quite an effort to get rid of the Chinese language after Japanese rule. Though they do use Chinese characters in some cases, these cases are quite limited.

Anyway, I do think Noto CJK should be able to solve the CJK character problem in most cases. It actually maxes out the number of characters that can be placed in an OTF. I just cannot say it is a silver bullet until I can verify it in full or see other reliable reports.

The thing with Noto, though, is that the font names can vary depending on the distribution or platform (since it is open source, people are free to repackage it, hence the problem). This introduces an extra complication for \setCJKmainfont{} since it needs the exact name.
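
One way to cope with the varying names is to ask fontconfig which families are installed and pick the first candidate that matches. A minimal sketch, assuming the fc-list command is available on the build machine; parse_families, pick_font and installed_families are hypothetical helper names, and names containing special characters may appear escaped in the fc-list output, which this sketch does not handle.

```python
import subprocess


def parse_families(fc_list_output):
    """Collect family names from the output of `fc-list : family`.

    Each output line may hold several comma-separated names for one font.
    """
    families = set()
    for line in fc_list_output.splitlines():
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name)
    return families


def pick_font(candidates, installed):
    """Return the first candidate present in `installed`, or None."""
    for name in candidates:
        if name in installed:
            return name
    return None


def installed_families():
    """Query fontconfig for installed families (needs the fc-list command)."""
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True, check=True
    )
    return parse_families(result.stdout)
```

In conf.py this could be used as pick_font(["Source Han Serif TC", "Noto Serif CJK TC"], installed_families()), with the result substituted into \setCJKmainfont{} and a fallback chosen when it returns None.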

For the PR, I would like to test a bit more, so that it can be more definitive. I also hope to find a valid solution for all CJK and not just TC and SC (help wanted!)

@blueset wow! I’m very happy reading that 😄 Thank you a lot for helping us debug this issue, making Read the Docs better and improving our support for other fonts 😃

If anything is missing here, I’d say that we can improve our Documentation Guide mentioning how these fonts can be configured but I think we can close this issue now.

@blueset actually, 7.0 is our current testing image (as its name says, it’s for testing purposes only) and it would be awesome if you have some time to try these fonts from there and let us know that it works 😃

you can put build: image: testing in your configuration file to try it out

@blueset THANKS, this is amazing!

I’ve already opened a PR to install the package fonts you mentioned in a previous comment.

I can’t guarantee that we are going to include these preambles by default on a Read the Docs build because they will probably need a lot of testing (and I’m not enough of an expert on this topic to manage it) but I’d like to add them as a suggestion in our current guide https://docs.readthedocs.io/en/stable/guides/pdf-non-ascii-languages.html or an appendix of it.

I really appreciate the work that all of you have done in this topic and I hope we can manage in a better way all of these languages at Read the Docs 🌏

I have tested Simplified Chinese (zh-hans, zh_CN on RTD), Traditional Chinese (zh-trad, zh_TW on RTD) and Japanese (ja). I came up with the following config for each of the languages:

zh-hans

latex_elements = {
    "preamble": r"""
\usepackage[AutoFallBack=true]{xeCJK}
\setCJKmainfont{Noto Serif CJK SC}[Language=Chinese Simplified, BoldFont={* Bold}, ItalicFont=AR PL KaitiM GB]
\setCJKsansfont{Noto Sans CJK SC}[Language=Chinese Simplified, BoldFont={* Bold}, ItalicFont=AR PL KaitiM GB]
\setCJKmonofont{Noto Sans CJK SC}[Language=Chinese Simplified, BoldFont={* Bold}, ItalicFont=AR PL KaitiM GB]
\setCJKfallbackfamilyfont{\CJKrmdefault}[AutoFakeBold]{{HanaMinA},{HanaMinB}}
\setCJKfallbackfamilyfont{\CJKsfdefault}[AutoFakeBold]{{HanaMinA},{HanaMinB}}
\setCJKfallbackfamilyfont{\CJKttdefault}[AutoFakeBold]{{HanaMinA},{HanaMinB}}
"""
}

zh-hant (updated to solve # 2 below)

latex_elements = {
    "preamble": r"""
\usepackage[AutoFallBack=true]{xeCJK}
\setCJKmainfont{Noto Serif CJK TC}[Language=Chinese Traditional, BoldFont={* Bold}, ItalicFont=AR PL KaitiM Big5]
\setCJKsansfont{Noto Sans CJK TC}[Language=Chinese Traditional, BoldFont={* Bold}, ItalicFont=AR PL KaitiM Big5]
\setCJKmonofont{Noto Sans CJK TC}[Language=Chinese Traditional, BoldFont={* Bold}, ItalicFont=AR PL KaitiM Big5]
\setCJKfallbackfamilyfont{\CJKrmdefault}[AutoFakeBold]{{HanaMinA},{HanaMinB}}
\setCJKfallbackfamilyfont{\CJKsfdefault}[AutoFakeBold]{{HanaMinA},{HanaMinB}}
\setCJKfallbackfamilyfont{\CJKttdefault}[AutoFakeBold]{{HanaMinA},{HanaMinB}}
\xeCJKEditPunctStyle{quanjiao}{optimize-kerning=true}
"""
}

About zh-hant-hk

Since RTD currently doesn’t tell the Hong Kong and Taiwan variants of Traditional Chinese apart, this portion would not contribute much. I would still leave it here in case anyone needs it. (updated to solve # 2 below)

latex_elements = {
    "preamble": r"""
\usepackage[AutoFallBack=true]{xeCJK}
\setCJKmainfont{Noto Serif CJK TC}[Language=Chinese Traditional, BoldFont={* Bold}, ItalicFont=AR PL KaitiM Big5]  % Noto Serif CJK HK is not yet available in the Debian/Ubuntu package repository
\setCJKsansfont{Noto Sans CJK HK}[Language=Chinese Traditional, BoldFont={* Bold}, ItalicFont=AR PL KaitiM Big5]
\setCJKmonofont{Noto Sans CJK HK}[Language=Chinese Traditional, BoldFont={* Bold}, ItalicFont=AR PL KaitiM Big5]
\setCJKfallbackfamilyfont{\CJKrmdefault}[AutoFakeBold]{{HanaMinA},{HanaMinB}}
\setCJKfallbackfamilyfont{\CJKsfdefault}[AutoFakeBold]{{HanaMinA},{HanaMinB}}
\setCJKfallbackfamilyfont{\CJKttdefault}[AutoFakeBold]{{HanaMinA},{HanaMinB}}
\xeCJKEditPunctStyle{quanjiao}{optimize-kerning=true}
"""
}
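
The three Chinese configurations above differ only in the font names, the Language feature and the punctuation line, so they could be produced by one function in conf.py. This is a sketch under that assumption; xecjk_preamble and its parameter names are hypothetical, and the LaTeX it emits is copied from the preambles above rather than independently verified.

```python
# Fallback fonts shared by all three variants above.
FALLBACK = r"[AutoFakeBold]{{HanaMinA},{HanaMinB}}"


def xecjk_preamble(serif, sans, language, italic, punct_fix=False):
    """Build the xeCJK preamble shown above for one Chinese variant."""
    features = "[Language=%s, BoldFont={* Bold}, ItalicFont=%s]" % (language, italic)
    lines = [
        r"\usepackage[AutoFallBack=true]{xeCJK}",
        r"\setCJKmainfont{%s}%s" % (serif, features),
        r"\setCJKsansfont{%s}%s" % (sans, features),
        r"\setCJKmonofont{%s}%s" % (sans, features),
    ]
    for family in (r"\CJKrmdefault", r"\CJKsfdefault", r"\CJKttdefault"):
        lines.append(r"\setCJKfallbackfamilyfont{%s}%s" % (family, FALLBACK))
    if punct_fix:
        # Only the Traditional Chinese configs above carry this line.
        lines.append(r"\xeCJKEditPunctStyle{quanjiao}{optimize-kerning=true}")
    return "\n".join(lines) + "\n"


# Example: the zh-hant (TW) configuration.
latex_elements = {
    "preamble": xecjk_preamble(
        serif="Noto Serif CJK TC",
        sans="Noto Sans CJK TC",
        language="Chinese Traditional",
        italic="AR PL KaitiM Big5",
        punct_fix=True,
    )
}
```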

ja

latex_engine = "uplatex"  # works with platex as well
latex_elements = {
    "preamble": r"""
\usepackage[uplatex,deluxe]{otf}
\usepackage[noto-otc]{pxchfon}
"""
}

To use with platex

If platex is still to be used instead of uplatex for whatever reason:

latex_engine = "platex"  # use only if uplatex is not an option
latex_elements = {
    "preamble": r"""
\usepackage[deluxe]{otf}
\usepackage[noto-otc]{pxchfon}
"""
}

* A third font is not needed in Japanese (like italics/Kai) as not much of a need is seen in Japanese typesetting.
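
The two Japanese configurations differ only in the engine name and in the uplatex option passed to the otf package, which a small helper can make explicit. A sketch only: ja_latex_settings is a hypothetical name, and the preamble lines are taken from the configs above.

```python
def ja_latex_settings(engine="uplatex"):
    """Return (latex_engine, latex_elements) for Japanese, as configured above."""
    if engine not in ("uplatex", "platex"):
        raise ValueError("expected 'uplatex' or 'platex', got %r" % (engine,))
    # The configs above pass `uplatex` to the otf package only for uplatex.
    otf_options = "uplatex,deluxe" if engine == "uplatex" else "deluxe"
    preamble = (
        "\\usepackage[%s]{otf}\n"
        "\\usepackage[noto-otc]{pxchfon}\n" % otf_options
    )
    return engine, {"preamble": preamble}


# In conf.py:
latex_engine, latex_elements = ja_latex_settings("uplatex")
```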

Demo and testing

I’m not sure if there is a programmatic way to test if a PDF output contains the correct set of fonts for rendering. But for the sake of completeness, I have included a copy of TeX source, and PDF output I used to test these fonts.

These PDFs are produced on a readthedocs/build:5.0 docker container with extra fonts installed.
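
One partial, programmatic check is to scan the LaTeX log for missing-glyph warnings instead of inspecting the PDF. The sketch below assumes the engine writes lines of the form "Missing character: There is no <char> in font <name>!" (whether these appear depends on the \tracinglostchars setting), and that the line is not wrapped by the log's line-length limit. missing_characters is a hypothetical helper name.

```python
import re

# Assumed log format: "Missing character: There is no <char> in font <name>!"
MISSING_RE = re.compile(r"Missing character: There is no (\S+).*? in font (.+?)!")


def missing_characters(log_text):
    """Return a sorted list of unique (character, font) pairs reported missing."""
    found = set()
    for match in MISSING_RE.finditer(log_text):
        found.add((match.group(1), match.group(2)))
    return sorted(found)
```

A build script could read the .log file produced next to the PDF, pass its text to missing_characters, and fail the build when the list is not empty. This only catches glyphs the engine reports as missing; it says nothing about whether the chosen glyph is the right regional form.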

Some decisions and points in doubt

  1. OpenType feature ccmp in XeLaTeX.
    As you might have seen in the samples above, there is a long sequence that reads “⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心” and doesn’t look like Chinese. This is a feature unique to the Noto CJK/Source Han typefaces that replaces the sequence with one (super complicated) character through the ccmp (Glyph Composition/Decomposition) GSUB feature. See this blog from Adobe for details. Some suggested that this works with XeLaTeX seemingly out of the box, but I can’t get it to work in my script. @iruletheworld, do you have any experience with this?
  2. Punctuation style in zh-trad.
    Punctuation style is set to “plain” for zh-trad due to awkward typesetting with the default settings. This shouldn’t be much of an issue for general use. This is potentially an issue with the xeCJK package. I have raised an issue there regarding this. Resolved.
  3. Use uplatex instead of platex for ja.
    According to sources [1] [2], uplatex is a variant of platex that supports Unicode (rather than the old JIS level 1). It thus works with a wider range of characters, including some “rare-but-not-so-rare” characters which often appear in names. This should be a drop-in change on the Sphinx level if the user has not defined configurations otherwise (but don’t quote me on that).
  4. Multiple font weights.
    Although xeCJK supports multiple font weights, no extra effort is made on that here, so as to align with the default behavior of other LaTeX setups. (otf + pxchfon comes with a simple option to enable multiple weights.)

Unfortunately there isn’t much I can research on the Korean usages of TeX as I don’t speak their language. It would be much appreciated if anyone from the Korean TeX community can contribute their opinions on this.

I’m always open to any suggestions and opinions on this, especially from TeX users, and our friends speaking Chinese/Japanese/Korean. Let me know if there is any question.

An (imperfect) Solution

Ok, after much trial and error, I’ve found an acceptable solution, but only for zh-hant (TW), and for neither zh-hant (HK) nor zh-hans (therefore imperfect).

I now believe this is a font problem on the server, since changing to fonts available on Debian does work (to an extent). The available fonts I found are here.

If you go into the repo and use tag zh-hant_TW_passed_1.0.0 then you can examine the solution. I will explain the details below, including:

  • root cause of the problem
  • font problems with zh-hant (TW), zh-hant (HK) (they use different characters under the umbrella of traditional Chinese) and zh-hans
  • some proposed solutions
  • conclusions to date

Root Cause of the Problem

I believe the root cause of the problem is the Fandol font, which ctex defaults to when the OS is neither Mac nor Windows. Fandol is quite incomplete, especially for traditional Chinese.

Why doesn’t ctex with fontset = ubuntu work either?

This is because with fontset = ubuntu, ctex would try to use the WenQuanYi family (fonts-wqy-zenhei and others). But this font family is no longer shipped with Ubuntu (e.g. 18.04). Therefore LaTeX will not be able to find the fonts needed.

Ok, what font then?

The Droid Sans Fallback works, but you don’t have serif with it (as it says on the tin already).

  • Minimum setting with Droid Sans Fallback using xeCJK:
latex_engine = 'xelatex'
latex_use_xindy = False

latex_elements = {

    'papersize': 'a4paper',

    'pointsize': '10pt',

    'preamble': r'''

    \usepackage{xeCJK}
    \setCJKmainfont{Droid Sans Fallback}

    '''
}

Droid Sans Fallback works with both zh-hant (TW) and zh-hant (HK).

But I DO want serif and Chinese italic (KaiTi, 楷体/楷體)

This is where the constraint comes in, as I have not found a serif font on Debian that supports all three of zh-hant (TW), zh-hant (HK) and zh-hans.

I only managed to get zh-hant (TW) working; zh-hant (HK) and zh-hans will have missing characters.

  • The solution (imperfect): Use AR PL Mingti2L Big5 for the CJK main font, AR PL KaitiM Big5 for italic, and Droid Sans Fallback for sans (the AR fonts are from Arphic Technology, i.e., 文鼎, Wén Dǐng in Chinese pinyin):
latex_engine = 'xelatex'
latex_use_xindy = False

latex_elements = {

    'papersize': 'a4paper',

    'pointsize': '10pt',

    'preamble': r'''

    \usepackage{xeCJK}
    \setCJKmainfont{AR PL Mingti2L Big5}[ItalicFont = AR PL KaitiM Big5]
    \setCJKsansfont{Droid Sans Fallback}
    '''
}

Note that you must use zh-hant (TW) characters, otherwise some characters would be missing. For example, “爲” is HK, while “為” is TW, and “为” is the simplified version of them. I recommend opencc for translation.
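
Before building, the source text could be scanned for characters that are known not to be the TW form. The sketch below uses a tiny table built only from the example in the paragraph above; it is not a real conversion dictionary (a proper tool such as opencc should be used for actual translation), and TO_TW, flag_non_tw and to_tw are hypothetical names.

```python
# Illustrative table only: non-TW forms mapped to the TW form named above.
TO_TW = {"爲": "為", "为": "為"}


def flag_non_tw(text):
    """Return (index, found, suggested) for each character listed in TO_TW."""
    return [(i, ch, TO_TW[ch]) for i, ch in enumerate(text) if ch in TO_TW]


def to_tw(text):
    """Replace the characters listed in TO_TW with their TW forms."""
    return "".join(TO_TW.get(ch, ch) for ch in text)
```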

So, until Ubuntu ships some really good Chinese fonts by default (e.g., the Noto Han/Source Han family, which gets installed if you add the Chinese language to Ubuntu), I am stuck (I am not a fan of the AR family. But I love the Noto Han/Source Han family).

What about zh-hans (simplified Chinese)?

Surprisingly, I tried the GB versions of the AR family and they did not work (GB is “国标”, Guó Biāo, meaning “National Standard”, not “Great Britain”, lol). So, you may be stuck with Fandol. But many characters seem to be ok.

What about zh-hant (HK) then?

Well, could someone donate a font to Ubuntu? Maybe the HK gov. should do it? Lol. At the moment, you may be stuck with \setCJKmainfont{Droid Sans Fallback} and will lose all serif.

Why xeCJK instead of ctex?

xeCJK is newer and more flexible and needs fewer configs.

Proposal to expand #5453

Since I ran into this trap (specific to remote readthedocs.org PDF build), I propose to expand #5453 a bit. My proposal would use xeCJK instead of ctex.

  • For zh-hant (TW), in the conf.py, use the following options for Latex
latex_engine = 'xelatex'
latex_use_xindy = False

latex_elements = {

    'papersize': 'a4paper',

    'pointsize': '10pt',

    'preamble': r'''

    \usepackage{xeCJK}
    \setCJKmainfont{AR PL Mingti2L Big5}[ItalicFont = AR PL KaitiM Big5]
    \setCJKsansfont{Droid Sans Fallback}
    '''
}

AR PL Mingti2L Big5 is the main font as in serif/宋体/宋體/明體; AR PL KaitiM Big5 is the italic/KaiTi/楷体/楷體; Droid Sans Fallback is the sans serif/无衬线/無襯線.

  • For zh-hans, you may get away with Fandol (the default on the readthedocs.org remote build)
latex_engine = 'xelatex'
latex_use_xindy = False

latex_elements = {

    'papersize': 'a4paper',

    'pointsize': '10pt',

    'preamble': r'''

    \usepackage{xeCJK}
    '''
}

If not, add \setCJKmainfont{Droid Sans Fallback} under \usepackage{xeCJK}. You will lose all serif but the characters will show on the remote built PDF.

Conclusions at this stage

  • zh-hans seems to work all right with Fandol
  • zh-hant (TW) needs the AR family with Droid Sans Fallback (the AR family is not handsome though)
  • zh-hant (HK) can work with Droid Sans Fallback but loses all serif
  • Ubuntu, please ship your next version with Noto CJK/Source CJK
  • Local build? Use whatever you like. xelatex with xeCJK is brilliant

I think I more or less get to the bottom of this issue and it can be closed now.