readthedocs.org: PDF not showing some traditional Chinese characters
Details
- Read the Docs project URL: https://readthedocs.org/projects/latex2img/
- Build URL (if applicable): https://readthedocs.org/projects/latex2img/builds/9837446/ https://readthedocs.org/projects/latex2img/downloads/pdf/latest/
Sorry to open this issue, but I have read a lot of the related issues and have googled a lot but still cannot get it fixed.
I am using the method in #5453 to build the Latex PDF for zh_TW. The local build is fine, all things as expected, but the remote build has some characters missing, e.g., “換”, “佈” (basically a funny PDF).
Here are the settings in conf.py:
latex_engine = 'xelatex'
latex_use_xindy = False
latex_elements = {
    'papersize': 'a4paper',
    'pointsize': '10pt',
    'preamble': r'''
\usepackage[UTF8]{ctex}
\usepackage{float}
\usepackage{graphicx}
\usepackage{indentfirst}
\setlength{\parindent}{2em}
''',
    'figure_align': 'H',
}
I have tried not using ctex but just xeCJK with a few different fonts, but it is still not working.
By the way, the simplified Chinese translation is all correct (I use just xeCJK for it). Also, the HTML is fine with either language.
Expected Result
All traditional Chinese characters display correctly.
Actual Result
Some traditional Chinese characters, e.g., “換”, “佈”, are not displayed (the font used on the server does not have them?)
About this issue
- State: closed
- Created 5 years ago
- Comments: 23 (9 by maintainers)
Hi @humitos, thanks for replying. I dug into the ctex manual and think I might have located the cause (the default Fandol font).
In here, I would:
- ctex solution from the manual
- ctex repo
Detailed Comparisons
Some examples of missing characters are shown in the screenshots below. The missing characters are rendered as “F” in squares.
[Screenshots: comparisons against the expected rendering]
I uploaded a local PDF build for your reference:
latex2img/remote_build_pdf_issues/latex2img_local_build.pdf
MWEs for reproducing this issue locally
Basically, you need to force the Fandol font, otherwise ctex would select fonts based on the OS (ctex manual page 6, table 3, section 4.3).
MWE using ctex:
“錄”, “換” and “註” are missing.
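A minimal sketch of what such an MWE could look like, not the one originally posted. It assumes the `fontset=fandol` class option of ctex, and the helper name `write_mwe` and the file name are hypothetical. The Python wrapper only writes the TeX source to disk; compiling it with `xelatex` is a separate step.

```python
from pathlib import Path

# Hypothetical MWE: forces ctex's Fandol font set regardless of the OS.
# The bold line is included because the bold face is the suspected culprit.
MWE_TEX = r"""\documentclass[UTF8, fontset=fandol]{ctexart}
\begin{document}
錄 換 註

\textbf{錄 換 註}
\end{document}
"""


def write_mwe(path="fandol_mwe.tex"):
    """Write the MWE source to `path` and return the path object."""
    target = Path(path)
    target.write_text(MWE_TEX, encoding="utf-8")
    return target
```

After calling `write_mwe()`, running `xelatex fandol_mwe.tex` on a machine with the Fandol fonts installed should show whether these glyphs are missing.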
You can use xelatex with fandol to reproduce it as well:
Suspected Cause
I think this is because of the Fandol fonts, which ctex defaults to when it detects that the OS is neither Mac nor Windows (manual page 6, table 3).
This table basically says that, when using xelatex, ctex defaults to the HuaWen font family on Mac OS X; to the ZhongYi family plus Microsoft YaHei on Windows Vista and later; to the ZhongYi family on Windows XP and earlier; and to the Fandol family on everything else.
The two MWEs above force the Fandol fonts and reproduce the problem successfully.
I think this may have something to do with the implementation of Fandol, especially how bold fonts are implemented (real bold or not).
Possible Solution
I think there may be two solutions.
1. The native ctex solution. ctex allows users to set fonts explicitly. This is shown in Example 5 on page 7. The code requires you to define documentclass options, by setting [fontset = none] in the documentclass and then setting it again in the preamble.
2. Use the xeCJK package and \setCJKmainfont{<available font>} to select the correct font.
Progress To Date
I tried method 1 by using 'extraclassoptions': r'fontset = none', and \ctexset{fontset = ubuntu}; it is still not working (it got worse, actually). The settings in conf.py are:
Summary So Far
- ctex defaulting to the Fandol font family (especially FandolHei for bold) on the remote server
- ctex or xeCJK to use FandolHei (ctex on Windows defaults to Microsoft YaHei and will not have this problem)
- ctex to use fontset = ubuntu made things worse (Ubuntu CJK fonts not installed?)
- ctex available on the server (slim chance?) or use xeCJK to force readily available CJK fonts
- Fandol seems to have quite a bit of them)
- ctex repo confirm that Fandol has quite a few TC glyphs missing

Oops, I forgot to change the image name after copying.
Now everything works and the PDF output looks much better with the new set of fonts. Thank you for the effort!
Thanks for the prompt reply!
I found that there is an Ubuntu package, fonts-noto-cjk (1:20170601+repack1-2), https://packages.ubuntu.com/source/bionic/fonts-noto-cjk, which seems to be a repackaging of Google’s Noto CJK family. Since this is just Adobe’s Source CJK under a different name (which is the one I use locally), it may very well be able to display all Chinese characters (well, all those normal people use; the full Unicode set is a bit over the top), regardless of the regional differences.
I think for most users substance comes before style, and therefore this may be the preferred solution, though some users may need serif and italic for their own reasons. Also, going fully sans somewhat violates the typesetting customs of the Chinese language, but that should be much less of an issue.
Google’s distribution of Adobe’s Source CJK series is named “Noto CJK” for that reason 😄 (for the overwhelming majority anyway).
Agreed. I hope the user would at least understand a bit about the difficulty of typesetting CJK. Perhaps it should cover not just Chinese but also Japanese and Korean, which may make it quite complicated.
I think Japanese may even be more difficult to get right due to the mixture of Kanji (Chinese characters), Hiragana (consider them as lowercase phonetic syllabary) and Katakana (consider them as uppercase phonetic syllabary).
Korean should be consistent, since they have made quite an effort to get rid of the Chinese language after Japanese rule. Though they do use Chinese characters in some cases, these cases are quite limited.
Anyway, I do think Noto CJK should be able to solve the CJK character problem in most cases. It actually maxes out the number of characters that can be placed in an OTF file. I just cannot say it is a silver bullet until I can verify it in full or see other reliable reports.
The thing with Noto, though, is that the filenames can vary depending on the distribution or platform (since it is open source, people are free to repackage it, hence the problem). This introduces an extra complication for \setCJKmainfont{}, since it needs the exact name.
For the PR, I would like to test a bit more so that it can be more definitive. I also hope to find a valid solution for all of CJK and not just TC and SC (help wanted!)
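Since the exact font name matters, one way to see which names a given machine exposes is to ask fontconfig. The sketch below assumes the `fc-list` command is available on the build machine and that its `family` output is one line per font with comma-separated names. The function names are hypothetical, and this was not run on the Read the Docs build image.

```python
import shutil
import subprocess


def parse_fc_list(output):
    """Turn `fc-list ... family` output into a sorted list of unique family names."""
    families = set()
    for line in output.splitlines():
        # A line may hold several comma-separated names (e.g. localized ones).
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name)
    return sorted(families)


def installed_cjk_families(lang="zh"):
    """List font families that fontconfig reports as covering `lang`."""
    if shutil.which("fc-list") is None:
        return []  # fontconfig is not installed on this machine
    result = subprocess.run(
        ["fc-list", f":lang={lang}", "family"],
        capture_output=True, text=True, check=True,
    )
    return parse_fc_list(result.stdout)
```

Any name returned by `installed_cjk_families()` is a candidate argument for \setCJKmainfont{} on that machine; the list may differ between a local setup and the remote build.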
@blueset wow! I’m very happy reading that 😄 --Thank you a lot for helping us debugging this issue and make Read the Docs better and improve our support with other fonts 😃
If anything is missing here, I’d say that we can improve our Documentation Guide mentioning how these fonts can be configured but I think we can close this issue now.
@blueset actually, 7.0 is our current testing image (as its name says, it’s for testing purposes only), and it would be awesome if you have some time to try these fonts from there and let us know whether it works 😃
@blueset THANKS, this is amazing!
I’ve already opened a PR to install the package fonts you mentioned in a previous comment.
I can’t guarantee that we are going to include these preambles by default on a Read the Docs build because they will probably need a lot of testing (and I’m not enough of an expert on this topic to manage it), but I’d like to add them as a suggestion in our current guide https://docs.readthedocs.io/en/stable/guides/pdf-non-ascii-languages.html or in an appendix to it.
I really appreciate the work that all of you have done in this topic and I hope we can manage in a better way all of these languages at Read the Docs 🌏
I have tested Simplified Chinese (zh-hans, zh_CN on RTD), Traditional Chinese (zh-trad, zh_TW on RTD) and Japanese (ja). I came up with the following config for each of the languages:
zh-hans
zh-hant (updated to solve # 2 below)
About
RTD currently doesn’t tell the Hong Kong and Taiwan variants of Traditional Chinese apart, so this portion would not contribute much. I would still leave it here in case anyone needs it. (updated to solve # 2 below)
zh-hant-hk
ja
To use with platex
If platex is still to be used instead of uplatex for whatever reason:
* A third font (like italics/Kai) is not needed in Japanese, as there is not much of a need for it in Japanese typesetting.
Demo and testing
I’m not sure if there is a programmatic way to test if a PDF output contains the correct set of fonts for rendering. But for the sake of completeness, I have included a copy of TeX source, and PDF output I used to test these fonts.
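One partial check, which is a different approach from inspecting the PDF itself, is to scan the XeLaTeX log for "Missing character" warnings. The sketch below assumes the log contains lines of the form `Missing character: There is no X in font Y!`. Whether they appear depends on the \tracinglostchars setting, and long log lines may be wrapped, so treat this as a rough aid only. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Matches TeX's lost-character warning; group 1 is the character, group 2 the font.
MISSING = re.compile(r"Missing character: There is no (.) in font (.+?)!")


def missing_glyphs(log_text):
    """Map each font named in the log to the sorted characters it lacked."""
    found = defaultdict(set)
    for char, font in MISSING.findall(log_text):
        found[font].add(char)
    return {font: sorted(chars) for font, chars in found.items()}
```

Typical use would be `missing_glyphs(open("project.log", encoding="utf-8", errors="replace").read())`, where `project.log` stands for whatever log file the build produced. An empty result does not prove the PDF is correct, only that no such warning was logged.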
- zh-hans [Source] [PDF]
- zh-hant [Source] [PDF] (updated to solve # 2 below)
- zh-hant-hk [Source] [PDF] (updated to solve # 2 below)
- ja (uptex) [Source] [PDF]
- ja (ptex) [Source] [PDF]
These PDFs are produced on a readthedocs/build:5.0 docker container with extra fonts installed.
Some decisions and points in doubt
1. ccmp in XeLaTeX.
As you might have seen in the samples above, there is a long sequence that reads “⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心” and doesn’t look like Chinese. This is a feature unique to the Noto CJK/Source Han typefaces, which replace the sequence with one (super complicated) character through the ccmp (Glyph Composition/Decomposition) GSUB feature. See this blog from Adobe for details. Some suggested that this works with XeLaTeX seemingly out of the box, but I can’t get it to work with my script. @iruletheworld, do you have any experience with this?
2. Punctuation style in zh-trad.
Punctuation style is set to “plain” for zh-trad due to awkward typesetting with the default settings. This shouldn’t be much of an issue for general use. This is potentially an issue with the xeCJK package. I have raised an issue there regarding this. Resolved.
3. uplatex instead of platex for ja.
According to sources [1] [2], uplatex is a variant of platex that supports Unicode (rather than the old JIS level 1). It thus works with a wider range of characters, including some “rare-but-not-so-rare” characters that often appear in names. This should be a drop-in change at the Sphinx level if the user has not defined configurations otherwise (but don’t quote me on that).
4. Although xeCJK supports multiple font weights, no extra effort is made on that, so as to align with the default behavior of other LaTeX setups. (otf + pxchfon comes with a simple option to enable multiple weights.)
Unfortunately there isn’t much I can research on the Korean usage of TeX, as I don’t speak the language. It would be much appreciated if anyone from the Korean TeX community could contribute their opinions on this.
I’m always open to any suggestions and opinions on this, especially from TeX users, and our friends speaking Chinese/Japanese/Korean. Let me know if there is any question.
An (imperfect) Solution
Ok, after much trial and error, I’ve found an acceptable solution, but only for zh-hant (TW) and for neither zh-hant (HK) nor zh-hans (therefore imperfect).
I now believe this is a font problem on the server, since changing to fonts available on Debian does work (to an extent). The available fonts I found are here.
If you go into the repo and use tag zh-hant_TW_passed_1.0.0, you can examine the solution. I will explain the details below, including:
Root Cause of the Problem
I believe the root cause of the problem is the Fandol font, which ctex defaults to when the OS is neither Mac nor Windows. Fandol is quite incomplete, especially for traditional Chinese.
Why doesn’t ctex with fontset = ubuntu work either?
This is because with fontset = ubuntu, ctex would try to use the WenQuanYi family (fonts-wqy-zenhei and others). But this font family is no longer shipped with Ubuntu (e.g. 18.04). Therefore LaTeX will not be able to find the fonts it needs.
Ok, what font then?
The Droid Sans Fallback works, but you don’t have serif with it (as it says on the tin already).
Droid Sans Fallback using xeCJK:
Droid Sans Fallback works with both zh-hant (TW) and zh-hant (HK).
But I DO want serif and Chinese italic (KaiTi, 楷体/楷體)
This is where the constraint comes in, as I have not found a serif font on Debian that supports all three of zh-hant (TW), zh-hant (HK) and zh-hans.
I only managed to get zh-hant (TW) working; zh-hant (HK) and zh-hans will have missing characters.
AR PL Mingti2L Big5 for the CJK main font, AR PL KaitiM Big5 for italic, and Droid Sans Fallback for sans (AR fonts are from Arphic Technology, i.e., 文鼎, Wén Dǐng in Chinese pinyin):
Note that you must use zh-hant (TW) characters, otherwise some characters would be missing. For example, “爲” is HK, while “為” is TW, and “为” is the simplified version of them. I recommend opencc for translation.
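The font assignment described above could be expressed in conf.py roughly as follows. This is a sketch, not the exact configuration from the repo: the helper `xecjk_preamble` is hypothetical, the font names are those mentioned in the text, and it assumes xeCJK's \setCJKmainfont (with its ItalicFont option) and \setCJKsansfont commands.

```python
def xecjk_preamble(main, italic=None, sans=None):
    """Build a LaTeX preamble that selects CJK fonts through xeCJK."""
    # ItalicFont maps \textit / \emph on CJK text to a KaiTi-style face.
    options = f"[ItalicFont={{{italic}}}]" if italic else ""
    lines = [
        r"\usepackage{xeCJK}",
        rf"\setCJKmainfont{options}{{{main}}}",
    ]
    if sans:
        lines.append(rf"\setCJKsansfont{{{sans}}}")
    return "\n".join(lines) + "\n"


# Sketch of conf.py settings for zh-hant (TW); xeCJK requires the XeLaTeX engine.
latex_engine = "xelatex"
latex_elements = {
    "preamble": xecjk_preamble(
        "AR PL Mingti2L Big5",
        italic="AR PL KaitiM Big5",
        sans="Droid Sans Fallback",
    ),
}
```

Calling `xecjk_preamble("Droid Sans Fallback")` instead gives the sans-only fallback, which trades serif and italic for wider character coverage.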
So, until Ubuntu ships some really good Chinese fonts by default (e.g., the Noto Han/Source Han family, which gets installed if you add the Chinese language to Ubuntu), I am stuck (I am not a fan of the AR family. But I love the Noto Han/Source Han family).
What about zh-hans (simplified Chinese)?
Surprisingly, I tried the GB versions of the AR family and they did not work (GB is “国标” (Guó Biāo), meaning “National Standard”, not “Great Britain”, lol). So, you may be stuck with Fandol. But many characters seem to be ok.
What about zh-hant (HK) then?
Well, could someone donate a font to Ubuntu? Maybe the HK gov. should do it? Lol. At the moment, you may be stuck with \setCJKmainfont{Droid Sans Fallback} and will lose all serif.
Why xeCJK instead of ctex?
xeCJK is newer and more flexible and needs fewer configs.
Proposal to expand #5453
Since I ran into this trap (specific to the remote readthedocs.org PDF build), I propose to expand #5453 a bit. My proposal would use xeCJK instead of ctex.
- conf.py, use the following options for LaTeX
- AR PL Mingti2L Big5 is the main font, as in serif/宋体/宋體/明體; AR PL KaitiM Big5 is the italic/KaiTi/楷体/楷體; Droid Sans Fallback is the sans serif/无衬线/無襯線.
- Fandol (default to on readthedocs.org remote)
If not, add \setCJKmainfont{Droid Sans Fallback} under \usepackage{xeCJK}. You will lose all serif, but the characters will show in the remotely built PDF.
Conclusions on this stage
- Fandol
- Droid Sans Fallback (the AR family is not handsome though)
- Droid Sans Fallback but losing all serif
- xelatex with xeCJK is brilliant
I think I have more or less got to the bottom of this issue and it can be closed now.