readthedocs.org: PDF not showing some traditional Chinese characters
Details
- Read the Docs project URL: https://readthedocs.org/projects/latex2img/
- Build URL (if applicable): https://readthedocs.org/projects/latex2img/builds/9837446/ https://readthedocs.org/projects/latex2img/downloads/pdf/latest/
Sorry to open this issue, but I have read a lot of the related issues and have googled a lot but still cannot get it fixed.
I am using the method in #5453 to build the LaTeX PDF for zh_TW. The local build is fine and everything looks as expected, but the remote build has some characters missing, e.g., “換”, “佈” (basically a funny PDF).
Here is the setting in conf.py.
latex_engine = 'xelatex'
latex_use_xindy = False
latex_elements = {
    'papersize': 'a4paper',
    'pointsize': '10pt',
    'preamble': r'''
\usepackage[UTF8]{ctex}
\usepackage{float}
\usepackage{graphicx}
\usepackage{indentfirst}
\setlength{\parindent}{2em}
''',
    'figure_align': 'H',
}
I have also tried dropping ctex and using just xeCJK with a few different fonts, but it still does not work.
By the way, the simplified Chinese translation is all correct (I use just xeCJK for it). Also, the HTML is fine with either language.
Expected Result
All traditional Chinese characters display correctly.
Actual Result
Some traditional Chinese characters, e.g., “換”, “佈”, are not displayed (the font used on the server does not have them?)
About this issue
- Original URL
- State: closed
- Created 5 years ago
- Comments: 23 (9 by maintainers)
Hi @humitos, thanks for replying. I dug into the ctex manual and think I might have located the cause (the default Fandol font).
In here, I would:
- […] ctex solution from the manual
- […] ctex repo
Detailed Comparisons
Some examples of missing characters are shown in the screenshots below. The missing characters are rendered as “F” in squares.
[Three screenshots, each labelled “Expected:”, are not reproduced here.]
I uploaded a local PDF build for your reference:
latex2img/remote_build_pdf_issues/latex2img_local_build.pdf
MWEs for reproducing this issue locally
Basically, you need to force the Fandol font, otherwise ctex would select fonts based on the OS (ctex manual page 6, table 3, section 4.3).
MWE using ctex:
“錄”, “換” and “註” are missing.
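A minimal sketch of what such a ctex MWE could look like, generated from a small Python helper so it is easy to recreate. This is not the original MWE from the thread: the file name `fandol_mwe.tex` and the test sentence are assumptions, and `fontset=fandol` is used on the assumption that it mimics what ctex selects on a Linux build server.

```python
from pathlib import Path

# Hypothetical output file name; any name works.
MWE_PATH = Path("fandol_mwe.tex")

# Assumption: forcing fontset=fandol reproduces the font selection that
# ctex makes when it detects neither Mac nor Windows.
MWE_SOURCE = r"""\documentclass[UTF8, fontset=fandol]{ctexart}
\begin{document}
Regular: 錄 換 註

\textbf{Bold: 錄 換 註}
\end{document}
"""


def write_mwe(path: Path = MWE_PATH) -> Path:
    """Write the test document to disk and return its path."""
    path.write_text(MWE_SOURCE, encoding="utf-8")
    return path


if __name__ == "__main__":
    tex_file = write_mwe()
    print(f"Wrote {tex_file}; compile it with: xelatex {tex_file}")
```

Compiling the generated file with xelatex and comparing the regular and bold lines should show whether the characters drop out with the Fandol fonts on a given machine.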
You can use xelatex with fandol to reproduce it as well:
Suspected Cause
I think this is because of the Fandol fonts, which ctex defaults to when it detects that the OS is neither Mac nor Windows (manual page 6, table 3).
This table basically says that, when using xelatex: on Mac OS X, it uses the HuaWen font family by default; on Windows Vista and later, it defaults to the ZhongYi family plus Microsoft YaHei; on Windows XP and earlier, it defaults to the ZhongYi family; on everything else, it defaults to the Fandol family.
The two MWEs above force the Fandol fonts and successfully reproduce the problem.
I think this may have something to do with the implementation of Fandol, especially how the bold fonts are implemented (real bold or not).
Possible Solution
I think there may be two solutions.
1. The ctex native solution. ctex allows users to set fonts explicitly. This is shown in Example 5 on page 7. It requires you to define documentclass options by setting [fontset = none] in the documentclass and then to set the font set again in the preamble.
2. Use the xeCJK package and use \setCJKmainfont{<available font>} to select the correct font.
Progress To Date
I tried method 1 by using 'extraclassoptions': r'fontset = none', and \ctexset{fontset = ubuntu}, but it still does not work (it actually got worse). The settings in conf.py are:
Summary So Far
- […] ctex defaulting to the Fandol font family (especially FandolHei for bold) on the remote server
- […] ctex or xeCJK to use FandolHei (ctex on Windows defaults to Microsoft YaHei and will not have this problem)
- […] ctex to use fontset = ubuntu made things worse (Ubuntu CJK fonts not installed?)
- […] ctex available on the server (slim chance?) or use xeCJK to force readily available CJK fonts
- […] Fandol seems to have quite a bit of them)
- […] ctex repo confirm that Fandol has quite a few TC fonts missing
Oops, I forgot to change the image name after copying.
Now everything works and the PDF output looks much better with the new set of fonts. Thank you for the effort!
Thanks for the prompt reply!
I found that there is an Ubuntu package fonts-noto-cjk (1:20170601+repack1-2) https://packages.ubuntu.com/source/bionic/fonts-noto-cjk which seems to be a repackaging of Google’s Noto CJK family. Since this is just Adobe’s Source CJK under a different name (which is the one I use locally), it may very well be able to display all Chinese characters (well, all those normal people use; the full Unicode set is a bit over the top), regardless of the regional differences.
I think for most users, substance comes before style, and therefore this may be the preferred solution, though some users may need serif and italic for their own reasons. Also, going fully sans kind of violates the typesetting customs of the Chinese language, but that should be much less of an issue.
Google’s distribution of Adobe’s Source CJK series is named “Noto CJK” for that reason 😄 (for the overwhelming majority anyway).
Agreed. I hope the user would at least understand a bit about the difficulty of typesetting CJK. Perhaps it should cover not just Chinese but also Japanese and Korean, which may make it quite complicated.
I think Japanese may even be more difficult to get right due to the mixture of Kanji (Chinese characters), Hiragana (consider them as lowercase phonetic syllabary) and Katakana (consider them as uppercase phonetic syllabary).
Korean should be consistent, since they have made quite an effort to get rid of the Chinese language after Japanese rule. Though they do use Chinese characters in some cases, these cases are quite limited.
Anyway, I do think Noto CJK should be able to solve the CJK character problem in most cases. It actually maxes out the number of characters that can be placed in an OTF. I just cannot say it is a silver bullet until I can verify it in full or see other reliable reports.
The thing with Noto, though, is that the filenames can vary depending on the distribution or platform (since it is open source, people are free to repackage it, hence the problem). This introduces an extra complication for \setCJKmainfont{}, since it needs the exact name.
For the PR, I would like to test a bit more so that it can be more definitive. I also hope to find a valid solution for all of CJK and not just TC and SC (help wanted!)
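On the point about exact font names: one way to see which names a given machine actually exposes is to ask fontconfig. The sketch below is only an illustration, assuming the `fc-list` command-line tool is installed; the helper names `parse_families` and `installed_families` are hypothetical and not part of any library.

```python
import subprocess


def parse_families(fc_list_output: str) -> list[str]:
    """Turn the output of `fc-list ... family` into sorted, unique names."""
    names = set()
    for line in fc_list_output.splitlines():
        # One font may report several comma-separated (localized) names.
        for name in line.split(","):
            name = name.strip()
            if name:
                names.add(name)
    return sorted(names)


def installed_families(lang: str = "zh-tw") -> list[str]:
    """Ask fontconfig for families that claim coverage of `lang`.

    Assumption: fontconfig's `fc-list` is available on PATH.
    """
    result = subprocess.run(
        ["fc-list", f":lang={lang}", "family"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_families(result.stdout)
```

Calling `installed_families("zh-tw")` on the build machine and on a local machine would show whether the same family name can be passed to \setCJKmainfont{} in both places.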
@blueset wow! I’m very happy reading that 😄 Thank you a lot for helping us debug this issue, make Read the Docs better, and improve our support for other fonts 😃
If anything is missing here, I’d say that we can improve our Documentation Guide mentioning how these fonts can be configured but I think we can close this issue now.
@blueset actually, 7.0 is our current testing image (as its name says, it’s for testing purposes only), and it would be awesome if you had some time to try these fonts from there and let us know whether it works 😃
@blueset THANKS, this is amazing!
I’ve already opened a PR to install the package fonts you mentioned in a previous comment.
I can’t guarantee that we are going to include these preambles by default on a Read the Docs build because they will probably need a lot of testing (and I’m not enough of an expert on this topic to manage it), but I’d like to add them as a suggestion in our current guide https://docs.readthedocs.io/en/stable/guides/pdf-non-ascii-languages.html or in an appendix to it.
I really appreciate the work that all of you have done in this topic and I hope we can manage in a better way all of these languages at Read the Docs 🌏
I have tested Simplified Chinese (zh-hans, zh_CN on RTD), Traditional Chinese (zh-trad, zh_TW on RTD) and Japanese (ja). I came up with the following config for each of the languages:
zh-hans
zh-hant (updated to solve # 2 below)
About
RTD currently doesn’t tell the Hong Kong and Taiwan variants of Traditional Chinese apart, so this portion would not contribute much. I would still leave it here in case anyone needs it. (updated to solve # 2 below)
zh-hant-hk
ja
To use with platex
If platex is still to be used instead of uplatex for whatever reason:
* A third font (like italics/Kai) is not needed in Japanese, as there is not much of a need for it in Japanese typesetting.
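As a rough illustration of the kind of per-language configuration meant here, a zh-hant setup in conf.py might look like the sketch below. It is not the configuration that was tested in this comment: the font family names are assumptions based on the Noto CJK families discussed in this thread, and they only work if those fonts are installed under exactly these names.

```python
# conf.py (sketch for zh-hant; font names are assumptions, not tested values)
latex_engine = "xelatex"
latex_use_xindy = False
latex_elements = {
    "preamble": r"""
\usepackage{xeCJK}
% Assumed family names; verify them on the build machine before relying on them.
\setCJKmainfont{Noto Serif CJK TC}
\setCJKsansfont{Noto Sans CJK TC}
\setCJKmonofont{Noto Sans Mono CJK TC}
""",
}
```

For zh-hans or ja, the same structure would presumably apply with the SC or JP variants of the families substituted.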
Demo and testing
I’m not sure if there is a programmatic way to test whether a PDF output contains the correct set of fonts for rendering. But for the sake of completeness, I have included a copy of the TeX source and PDF output I used to test these fonts.
- zh-hans [Source] [PDF]
- zh-hant [Source] [PDF] (updated to solve # 2 below)
- zh-hant-hk [Source] [PDF] (updated to solve # 2 below)
- ja (uptex) [Source] [PDF]
- ja (ptex) [Source] [PDF]
These PDFs are produced on a readthedocs/build:5.0 docker container with extra fonts installed.
Some decisions and points in doubt
1. ccmp in XeLaTeX.
As you might have seen in the samples above, there is a long sequence that reads “⿺辶⿳穴⿰月⿰⿲⿱幺長⿱言馬⿱幺長刂心” and doesn’t look like Chinese. This is a feature unique to the Noto CJK/Source Han typefaces, which replace the sequence with one (super complicated) character through the ccmp (Glyph Composition/Decomposition) GSUB feature. See this blog from Adobe for details. Some suggested that this works with XeLaTeX seemingly out of the box, but I cannot get it to work with my script. @iruletheworld, do you have any experience with this?
2. Punctuation style in zh-trad.
Punctuation style is set to “plain” for zh-trad due to awkward typesetting with the default settings. This shouldn’t be much of an issue for general use. This is potentially an issue with the xeCJK package. I have raised an issue there regarding this. (Resolved.)
3. uplatex instead of platex for ja.
According to sources [1] [2], uplatex is a variant of platex that supports Unicode (rather than the old JIS level 1). It thus works with a wider range of characters, including some “rare-but-not-so-rare” characters which often appear in names. This should be a drop-in change on the Sphinx level if the user has not defined configurations otherwise (but don’t quote me on that).
4. Although xeCJK supports multiple font weights, no extra effort is made on that, so as to align with the default behavior of other LaTeX setups. (otf+pxchfon comes with a simple option to enable multiple weights.)
Unfortunately there isn’t much I can research on the Korean usage of TeX as I don’t speak the language. It would be much appreciated if anyone from the Korean TeX community could contribute their opinions on this.
I’m always open to any suggestions and opinions on this, especially from TeX users, and our friends speaking Chinese/Japanese/Korean. Let me know if there is any question.
An (Imperfect) Solution
Ok, after much trial and error, I’ve found an acceptable solution, but only for zh-hant (TW), not for zh-hant (HK) or zh-hans (hence imperfect).
I now believe this is a font problem on the server, since switching to fonts available on Debian does work (to an extent). The available fonts I found are here.
If you go into the repo and use the tag zh-hant_TW_passed_1.0.0, you can examine the solution. I will explain the details below, including:
Root Cause of the Problem
I believe the root cause of the problem is the Fandol font, which ctex defaults to when the OS is neither Mac nor Windows. Fandol is quite incomplete, especially for traditional Chinese.
Why doesn’t ctex with fontset = ubuntu work either?
This is because with fontset = ubuntu, ctex would try to use the WenQuanYi family (fonts-wqy-zenhei and others). But this font family is no longer shipped with Ubuntu (e.g. 18.04). Therefore LaTeX will not be able to find the fonts it needs.
Ok, what font then?
Droid Sans Fallback works, but you don’t get serif with it (as it says on the tin already).
Droid Sans Fallback using xeCJK:
Droid Sans Fallback works with both zh-hant (TW) and zh-hant (HK).
But I DO want serif and Chinese italic (KaiTi, 楷体/楷體)
This is where the constraint comes in, as I have not found a serif font on Debian that supports all three of zh-hant (TW), zh-hant (HK) and zh-hans.
I only managed to get zh-hant (TW) working; zh-hant (HK) and zh-hans will have missing characters.
AR PL Mingti2L Big5 for the CJK main font, AR PL KaitiM Big5 for italic, and Droid Sans Fallback for sans (the AR fonts are from Arphic Technology, i.e., 文鼎, Wén Dǐng in Chinese pinyin):
Note that you must use zh-hant (TW) characters, otherwise some characters will be missing. For example, “爲” is HK, while “為” is TW, and “为” is the simplified version of them. I recommend opencc for translation.
So, until Ubuntu ships some really good Chinese fonts by default (e.g., the Noto Han/Source Han family, which gets installed if you add the Chinese language to Ubuntu), I am stuck (I am not a fan of the AR family. But I love the Noto Han/Source Han family).
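The zh-hant (TW) font assignment described above could be written in conf.py roughly as follows. This is a sketch rather than the exact settings from the repo: mapping the Kai face through xeCJK's ItalicFont option is an assumption about how the italic was wired up, and the family names must match what is installed on the build server.

```python
# conf.py (sketch for zh-hant (TW) with the Arphic fonts named above)
latex_engine = "xelatex"
latex_use_xindy = False
latex_elements = {
    "papersize": "a4paper",
    "pointsize": "10pt",
    "preamble": r"""
\usepackage{xeCJK}
% Serif (Ming) as the main face; Kai used where italic is requested (assumption).
\setCJKmainfont[ItalicFont={AR PL KaitiM Big5}]{AR PL Mingti2L Big5}
% Sans face without serif, as discussed above.
\setCJKsansfont{Droid Sans Fallback}
""",
    "figure_align": "H",
}
```

Because the Big5 fonts only cover zh-hant (TW) characters, source text containing HK or simplified forms would still show gaps with this setup.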
What about zh-hans (simplified Chinese)?
I tried the GB versions of the AR family and, surprisingly, they did not work (GB is “国标” (Guó Biāo), meaning “National Standard”, not “Great Britain”, lol). So, you may be stuck with Fandol. But many characters seem to be ok.
What about zh-hant (HK) then?
Well, could someone donate a font to Ubuntu? Maybe the HK gov. should do it? Lol. At the moment, you may be stuck with \setCJKmainfont{Droid Sans Fallback} and will lose all serif.
Why xeCJK instead of ctex?
xeCJK is newer and more flexible and needs less configuration.
Proposal to expand #5453
Since I ran into this trap (specific to the remote readthedocs.org PDF build), I propose to expand #5453 a bit. My proposal would use xeCJK instead of ctex.
- […] conf.py, use the following options for LaTeX
AR PL Mingti2L Big5 is the main font, as in serif/宋体/宋體/明體; AR PL KaitiM Big5 is the italic/KaiTi/楷体/楷體; Droid Sans Fallback is the sans serif/无衬线/無襯線.
- […] Fandol (the default on the readthedocs.org remote)
If not, add \setCJKmainfont{Droid Sans Fallback} under \usepackage{xeCJK}. You will lose all serif, but the characters will show in the remotely built PDF.
Conclusions on this stage
- […] Fandol
- […] Droid Sans Fallback (the AR family is not handsome though)
- […] Droid Sans Fallback but losing all serif
- xelatex with xeCJK is brilliant
I think I have more or less got to the bottom of this issue and it can be closed now.