pulsar: [Doc][improve] Auto-detect broken links and integrate into CI process
Search before asking
- I searched in the issues and found nothing similar.
What issue do you find in Pulsar docs?
After #17495 and #17599, we can see many broken links in the Pulsar documentation, and more links may break as the docs evolve further.
What is your suggestion?
In #17599 I used a script, but it was not reliable enough, so I had to check the list of links manually. I suggest developing a reliable script to auto-detect incorrect links in the Pulsar documentation. Maybe we can integrate this script into the CI process that runs on documentation changes.
Before development, I’d like to enumerate all kinds of broken links.
1. wrong markdown file reference
For example, the link on this page:

The markdown content is [configuration](reference-configuration.md), but the reference-configuration.md file does not exist.
2. 404 URL path
For example, the link on this page:

The markdown content is [type](/api/client/index.html?org/apache/pulsar/client/api/CompressionType.html), but the Pulsar site doesn’t have this path.
3. confusing URL path
For example, the link on this page:
The markdown content is [Pulsar Functions CLI](/tools/pulsar-admin/), but this refers to a confusing page:

4. invalid title anchor
We can use # to refer to a specific block of HTML this way: [dataDir](reference-configuration.md#zookeeper-dataDir). It would be better if our script could detect invalid anchors as well.
Our script should be able to detect these broken links and print warning messages to users.
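As a rough illustration of kinds 1 and 4, here is a minimal Python sketch. It is not the script from #17599. The names `find_broken_links` and `slugify` are hypothetical, the docs are passed in as a flat in-memory mapping of file name to content, and the heading-to-anchor rule is a simplified assumption that may differ from what the site generator actually produces.

```python
import re

# Inline markdown links such as [text](target); images (![...](...)) are ignored.
LINK_RE = re.compile(r"(?<!!)\[[^\]]*\]\(([^)\s]+)\)")
# ATX headings such as "## Title".
HEADING_RE = re.compile(r"^#{1,6}[ \t]+(.*?)[ \t]*$", re.MULTILINE)


def slugify(heading):
    """Assumed, simplified anchor rule: lowercase, drop punctuation, spaces to hyphens."""
    text = re.sub(r"[^\w\- ]", "", heading.strip().lower())
    return text.replace(" ", "-")


def find_broken_links(docs):
    """docs maps a markdown file name to its content.

    Returns a list of (file, target, reason) tuples for kinds 1 and 4.
    """
    anchors = {
        name: {slugify(h) for h in HEADING_RE.findall(body)}
        for name, body in docs.items()
    }
    problems = []
    for name, body in docs.items():
        for target in LINK_RE.findall(body):
            if target.startswith(("http://", "https://", "/", "mailto:")):
                continue  # kinds 2 and 3 need the built site; skipped here
            path, _, fragment = target.partition("#")
            path = path or name  # a bare "#anchor" points into the current file
            if not path.endswith(".md"):
                continue
            if path not in docs:
                problems.append((name, target, "missing file"))
            elif fragment and fragment not in anchors[path]:
                problems.append((name, target, "missing anchor"))
    return problems


if __name__ == "__main__":
    sample = {
        "a.md": "# Intro\nSee [configuration](reference-configuration.md).",
        "b.md": "## Setup Guide\nBack to [intro](a.md#intro).",
    }
    for file, target, reason in find_broken_links(sample):
        print(f"WARNING {file}: {target} ({reason})")
```

A real script would read the files from the docs directory, resolve relative paths across subdirectories, and skip fenced code blocks, where a line starting with # is not a heading.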
cc @tisonkun @Anonymitaet @momo-jun @michaeljmarshall
Any reference?
No response
Are you willing to submit a PR?
- I’m willing to submit a PR!
About this issue
- Original URL
- State: closed
- Created 2 years ago
- Reactions: 2
- Comments: 19 (18 by maintainers)
FYI: https://github.com/apache/pulsar/pull/18132/files#diff-2f6afd23b9f4fddf8ea934eb85c62599f7d05e01bb457cbabc2a178cd92ddab9 is adding reference-configuration.md back.
CC @SignorMercurio
@Anonymitaet yes. I think that we should retain the URL /docs/<path> for convenience and current usage, but it helps to serve it as an alias. That is, the source of truth is /docs/<latest-version>/<path> and we set up /docs/<content> as an alias to the latest stable versioned one. In this way, users can use /docs/<path> to access the latest stable version (which can be changed), while wherever an immutable link is needed, it uses /docs/<latest-version>/<path>.
I think this should be done in the building stage. And let’s move the discussion to #17438 instead of polluting this thread 😃
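To make the alias idea concrete, here is a small Python sketch of how an unversioned docs path could map to the latest stable versioned one. The function name `resolve_docs_alias` and the version strings are hypothetical and only illustrate the mapping. How the site build would actually implement it is not decided in this thread.

```python
def resolve_docs_alias(path, latest_version, versions):
    """Map /docs/<path> to /docs/<latest-version>/<path>.

    Paths that already start with a known version, or that are not under
    /docs/, are returned unchanged.
    """
    prefix = "/docs/"
    if not path.startswith(prefix):
        return path
    rest = path[len(prefix):]
    first_segment = rest.split("/", 1)[0]
    if first_segment in versions:
        return path  # already an immutable, versioned link
    return f"{prefix}{latest_version}/{rest}"


if __name__ == "__main__":
    known = {"2.9.x", "2.10.x"}  # hypothetical version names
    print(resolve_docs_alias("/docs/io-kafka/", "2.10.x", known))
    print(resolve_docs_alias("/docs/2.9.x/io-kafka/", "2.10.x", known))
```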
@Anonymitaet one comment here is about #17438. Otherwise, when a release note refers to the latest stable version, it can only use /docs/path/to/page instead of a stable /docs/version/path/to/page. If we later make some refactors for the next version, those links can break after the next stable version is released. Even if we don’t break the link, once the next stable version is released, the target of the link is semantically wrong.
For example, [Kafka Connector](https://pulsar.apache.org/docs/en/io-kafka/) in this page links to a removed page.
For external tools, normally we use https://www.drlinkcheck.com/ to check URLs.
I’m adding Docu’s built-in support now. Integrating it into CI can be a follow-up.
#18014 looks great, but it seems like a big change. If we build the site dynamically, I think checking the generated HTML files may be better than using an external crawler.
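For illustration, checking the generated HTML could look roughly like the Python sketch below. It uses only the standard library. The name `check_site` is hypothetical, and the input is assumed to be an in-memory mapping from site path to HTML rather than the real build output directory. Internal links are checked against the set of known pages, and fragments against the id attributes found on the target page.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class PageScanner(HTMLParser):
    """Collects the href of every <a> tag and every id attribute on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.ids.add(attrs["id"])
        if tag == "a" and attrs.get("href"):
            self.hrefs.append(attrs["href"])


def check_site(pages):
    """pages maps a site path such as "/docs/io-kafka/" to its HTML.

    Returns (page, href) pairs whose target page or anchor does not exist.
    """
    scanned = {}
    for path, html in pages.items():
        scanner = PageScanner()
        scanner.feed(html)
        scanned[path] = scanner

    broken = []
    for path, scanner in scanned.items():
        for href in scanner.hrefs:
            if "://" in href or href.startswith("mailto:"):
                continue  # external links are out of scope for this sketch
            target, fragment = urldefrag(urljoin(path, href))
            target = target.split("?", 1)[0]  # ignore query strings
            if target not in scanned:
                broken.append((path, href))
            elif fragment and fragment not in scanned[target].ids:
                broken.append((path, href))
    return broken
```

Compared with an external crawler, this only needs the build output, so it could run in CI before anything is published.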
Before #18014 finishes, I think 1. wrong markdown file reference and 4. invalid title anchor are still worth detecting.