pulsar: [Doc][improve] Auto-detect broken links and integrate into CI process

Search before asking

I searched in the issues and found nothing similar.

What issue do you find in Pulsar docs?

After #17495 and #17599, we can see many broken links in the Pulsar documentation, and more links may break as the docs evolve.

What is your suggestion?

In #17599 I used a script, but it was not reliable enough, so I had to check the list of links manually. I suggest developing a reliable script that automatically detects incorrect links in the Pulsar documentation. We could then integrate this script into the CI process that runs on documentation changes.

Before development, I’d like to enumerate all kinds of broken links.

1. wrong markdown file reference

For example the link of this page:

The markdown content is [configuration](reference-configuration.md), but the reference-configuration.md file does not exist.

2. 404 URL path

For example the link of this page:

The markdown content is [type](/api/client/index.html?org/apache/pulsar/client/api/CompressionType.html), but the Pulsar site doesn’t have this path.

3. confusing URL path

For example the link of this page:

The markdown content is [Pulsar Functions CLI](/tools/pulsar-admin/), but this refers to a confusing page:

4. invalid title anchor

We can use # to refer to a specific block of HTML, like this: [dataDir](reference-configuration.md#zookeeper-dataDir). It would be better if our script could also verify that the anchor exists.

Our script should be able to detect these broken links and print warning messages to users.
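As a rough illustration of cases 1 and 4, here is a minimal Python sketch that scans one markdown file for relative links and warns about missing target files and unknown heading anchors. The function names (`slugify`, `heading_anchors`, `check_file`) are hypothetical, and the slug rule is an assumption that only approximates how the site generator derives anchors, so it would need adjusting to match the real build. Lines starting with `#` inside fenced code blocks are also counted as headings here, which is a known limitation of the sketch.

```python
import re
from pathlib import Path

# Inline markdown links of the form [text](target); images (![...]) are skipped.
LINK_RE = re.compile(r"(?<!!)\[[^\]]*\]\(([^)\s]+)\)")
# ATX headings such as "## Some title".
HEADING_RE = re.compile(r"^#{1,6}\s+(.+?)\s*$")
# Anything that starts with a URL scheme, e.g. https: or mailto:.
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.-]*:", re.IGNORECASE)


def slugify(heading: str) -> str:
    """Approximate a heading anchor (assumption: lowercase, spaces become hyphens)."""
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\- ]", "", slug)  # drop punctuation
    return slug.replace(" ", "-")


def heading_anchors(md_file: Path) -> set[str]:
    """Collect the anchors implied by the headings of a markdown file."""
    anchors = set()
    for line in md_file.read_text(encoding="utf-8").splitlines():
        match = HEADING_RE.match(line)
        if match:
            anchors.add(slugify(match.group(1)))
    return anchors


def check_file(md_file: Path) -> list[str]:
    """Return warnings for broken relative links found in md_file."""
    warnings = []
    text = md_file.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        # External URLs and site-absolute paths (cases 2 and 3) are out of scope here.
        if SCHEME_RE.match(target) or target.startswith("/"):
            continue
        path_part, _, anchor = target.partition("#")
        dest = md_file if not path_part else md_file.parent / path_part
        if not dest.is_file():
            warnings.append(f"{md_file}: missing file '{path_part}'")
            continue
        if anchor and dest.suffix == ".md" and anchor not in heading_anchors(dest):
            warnings.append(f"{md_file}: unknown anchor '#{anchor}' in '{dest.name}'")
    return warnings
```

To cover a whole docs tree, one could call `check_file` for every path returned by `Path("docs").rglob("*.md")` (the `docs` directory name is an assumption) and fail the CI job when any warning is returned.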

cc @tisonkun @Anonymitaet @momo-jun @michaeljmarshall

Any reference?

No response

Are you willing to submit a PR?

I’m willing to submit a PR!

About this issue

State: closed
Created 2 years ago
Reactions: 2
Comments: 19 (18 by maintainers)

Most upvoted comments

OK. But every link to reference-configuration.md returns 404… I think we should do something to improve this.

FYI: https://github.com/apache/pulsar/pull/18132/files#diff-2f6afd23b9f4fddf8ea934eb85c62599f7d05e01bb457cbabc2a178cd92ddab9 is adding reference-configuration.md back

CC @SignorMercurio

Anonymitaet on Oct 21, 2022

@Anonymitaet yes. I think that we should retain the URL /docs/<path> for convenience and current usage, but it helps to serve it as an alias.

That is, the source of truth is /docs/<latest-version>/<path>, and we set up /docs/<path> as an alias to the latest stable versioned one. In this way, users can use /docs/<path> to access the latest stable version (which can change), while anything that needs an immutable link uses /docs/<latest-version>/<path>.
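To make the alias idea concrete, here is a small Python sketch of the path mapping only. The function name `resolve_docs_path` and the version strings in the usage note are hypothetical, and how the real site build would implement the alias (redirect, copy, or rewrite) is not specified in this thread.

```python
def resolve_docs_path(url_path: str, latest_version: str, known_versions: set[str]) -> str:
    """Map an unversioned /docs/<path> alias to /docs/<latest_version>/<path>.

    Paths that already name a known version, and paths outside /docs/,
    are returned unchanged.
    """
    prefix = "/docs/"
    if not url_path.startswith(prefix):
        return url_path
    rest = url_path[len(prefix):]
    first_segment, _, _ = rest.partition("/")
    if first_segment in known_versions:
        return url_path  # already an immutable, versioned link
    return f"{prefix}{latest_version}/{rest}"
```

For example, with a hypothetical latest version `2.10.x`, the alias `/docs/io-kafka/` would map to `/docs/2.10.x/io-kafka/`, while `/docs/2.9.x/io-kafka/` would be left as it is.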

I think this should be done in the building stage. And let’s move the discussion to #17438 instead of polluting this thread 😃

tisonkun on Oct 24, 2022

@Anonymitaet one comment here is about #17438. Otherwise, when a release note refers to the latest stable version, it can only use /docs/path/to/page instead of a stable /docs/version/path/to/page. If we later refactor the docs for the next version, those links can break once the next stable version is released. Even if we don’t break the link, its target becomes semantically wrong when the next stable version is released.

For example, [Kafka Connector](https://pulsar.apache.org/docs/en/io-kafka/) in this page links to a removed page.

tisonkun on Oct 21, 2022

For external tools, normally we use https://www.drlinkcheck.com/ to check URLs.

Anonymitaet on Oct 13, 2022

I’m adding Docu’s builtin support now. Integrating it into CI can be a follow-up.

tisonkun on Jan 8, 2023

#18014 looks great, but it seems like a big change. If we build the site dynamically, I think checking the generated HTML files may be better than using an external crawler.

Before #18014 finishes, I think 1. wrong markdown file reference and 4. invalid title anchors are still worth detecting.
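For comparison, checking the generated HTML could look roughly like the Python sketch below. It walks a build output directory, collects every `href` and `id`, and warns when a site-absolute link has no matching generated file or when a fragment has no matching `id`. The directory layout (a page served from `<path>/index.html` or `<path>.html`) and the names `parse_page`, `resolve`, and `check_site` are assumptions for illustration, not the behavior of the actual Pulsar site build.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlsplit


class _Collector(HTMLParser):
    """Collect link targets and element ids from one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.ids = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id"):
            self.ids.add(attributes["id"])
        if tag == "a" and attributes.get("href"):
            self.hrefs.append(attributes["href"])


def parse_page(html_file: Path) -> _Collector:
    collector = _Collector()
    collector.feed(html_file.read_text(encoding="utf-8"))
    return collector


def resolve(build_dir: Path, url_path: str):
    """Map a site-absolute URL path to a generated file, or return None."""
    rel = url_path.strip("/")
    candidates = [build_dir / rel, build_dir / rel / "index.html"]
    if rel:
        candidates.append(build_dir / (rel + ".html"))
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    return None


def check_site(build_dir: Path) -> list[str]:
    """Return warnings for broken site-absolute links and fragments."""
    warnings = []
    for page in sorted(build_dir.rglob("*.html")):
        for href in parse_page(page).hrefs:
            parts = urlsplit(href)
            if parts.scheme or parts.netloc:
                continue  # external link, left to a crawler
            if parts.path.startswith("/"):
                target = resolve(build_dir, parts.path)
            elif parts.path == "":
                target = page  # fragment-only link within the same page
            else:
                continue  # page-relative links are skipped in this sketch
            if target is None:
                warnings.append(f"{page}: no generated page for '{parts.path}'")
            elif parts.fragment and parts.fragment not in parse_page(target).ids:
                warnings.append(f"{page}: no element with id '{parts.fragment}' for '{href}'")
    return warnings
```

Because this works on the build output rather than the markdown sources, it can catch 404 URL paths and invalid anchors in one pass, at the cost of requiring a full site build before the check runs.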

labuladong on Oct 13, 2022