Broken Links Validator

August 17, 2024

Introduction

Ensuring the quality and accuracy of documentation is crucial for providing a seamless user experience. Regularly checking for broken links helps maintain the integrity of your documentation and prevents users from encountering outdated or incorrect information.

This guide will walk you through the process of checking for broken links in markdown files using markdown-link-check and offer a more comprehensive approach by integrating a custom validation tool.

Prerequisites

Before starting, make sure you have the following tools installed on your system:

  • Node.js and npm: These are required to install and use the markdown-link-check package. Verify your installation by running:
node -v
npm -v

Step-by-Step Instructions

Install markdown-link-check Globally

To check for broken links, you’ll need the markdown-link-check package. Install it globally using npm:

npm install -g markdown-link-check

Navigate to Your Repository

This guide uses the hedera-docs repository as an example. Clone the repository you want to check and navigate to its directory:

git clone https://github.com/hashgraph/hedera-docs.git
cd hedera-docs

Run the Link Checker

To check for broken links in a specific markdown file, use the following command:

markdown-link-check README.md

To check every markdown file in the repository, use the find command to locate all .md files and run markdown-link-check on each one through -exec. Redirecting the output to broken-links.txt saves the results for later processing.

find . -name "*.md" -exec markdown-link-check {} \; > broken-links.txt
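If you prefer to see which files the find command will visit before running the checker, the traversal can be sketched in Node.js. The helper name findMarkdownFiles and the list of skipped directories are assumptions made for illustration; they are not part of markdown-link-check.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper (not part of markdown-link-check): recursively collect
// .md files under rootDir, skipping directories that rarely hold documentation.
function findMarkdownFiles(rootDir, skipDirs = ['node_modules', '.git']) {
  const results = [];
  for (const entry of fs.readdirSync(rootDir, { withFileTypes: true })) {
    const fullPath = path.join(rootDir, entry.name);
    if (entry.isDirectory()) {
      if (!skipDirs.includes(entry.name)) {
        results.push(...findMarkdownFiles(fullPath, skipDirs));
      }
    } else if (entry.isFile() && entry.name.endsWith('.md')) {
      results.push(fullPath);
    }
  }
  return results.sort();
}

// Usage: print the markdown files under the current directory.
// console.log(findMarkdownFiles('.').join('\n'));
```

Unlike the find command above, this sketch skips node_modules and .git, which may or may not be what you want for your repository.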

Review the Output

The tool will provide an output indicating the status of each link:

  • ✔ 200 OK: The link is valid.
  • ✖ 404 Not Found: The link is broken.
  • Other Status Codes: Indicate issues like redirects or timeouts.

Running the command above should create a broken-links.txt file that looks like this example.
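To work with the saved output programmatically, it can be parsed into structured records. The sketch below assumes the "FILE: <path>" and "[✖] <link> → Status: <code>" line shapes that the validator script later in this guide also relies on; the function name parseDeadLinks and the sample data are hypothetical.

```javascript
// Parse markdown-link-check output into { file, link, status } records.
// Assumption: files are introduced by "FILE: <path>" and dead links are
// reported as "[✖] <link> → Status: <code>".
function parseDeadLinks(output) {
  const records = [];
  let currentFile = '';
  for (const rawLine of output.split('\n')) {
    const line = rawLine.trim();
    if (line.startsWith('FILE:')) {
      currentFile = line.slice('FILE:'.length).trim();
      continue;
    }
    const match = line.match(/\[✖\] (.+) → Status: (\d+)/);
    if (match) {
      records.push({ file: currentFile, link: match[1], status: Number(match[2]) });
    }
  }
  return records;
}

// Made-up sample in the assumed format.
const sample = [
  'FILE: ./README.md',
  '  [✓] https://example.com/ok',
  '  [✖] https://example.com/missing → Status: 404',
].join('\n');

console.log(parseDeadLinks(sample));
```

Lines that mark a dead link without a status code are ignored by this sketch, so only entries with a known status end up in the result.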

Fix Broken Links

Based on the output, manually update or remove broken links to ensure your documentation remains accurate and user-friendly.

Advanced Broken Link Validation

For a more comprehensive approach to link validation, especially for large projects with numerous markdown files, you might consider using a custom validation script.

Below is a guide to creating and using a custom broken links validator in Node.js.

Setting Up the Project

Before diving into the code, you'll need to set up your environment:

1. Ensure Node.js is Installed

Make sure you have Node.js 18 or later installed on your system, because the script uses the built-in fetch API. You can download Node.js from nodejs.org.
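A quick way to confirm that your Node.js version is recent enough for the script is a small preflight check. The validator calls the global fetch(), which Node.js provides by default from version 18. The function name hasGlobalFetch is a hypothetical helper for illustration.

```javascript
// Hypothetical preflight check: returns true when the given Node.js version
// (defaulting to the running one) ships a global fetch() by default.
function hasGlobalFetch(nodeVersion = process.versions.node) {
  const major = Number(nodeVersion.split('.')[0]);
  return major >= 18;
}

if (!hasGlobalFetch()) {
  console.error('Node.js 18 or later is needed for the global fetch API.');
}
```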

2. Create a Project Directory

mkdir broken-links-validator
cd broken-links-validator

3. Initialize a New Node.js Project

npm init -y

Understanding the Code

Let's break down the code and understand each part:

1. Required Modules

const fs = require('fs');
const path = require('path');
  • fs is the file system module for reading and writing files.
  • path helps with manipulating file and directory paths.

2. Custom Progress Bar

function ProgressBar(total) {
  this.total = total;
  this.current = 0;
  this.bar_length = 50;
 
  this.update = function (current) {
    this.current = current;
    const percentage = this.current / this.total;
    const percents = Math.round(100 * percentage);
    // The factors of 2 appear to assume that `total` counts each dead link twice
    // while `current` counts it once (see totalLinks in generateSimplifiedReport).
    // Clamp the filled width so repeat() never receives a negative count.
    const filled_length = Math.min(Math.round(this.bar_length * percentage) * 2, this.bar_length);
    const bar = '='.repeat(filled_length) + '-'.repeat(this.bar_length - filled_length);
 
    process.stdout.write(`\r[${bar}] ${percents*2}% | ${this.current}/${this.total / 2}`);
 
    if (this.current === this.total) {
      process.stdout.write('\n');
    }
  }
}

The ProgressBar function displays a visual progress bar in the console. It helps track the progress of link validation.
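The rendering logic is easier to reason about when it is separated from the console output. The following is a simplified sketch, not the script's own implementation: the function name renderBar is hypothetical, and it maps the ratio of current to total directly onto the bar width.

```javascript
// Simplified sketch: build the progress bar string for a given position.
// The ratio is clamped to 1 so the bar never overflows its width.
function renderBar(current, total, width = 50) {
  const ratio = total > 0 ? Math.min(current / total, 1) : 0;
  const filled = Math.round(width * ratio);
  const bar = '='.repeat(filled) + '-'.repeat(width - filled);
  return `[${bar}] ${Math.round(ratio * 100)}% | ${current}/${total}`;
}

console.log(renderBar(5, 10, 10)); // [=====-----] 50% | 5/10
```

Returning a string instead of writing to process.stdout keeps the function easy to test; the caller can still prefix it with "\r" to redraw the bar in place.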

3. Fetching with Retry

const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
 
async function fetchWithRetry(url, options, retries = 3, backoff = 300) {
  try {
    return await fetch(url, options);
  } catch (err) {
    if (retries === 0) throw err;
    await delay(backoff);
    return fetchWithRetry(url, options, retries - 1, backoff * 2);
  }
}
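  • delay pauses execution for the specified number of milliseconds.
  • fetchWithRetry attempts to fetch a URL and, if the request throws (for example, on a network error), retries up to three more times. The wait between attempts starts at 300 ms and doubles each time (exponential backoff) to ride out temporary issues. An HTTP error response such as 404 or 500 does not throw, so it is returned without a retry.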
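To make the backoff arithmetic concrete, the waits that fetchWithRetry inserts between attempts can be computed directly. The helper name backoffSchedule is hypothetical; the defaults mirror the retries = 3 and backoff = 300 parameters shown above.

```javascript
// Delays (in ms) that fetchWithRetry waits before each retry:
// the first wait is `backoff`, and every later wait is double the previous one.
function backoffSchedule(retries = 3, backoff = 300) {
  const delays = [];
  for (let attempt = 0; attempt < retries; attempt++) {
    delays.push(backoff * 2 ** attempt);
  }
  return delays;
}

const delays = backoffSchedule();
console.log(delays);                            // [ 300, 600, 1200 ]
console.log(delays.reduce((a, b) => a + b, 0)); // 2100
```

With the defaults, a URL that never responds costs four attempts and about 2.1 seconds of waiting, on top of the time each failed request takes.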

4. Finding Line Numbers

async function findLineNumber(filePath, searchString) {
  try {
    const response = await fetchWithRetry(`${rawContentBaseUrl}${filePath}`);
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }
    const content = await response.text();
    const lines = content.split('\n');
    for (let i = 0; i < lines.length; i++) {
      if (lines[i].includes(searchString)) {
        return i + 1;
      }
    }
  } catch (error) {
    console.error(` Error fetching file content: ${error}`);
  }
  return null;
}

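This function fetches the raw content of a file from GitHub (using rawContentBaseUrl, which is defined in the complete script below) and returns the 1-based number of the first line that contains the search string. It returns null if the string is not found or the file cannot be fetched.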
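The search itself does not depend on the network, so it can be tried on an in-memory string. This is a minimal sketch of that step only; the name findLineInText and the sample markdown are assumptions for illustration.

```javascript
// Offline sketch of the search inside findLineNumber: return the 1-based
// number of the first line containing searchString, or null if none does.
function findLineInText(content, searchString) {
  const lines = content.split(/\r?\n/);
  const index = lines.findIndex((line) => line.includes(searchString));
  return index === -1 ? null : index + 1;
}

const markdown = ['# Title', '', 'See [docs](https://example.com/missing).'].join('\n');
console.log(findLineInText(markdown, 'https://example.com/missing')); // 3
console.log(findLineInText(markdown, 'https://example.com/other'));   // null
```

As in the original function, only the first occurrence is reported, so a link that appears several times in a file points to its first line.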

5. Checking Link Status

async function isLinkBroken(url, currentFilePath) {
  if (url.startsWith('#') || url.startsWith('../') || url.startsWith('./') || url.startsWith('/')) {
    console.log(` Skipping relative link: ${url}`);
    return false;
  }
 
  try {
    const response = await fetchWithRetry(url, {
      method: 'HEAD'
    });
    return response.status === 404;
  } catch (error) {
    console.error(` Error checking link: ${error}`);
    return false;
  }
}
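  • isLinkBroken checks whether a URL is broken by sending a HEAD request and treats only a 404 response as broken. It skips anchor links and relative links, and it logs request errors and treats those links as not broken.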
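The prefix checks at the top of isLinkBroken can be expressed as a small classifier, which makes it easier to see which links are skipped. The function name classifyLink and the category labels are hypothetical.

```javascript
// Sketch: classify a link using the same prefix checks as isLinkBroken,
// plus an explicit test for http(s) URLs.
function classifyLink(url) {
  if (url.startsWith('#')) return 'anchor';
  if (url.startsWith('./') || url.startsWith('../') || url.startsWith('/')) return 'relative';
  if (/^https?:\/\//i.test(url)) return 'http';
  return 'other'; // e.g. mailto: links or bare file names such as "guide.md"
}

console.log(classifyLink('#setup'));              // anchor
console.log(classifyLink('../images/logo.png'));  // relative
console.log(classifyLink('https://example.com')); // http
console.log(classifyLink('guide.md'));            // other
```

Links in the "other" category are not caught by the prefix checks in isLinkBroken, so they reach the fetch call, where they would most likely fail and end up being reported as not broken.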

6. Generating the Report

async function generateSimplifiedReport(data) {
  const lines = data.split('\n');
  let currentFile = '';
  let reportLines = ['# Broken Links Report\n'];
  let processedLinks = 0;
  let brokenLinks = [];
 
  const totalLinks = lines.filter(line => line.includes('[✖]')).length;
  const progressBar = new ProgressBar(totalLinks);
 
  for (const line of lines) {
    if (line.startsWith('FILE:')) {
      currentFile = line.replace('FILE: ', '').trim();
    } else if (line.includes('[✖]')) {
      const match = line.match(/\[✖] (.+) → Status: (.+)/);
      if (match) {
        const [link, status] = match.slice(1, 3);
 
        processedLinks++;
        progressBar.update(processedLinks);
 
        try {
          if (await isLinkBroken(link, currentFile)) {
            const lineNumber = await findLineNumber(currentFile.replace('./', '').replace(/^\.\.\//, ''), link);
            const githubFileUrl = `${baseRepoUrl}${currentFile.replace('./','').replace(/^\.\.\//, '')}`;
            const fileName = path.basename(currentFile);
 
            const truncateLength = 30;
            const linkLength = link.length;
            let truncatedLink;
            if (linkLength > truncateLength) {
              const firstPart = link.substring(0, 10);
              const lastPart = link.substring(linkLength - 20);
              truncatedLink = `${firstPart}...${lastPart}`;
            } else {
              truncatedLink = link;
            }
 
            if (lineNumber) {
              brokenLinks.push(`- [${fileName}](${githubFileUrl}?plain=1#L${lineNumber}): ${truncatedLink} (404)`);
            } else {
              console.log(` [${fileName}](${githubFileUrl}): ${truncatedLink} (404) - Line not found`);
            }
          }
        } catch (error) {
          console.error(`Error processing link ${link}: ${error}`);
        }
 
        await delay(100);
      }
    }
  }
 
  return reportLines.concat(brokenLinks).join('\n');
}

7. Complete Script

Here's the complete code for your reference:

const fs = require('fs');
const path = require('path');
 
const inputFilePath = './broken-links.txt';
const outputFilePath = './broken-links-report.md';
 
const baseRepoUrl = 'https://github.com/hashgraph/hedera-docs/blob/master/';
const rawContentBaseUrl = 'https://raw.githubusercontent.com/hashgraph/hedera-docs/master/';
 
function ProgressBar(total) {
  this.total = total;
  this.current = 0;
  this.bar_length = 50;
 
  this.update = function (current) {
    this.current = current;
    const percentage = this.current / this.total;
    const percents = Math.round(100 * percentage);
    // The factors of 2 appear to assume that `total` counts each dead link twice
    // while `current` counts it once (see totalLinks in generateSimplifiedReport).
    // Clamp the filled width so repeat() never receives a negative count.
    const filled_length = Math.min(Math.round(this.bar_length * percentage) * 2, this.bar_length);
    const bar = '='.repeat(filled_length) + '-'.repeat(this.bar_length - filled_length);
 
    process.stdout.write(`\r[${bar}] ${percents*2}% | ${this.current}/${this.total / 2}`);
 
    if (this.current === this.total) {
      process.stdout.write('\n');
    }
  }
}
 
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
 
async function fetchWithRetry(url, options, retries = 3, backoff = 300) {
  try {
    return await fetch(url, options);
  } catch (err) {
    if (retries === 0) throw err;
    await delay(backoff);
    return fetchWithRetry(url, options, retries - 1, backoff * 2);
  }
}
 
async function findLineNumber(filePath, searchString) {
  try {
    const response = await fetchWithRetry(`${rawContentBaseUrl}${filePath}`);
    if (!response.ok) {
      throw new Error(`HTTP error! status: ${response.status}`);
    }
    const content = await response.text();
    const lines = content.split('\n');
    for (let i = 0; i < lines.length; i++) {
      if (lines[i].includes(searchString)) {
        return i + 1;
      }
    }
  } catch (error) {
    console.error(` Error fetching file content: ${error}`);
  }
  return null;
}
 
async function isLinkBroken(url, currentFilePath) {
  if (url.startsWith('#')) {
    return false;
  }
 
  if (url.startsWith('../') || url.startsWith('./') || url.startsWith('/')) {
    console.log(` Skipping relative link: ${url}`);
    return false;
  }
 
  try {
    const response = await fetchWithRetry(url, {
      method: 'HEAD'
    });
    return response.status === 404;
  } catch (error) {
    console.error(` Error checking link: ${error}`);
    return false;
  }
}
 
async function generateSimplifiedReport(data) {
  const lines = data.split('\n');
  let currentFile = '';
  let reportLines = ['# Broken Links Report\n'];
  let processedLinks = 0;
  let brokenLinks = [];
 
  const totalLinks = lines.filter(line => line.includes('[✖]')).length;
  const progressBar = new ProgressBar(totalLinks);
 
  for (const line of lines) {
    if (line.startsWith('FILE:')) {
      currentFile = line.replace('FILE: ', '').trim();
    } else if (line.includes('[✖]')) {
      const match = line.match(/\[✖] (.+) → Status: (.+)/);
      if (match) {
        const [link, status] = match.slice(1, 3);
 
        processedLinks++;
        progressBar.update(processedLinks);
 
        try {
          if (await isLinkBroken(link, currentFile)) {
            const lineNumber = await findLineNumber(currentFile.replace('./', '').replace(/^\.\.\//, ''), link);
            const githubFileUrl = `${baseRepoUrl}${currentFile.replace('./','').replace(/^\.\.\//, '')}`;
 
            const fileName = path.basename(currentFile);
 
            const truncateLength = 30;
            const linkLength = link.length;
            let truncatedLink;
            if (linkLength > truncateLength) {
              const firstPart = link.substring(0, 10);
              const lastPart = link.substring(linkLength - 20);
              truncatedLink = `${firstPart}...${lastPart}`;
            } else {
              truncatedLink = link;
            }
 
            if (lineNumber) {
              brokenLinks.push(`- [${fileName}](${githubFileUrl}?plain=1#L${lineNumber}): ${truncatedLink} (404)`);
            } else {
              console.log(` [${fileName}](${githubFileUrl}): ${truncatedLink} (404) - Line not found`);
            }
          }
        } catch (error) {
          console.error(`Error processing link ${link}: ${error}`);
        }
 
        await delay(100);
      }
    }
  }
 
  return reportLines.concat(brokenLinks).join('\n');
}
 
fs.readFile(inputFilePath, 'utf8', async (err, data) => {
  if (err) {
    console.error('Error reading the input file:', err);
    return;
  }
 
  console.log('Starting report generation...');
  try {
    const simplifiedReport = await generateSimplifiedReport(data);
 
    fs.writeFile(outputFilePath, simplifiedReport, 'utf8', err => {
      if (err) {
        console.error('Error writing the output file:', err);
        return;
      }
 
      console.log(`\nSimplified report generated successfully: ${outputFilePath}`);
    });
  } catch (error) {
    console.error('Error generating report:', error);
  }
});

The generateSimplifiedReport function parses the markdown-link-check output, re-checks each link reported as dead, and builds a markdown report of the links that still return 404. Each entry links to the file and line on GitHub, and links longer than 30 characters are shown in truncated form.
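The truncation and entry formatting can be pulled out of the loop into two small functions. This is a sketch that mirrors the rules in the script (first 10 and last 20 characters for links longer than 30); the names truncateLink and formatEntry and the sample URLs are hypothetical.

```javascript
// Same truncation rule as the script: links longer than maxLength keep their
// first 10 and last 20 characters, joined by "...".
function truncateLink(link, maxLength = 30) {
  if (link.length <= maxLength) return link;
  return `${link.substring(0, 10)}...${link.substring(link.length - 20)}`;
}

// Build one report entry; the URL shape mirrors the script's template string.
function formatEntry(fileName, githubFileUrl, lineNumber, link) {
  return `- [${fileName}](${githubFileUrl}?plain=1#L${lineNumber}): ${truncateLink(link)} (404)`;
}

console.log(truncateLink('https://example.com/docs/getting-started/old-page'));
// https://ex...ing-started/old-page
console.log(truncateLink('https://a.io/x'));
// https://a.io/x
```

The ?plain=1 query parameter in the entry asks GitHub to show the markdown source rather than the rendered page, so the #L line anchor points at the right line.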

Run the Script

Once you have set up your custom broken links validator script and saved it, you'll need to execute it to generate your broken links report. Follow these steps to run the script and produce the report:

1. Save the Script

Ensure you have saved the script to a file named broken-links-validator.js in your project directory.

2. Prepare Your Input File

Make sure you have an input file named broken-links.txt in the same directory. This file should contain the output of the markdown-link-check tool from previous steps.

3. Run the Script

Execute the script using Node.js:

node broken-links-validator.js

4. Review the Output

Once the script completes, it will generate a file named broken-links-report.md in the same directory. This file will contain a markdown-formatted report of the broken links found. Currently, it looks like this.

Here’s a quick summary of what the script does:

  • Fetches the content of files referenced in the broken links report.
  • Checks the status of each URL.
  • Generates a markdown report listing broken links along with their file location and line numbers.

Troubleshooting

If you encounter any issues, check the following:

  • File Paths: Ensure the inputFilePath and outputFilePath in the script are correct.
  • Permissions: Ensure you have permission to read from and write to the specified files.

If the script runs successfully, you’ll get a clean, user-friendly markdown report that helps you quickly identify and fix broken links in your documentation.

Conclusion

In this guide, we explored how to create a Broken Links Validator using Node.js. We covered setting up the project, understanding the key components of the code, and how each function contributes to the overall functionality.

Known Issues

This custom broken links validator script was developed rapidly and may have some limitations:

  1. Relative Paths: The script currently does not handle relative paths properly. It is designed to work with absolute URLs, and relative links are skipped in the validation process. To address this, you would need to enhance the script to resolve and handle relative paths.

  2. HTTP Status Codes: The script treats a link as broken only when it returns 404 (Not Found); every other response is treated as valid. Other HTTP status codes, such as redirects (3xx), client errors (4xx other than 404), and server errors (5xx), are not explicitly handled, so links that result in these statuses are not reported.

  3. Error Handling: There might be additional edge cases or errors not covered by the script, such as issues with fetching certain URLs or unexpected file formats.
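One possible direction for the relative-path limitation is to resolve each relative link against the file that contains it and check whether the target exists in a local clone. The following is a minimal sketch under that assumption; relativeLinkExists is a hypothetical helper, and it does not handle URL-encoded characters or verify that an anchor exists in the target file.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: resolve a relative markdown link against the file that
// contains it and report whether the target exists on disk.
function relativeLinkExists(repoRoot, markdownFile, link) {
  const target = link.split('#')[0].split('?')[0]; // drop anchor and query
  if (target === '') return true;                  // pure "#anchor" link
  const base = target.startsWith('/')
    ? repoRoot                                     // root-relative link
    : path.dirname(path.join(repoRoot, markdownFile));
  return fs.existsSync(path.join(base, target));
}

// Usage (paths are examples only):
// relativeLinkExists('.', 'docs/guide.md', '../images/logo.png');
```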

Improvements and additional testing are needed to address these limitations and make the script more robust for a broader range of documentation scenarios.
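As a starting point for the status code limitation, responses could be grouped into coarse categories instead of a single 404 check. This is a sketch only: the function name classifyStatus and the category labels are hypothetical, and treating 410 (Gone) as broken is an assumption rather than something the script does.

```javascript
// Sketch: map an HTTP status code to a coarse category for reporting.
function classifyStatus(status) {
  if (status >= 200 && status < 300) return 'ok';
  if (status >= 300 && status < 400) return 'redirect';
  if (status === 404 || status === 410) return 'broken'; // 410 is an assumption
  if (status >= 400 && status < 500) return 'client-error';
  if (status >= 500 && status < 600) return 'server-error';
  return 'unknown';
}

console.log(classifyStatus(200)); // ok
console.log(classifyStatus(301)); // redirect
console.log(classifyStatus(404)); // broken
console.log(classifyStatus(403)); // client-error
console.log(classifyStatus(503)); // server-error
```

Because fetch follows redirects by default, the "redirect" category would mostly matter if the request were made with the redirect: 'manual' option.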