Hacker News

The Hacker News Algolia API provides searchable access to HN stories, comments, and discussions. The dataset is well suited to NLP analysis because it offers:

  • Rich technical text (story titles, Ask HN posts, comments)
  • Metadata for filtering (points, comments, timestamps)
  • Real developer discussions with domain-specific vocabulary
  • Multilingual content (occasional non-English posts)

Getting the Data

# Front page stories
curl -s "https://hn.algolia.com/api/v1/search?tags=front_page&hitsPerPage=50" > hn_front.json

# Ask HN posts (rich text content)
curl -s "https://hn.algolia.com/api/v1/search?tags=ask_hn&hitsPerPage=50" > hn_ask.json

# Show HN posts
curl -s "https://hn.algolia.com/api/v1/search?tags=show_hn&hitsPerPage=50" > hn_show.json

# Search for specific topics
curl -s "https://hn.algolia.com/api/v1/search?query=rust&tags=story&hitsPerPage=50" > hn_rust.json

# Comments on a story
curl -s "https://hn.algolia.com/api/v1/search?tags=comment,story_12345&hitsPerPage=100" > comments.json
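The same URLs can be assembled programmatically, which helps when a query contains spaces or other characters that need escaping. A minimal Python sketch, assuming only the endpoint and parameters shown in the curl commands; `build_search_url` is a hypothetical helper name, not part of any library:

```python
from urllib.parse import urlencode

BASE_URL = "https://hn.algolia.com/api/v1/search"

def build_search_url(tags=None, query=None, hits_per_page=50, page=None):
    """Build a search URL from the parameters used in the curl examples (hypothetical helper)."""
    params = {}
    if query is not None:
        params["query"] = query
    if tags:
        # Comma-separated tags, as in tags=comment,story_12345
        params["tags"] = ",".join(tags)
    params["hitsPerPage"] = hits_per_page
    if page is not None:
        params["page"] = page
    # safe="," keeps the comma between tags readable instead of %2C
    return f"{BASE_URL}?{urlencode(params, safe=',')}"

print(build_search_url(tags=["front_page"]))
print(build_search_url(query="machine learning", tags=["story"]))
```

The resulting string can be passed to curl or any HTTP client.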

Data Structure

The search API returns:

{
  "hits": [
    {
      "objectID": "37392676",
      "title": "Ask HN: I'm an FCC Commissioner proposing regulation of IoT security",
      "url": null,
      "author": "SimingtonFCC",
      "points": 2847,
      "story_text": "Hi everyone, I'm FCC Commissioner Nathan Simington...",
      "num_comments": 475,
      "created_at": "2023-09-05T15:00:00.000Z",
      "created_at_i": 1693926000,
      "_tags": ["story", "author_SimingtonFCC", "story_37392676", "ask_hn"]
    }
  ],
  "nbHits": 12345,
  "page": 0,
  "nbPages": 50,
  "hitsPerPage": 50
}

Key fields:

  • title - Story headline (present on story hits)
  • story_text - Full text for Ask HN / Show HN posts (HTML)
  • url - External link (null for Ask HN)
  • points - Upvotes
  • num_comments - Discussion size
  • _tags - Includes ask_hn, show_hn, front_page, etc.
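When reading these fields from code, note that `url` and `story_text` can be null, so it helps to normalize them before further processing. A small Python sketch using a trimmed copy of the sample hit above; `summarize_hit` is a hypothetical helper:

```python
import json

SAMPLE = """
{
  "hits": [
    {
      "objectID": "37392676",
      "title": "Ask HN: I'm an FCC Commissioner proposing regulation of IoT security",
      "url": null,
      "author": "SimingtonFCC",
      "points": 2847,
      "story_text": "Hi everyone, I'm FCC Commissioner Nathan Simington...",
      "num_comments": 475,
      "_tags": ["story", "author_SimingtonFCC", "story_37392676", "ask_hn"]
    }
  ]
}
"""

def summarize_hit(hit):
    """Pick the key fields from one hit, tolerating null or missing values."""
    return {
        "id": hit["objectID"],
        "title": hit.get("title") or "",
        "has_url": hit.get("url") is not None,
        "has_text": bool(hit.get("story_text")),
        "points": hit.get("points") or 0,
        "comments": hit.get("num_comments") or 0,
        "is_ask_hn": "ask_hn" in hit.get("_tags", []),
    }

for hit in json.loads(SAMPLE)["hits"]:
    print(summarize_hit(hit))
```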

Basic Queries

List Story Titles

jpx 'hits[*].title' hn_front.json

Get Top Stories with Metadata

jpx 'hits[*].{
  title: title,
  points: points,
  comments: num_comments,
  author: author
}' hn_front.json

Filter by Points

jpx 'hits[?points > `500`].title' hn_front.json

Sort by Discussion Size

jpx 'sort_by(hits, &num_comments) | reverse(@) | [:10].{
  title: title,
  comments: num_comments
}' hn_front.json

NLP Analysis on Titles

Tokenize and Analyze Titles

# Get word frequencies across all titles
jpx 'hits[*].title | join(` `, @) | tokens(@) | remove_stopwords(@) | frequencies(@)' hn_front.json

Extract Keywords from Top Stories

jpx 'hits[?points > `200`].title 
  | join(` `, @) 
  | tokens(@) 
  | remove_stopwords(@) 
  | stems(@) 
  | frequencies(@)' hn_front.json

Find Common Technical Terms

# Stem and count to normalize variations (Rust/rust, APIs/API)
jpx 'hits[*].title 
  | join(` `, @) 
  | tokens(@) 
  | remove_stopwords(@) 
  | stems(@) 
  | frequencies(@) 
  | items(@) 
  | sort_by(@, &[1]) 
  | reverse(@) 
  | [:20]' hn_front.json
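To sanity-check the counts outside jpx, the tokenize, filter, and count steps can be approximated with the Python standard library. This is a sketch, not a reimplementation: the regex tokenizer and the tiny stopword set are assumptions, jpx's `tokens`, `remove_stopwords`, and `stems` may behave differently, and stemming is omitted here.

```python
import re
from collections import Counter

# Deliberately tiny stopword list (assumption; jpx ships its own)
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to", "with"}

def title_term_counts(titles, top_n=20):
    """Lowercase, tokenize, drop stopwords, and count terms across titles."""
    counts = Counter()
    for title in titles:
        tokens = re.findall(r"[a-z0-9]+", title.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

# Made-up titles for illustration
titles = [
    "Rust for the Linux kernel",
    "Writing a parser in Rust",
    "The state of Linux on the desktop",
]
print(title_term_counts(titles, top_n=3))
```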

Bigram Analysis on Headlines

jpx 'hits[*].title | join(` `, @) | bigrams(@) | frequencies(@)' hn_front.json

Analyzing Ask HN Posts

Ask HN posts carry their body in story_text as HTML, which gives the NLP functions far more text to work with than a title alone.

Extract and Clean Story Text

# Remove HTML tags and normalize
jpx 'hits[0].story_text 
  | regex_replace(@, `<[^>]+>`, ` `) 
  | collapse_whitespace(@)' hn_ask.json
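The same cleanup can be reproduced in Python when preparing text outside jpx. The sketch below mirrors the tag-stripping regex above and additionally decodes HTML entities, on the assumption that `story_text` may contain entities such as `&#x27;`; `clean_story_text` is a hypothetical helper.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")

def clean_story_text(raw):
    """Strip tags, decode entities, and collapse whitespace; None becomes ''."""
    if not raw:
        return ""
    text = TAG_RE.sub(" ", raw)      # same pattern as the jpx example
    text = html.unescape(text)       # e.g. &#x27; -> '
    return " ".join(text.split())    # collapse runs of whitespace

print(clean_story_text("Hi everyone,<p>I&#x27;m proposing <i>new</i> rules.</p>"))
```

Tags are stripped before entities are decoded so that escaped text such as `&lt;b&gt;` is kept as literal characters rather than removed as markup.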

Full Text Analysis Pipeline

jpx 'hits[0].story_text 
  | regex_replace(@, `<[^>]+>`, ` `)
  | tokens(@) 
  | remove_stopwords(@) 
  | stems(@) 
  | frequencies(@)
  | items(@)
  | sort_by(@, &[1])
  | reverse(@)
  | [:15]' hn_ask.json

Compare Vocabulary Across Posts

# Extract top keywords per Ask HN post
jpx 'hits[:5] | [*].{
  title: title,
  keywords: story_text 
    | regex_replace(@, `<[^>]+>`, ` `) 
    | tokens(@) 
    | remove_stopwords(@) 
    | stems(@) 
    | frequencies(@) 
    | items(@) 
    | sort_by(@, &[1]) 
    | reverse(@) 
    | [:5][*][0]
}' hn_ask.json

Reading Time Estimates

jpx 'hits[:10] | [*].{
  title: title,
  reading_time: story_text | regex_replace(@, `<[^>]+>`, ` `) | reading_time(@),
  word_count: story_text | regex_replace(@, `<[^>]+>`, ` `) | word_count(@)
}' hn_ask.json

Topic Detection

Categorize by Keywords

# Find AI/ML related posts
jpx 'hits[?regex_match(lower(title), `\\bai\\b`) || 
        contains(lower(title), `llm`) || 
        contains(lower(title), `gpt`) ||
        contains(lower(title), `machine learning`)].{
  title: title,
  points: points
}' hn_front.json

Programming Language Mentions

jpx 'hits[*].{
  title: title,
  mentions_rust: contains(lower(title), `rust`),
  mentions_go: regex_match(lower(title), `\\bgo\\b`),
  mentions_python: contains(lower(title), `python`)
} | [?mentions_rust || mentions_go || mentions_python]' hn_front.json
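The `\b` word boundaries in the `go` pattern matter because a plain substring test also matches words like "google" or "ago". A short Python sketch of the difference, using the standard `re` module; `mentions` is a hypothetical helper and the titles are made up:

```python
import re

def mentions(title, term):
    """True if term appears in title as a whole word, case-insensitively."""
    return re.search(rf"\b{re.escape(term)}\b", title, flags=re.IGNORECASE) is not None

titles = [
    "Go 1.22 released",
    "Google announces a new model",
    "Show HN: A Rust crate for parsing",
]
for t in titles:
    print(t, "| substring:", "go" in t.lower(), "| whole word:", mentions(t, "go"))
```

The same consideration applies to any short term, which is why substring checks are best reserved for longer, distinctive names such as `python`.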

N-gram Topic Extraction

# Find common 2-word phrases (potential topics)
jpx 'hits[*].title 
  | join(` `, @) 
  | lower(@)
  | ngrams(@, `2`, `word`)
  | frequencies(@)
  | items(@)
  | sort_by(@, &[1])
  | reverse(@)
  | [:10]' hn_front.json
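For comparison, word bigrams can be counted per title in a few lines of Python. Unlike the pipeline above, which joins all titles into one string first, this sketch builds bigrams within each title so that no pair spans two headlines. The whitespace tokenizer is an assumption and may not match what jpx's `ngrams` produces.

```python
from collections import Counter

def title_bigrams(titles, top_n=10):
    """Count adjacent word pairs within each title (never across titles)."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        counts.update(zip(words, words[1:]))
    return counts.most_common(top_n)

# Made-up titles for illustration
titles = [
    "Show HN: Open source vector database",
    "Why open source matters",
    "An open source alternative to Notion",
]
print(title_bigrams(titles, top_n=1))
```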

Sentiment and Engagement Analysis

High Engagement Posts

# Posts with high comment-to-point ratio (controversial?)
jpx 'hits[?points > `100`] | [*].{
  title: title,
  points: points,
  comments: num_comments,
  ratio: divide(num_comments, points)
} | sort_by(@, &ratio) | reverse(@) | [:10]' hn_front.json
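The ratio itself is simple arithmetic: a story with 475 comments and 2847 points has a ratio of about 0.17. A Python sketch of the same ranking, where the 100-point floor also keeps the division safe; `most_discussed` is a hypothetical helper and the sample hits are made up for illustration:

```python
def most_discussed(hits, min_points=100, top_n=10):
    """Rank hits by comments per point, ignoring low-scoring stories."""
    rows = [
        {
            "title": h["title"],
            "points": h["points"],
            "comments": h.get("num_comments") or 0,
            "ratio": (h.get("num_comments") or 0) / h["points"],
        }
        for h in hits
        if (h.get("points") or 0) > min_points
    ]
    return sorted(rows, key=lambda r: r["ratio"], reverse=True)[:top_n]

# Made-up sample hits for illustration
hits = [
    {"title": "A", "points": 200, "num_comments": 300},
    {"title": "B", "points": 1000, "num_comments": 100},
    {"title": "C", "points": 50, "num_comments": 400},
]
for row in most_discussed(hits):
    print(row["title"], round(row["ratio"], 2))
```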

Question Posts (Seeking Help)

jpx 'hits[?ends_with(title, `?`)].{
  title: title,
  comments: num_comments
}' hn_front.json

Time-Based Analysis

Posts by Hour

jpx 'hits[*].{
  title: title,
  hour: split(created_at, `T`)[1] | split(@, `:`)[0]
} | group_by(@, &hour)' hn_front.json

Recent vs Older Content

jpx 'hits | {
  recent: [?created_at_i > `1700000000`] | length(@),
  older: [?created_at_i <= `1700000000`] | length(@)
}' hn_front.json
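The cutoff `1700000000` is a Unix timestamp in seconds, which corresponds to 2023-11-14T22:13:20Z. A short Python sketch for converting between `created_at_i` values and readable dates when choosing a different cutoff, and for extracting the UTC hour used in the hourly grouping; `to_utc` and `to_epoch` are hypothetical helper names:

```python
from datetime import datetime, timezone

def to_utc(epoch_seconds):
    """Convert a created_at_i value to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)

def to_epoch(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())

print(to_utc(1700000000).isoformat())   # cutoff used above
print(to_utc(1693926000).hour)          # sample hit from the Data Structure section
print(to_epoch(2024, 1, 1))             # a cutoff for "since 2024"
```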

Author Analysis

Most Active Authors

jpx 'hits[*].author | frequencies(@) | items(@) | sort_by(@, &[1]) | reverse(@) | [:10]' hn_front.json

Author Vocabulary Fingerprint

# What words does a specific author use most?
jpx 'hits[?author == `dang`].title 
  | join(` `, @) 
  | tokens(@) 
  | remove_stopwords(@) 
  | frequencies(@)' hn_front.json

Cross-Dataset Comparisons

Compare Ask HN vs Show HN Vocabulary

# Run separately and compare results
jpx 'hits[*].title | join(` `, @) | tokens(@) | remove_stopwords(@) | stems(@) | frequencies(@)' hn_ask.json > ask_vocab.json
jpx 'hits[*].title | join(` `, @) | tokens(@) | remove_stopwords(@) | stems(@) | frequencies(@)' hn_show.json > show_vocab.json

Search Query Analysis

# Fetch multiple topics and compare
curl -s "https://hn.algolia.com/api/v1/search?query=kubernetes&hitsPerPage=30" > hn_k8s.json
curl -s "https://hn.algolia.com/api/v1/search?query=docker&hitsPerPage=30" > hn_docker.json

# Compare title vocabulary
jpx 'hits[*].title | join(` `, @) | tokens(@) | remove_stopwords(@) | frequencies(@)' hn_k8s.json
jpx 'hits[*].title | join(` `, @) | tokens(@) | remove_stopwords(@) | frequencies(@)' hn_docker.json

Building a Search Index

Use the NLP functions to prepare content for search:

# Create searchable document representations
jpx 'hits[:20] | [*].{
  id: objectID,
  title: title,
  normalized_title: title | lower(@) | remove_accents(@),
  title_tokens: title | tokens(@) | remove_stopwords(@) | stems(@),
  has_text: story_text != `null`,
  text_keywords: story_text 
    | default(@, `""`)
    | regex_replace(@, `<[^>]+>`, ` `)
    | tokens(@)
    | remove_stopwords(@)
    | stems(@)
    | unique(@)
    | slice(@, `0`, `20`)
}' hn_ask.json
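Once each document has a token list, a simple inverted index maps every token to the IDs of the documents that contain it. The Python sketch below assumes records shaped like the output above (an `id` and a `title_tokens` list); the helper names are hypothetical and the sample records, including their stemmed tokens, are made up for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc in docs:
        for token in doc["title_tokens"]:
            index[token].add(doc["id"])
    return {token: sorted(ids) for token, ids in index.items()}

def lookup(index, *tokens):
    """IDs of documents containing every given token (AND query)."""
    sets = [set(index.get(t, [])) for t in tokens]
    return sorted(set.intersection(*sets)) if sets else []

# Made-up records in the shape produced by the query above
docs = [
    {"id": "1", "title_tokens": ["ask", "hn", "learn", "rust"]},
    {"id": "2", "title_tokens": ["ask", "hn", "best", "databas"]},
]
index = build_inverted_index(docs)
print(lookup(index, "ask", "rust"))
```

Query terms need the same normalization as the indexed tokens (lowercasing, stopword removal, stemming), otherwise lookups will miss.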

Complete Analysis Pipeline

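The query below combines the techniques above into a single summary: story metadata, top stemmed keywords, top bigrams, the number of question posts, and the average title length.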

jpx '{
  meta: {
    total_stories: length(hits),
    total_points: sum(hits[*].points),
    avg_comments: avg(hits[*].num_comments)
  },
  top_keywords: hits[*].title 
    | join(` `, @) 
    | tokens(@) 
    | remove_stopwords(@) 
    | stems(@) 
    | frequencies(@)
    | items(@)
    | sort_by(@, &[1])
    | reverse(@)
    | [:10][*][0],
  top_bigrams: hits[*].title
    | join(` `, @)
    | lower(@)
    | bigrams(@)
    | [*].join(` `, @)
    | frequencies(@)
    | items(@)
    | sort_by(@, &[1])
    | reverse(@)
    | [:5][*][0],
  question_posts: length(hits[?ends_with(title, `?`)]),
  avg_title_words: avg(hits[*].title | [*].word_count(@))
}' hn_front.json

Using Query Libraries

Instead of typing these queries repeatedly, save them in a .jpx query library. See examples/hacker-news.jpx for a ready-to-use library:

# List available queries
jpx -Q examples/hacker-news.jpx --list-queries

# Run common analyses
jpx -Q examples/hacker-news.jpx:title-keywords hn_front.json
jpx -Q examples/hacker-news.jpx:top-stories hn_front.json
jpx -Q examples/hacker-news.jpx:summary hn_front.json

# Output as table
jpx -Q examples/hacker-news.jpx:most-discussed -t hn_front.json

See Query Files for more on creating and using query libraries.


Tips for HN Data

  1. HTML in story_text: Always strip HTML tags before NLP processing

    regex_replace(story_text, `<[^>]+>`, ` `)
    
  2. Rate limiting: The Algolia API is generous, but cache results locally while you iterate on queries

  3. Pagination: Use the page parameter for more results; pages are zero-based, so page=2 is the third page

    curl "https://hn.algolia.com/api/v1/search?tags=ask_hn&page=2&hitsPerPage=50"
    
  4. Date filtering: Use numericFilters for time ranges

    curl "https://hn.algolia.com/api/v1/search?numericFilters=created_at_i>1700000000"
    
  5. Combining with fuzzy search: Use jpx’s fuzzy_search on processed results

    jpx 'fuzzy_search(hits, `title`, `database`)' hn_front.json
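
Tips 2 and 3 combine naturally: fetch each page once, cache it on disk, and stop when `nbPages` is reached. A Python sketch under stated assumptions: the fetch function is passed in so the loop can be exercised without network access, the cache file naming is arbitrary, and `fetch_all_pages` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

def fetch_all_pages(fetch_page, cache_dir, max_pages=5):
    """Collect hits across pages, caching each raw page as JSON on disk.

    fetch_page(page) must return the decoded response dict for that
    zero-based page number.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    hits, page = [], 0
    while page < max_pages:
        cache_file = cache_dir / f"page_{page}.json"
        if cache_file.exists():
            data = json.loads(cache_file.read_text())
        else:
            data = fetch_page(page)
            cache_file.write_text(json.dumps(data))
        hits.extend(data.get("hits", []))
        page += 1
        if page >= data.get("nbPages", 0):
            break
    return hits

# Stand-in for a real HTTP call, returning two pages of fake hits
def fake_fetch(page):
    return {"hits": [{"objectID": str(page)}], "page": page, "nbPages": 2}

with tempfile.TemporaryDirectory() as tmp:
    print(len(fetch_all_pages(fake_fetch, tmp)))
```

In real use, `fetch_page` would request the search URL with the given `page` value and decode the JSON body; a second run then reads from the cache instead of calling the API again.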