Data Refinery — Blue Hen RE

Provenance

Source: https://rss.arxiv.org/rss/cs.IR · extractor heuristic · chunking sentence

OKF dataset card

---
type: Dataset
title: arXiv cs.IR daily listing (RSS)
description: "Point-in-time collection: 1 doc, 35 chunks."
tags: [dataset, datalab]
timestamp: "2026-07-02T23:50:49Z"
datasetId: 20260702-185049-arxiv-cs-ir-daily-listing--rss
---

# Provenance

Point-in-time collection run `20260702-185049-arxiv-cs-ir-daily-listing--rss`: 1 document,
35 chunks (sentence strategy, ~16,206 tokens). Raw artifacts live at
`data/datalab/20260702-185049-arxiv-cs-ir-daily-listing--rss/` (`docs.jsonl`, `chunks.jsonl`, `manifest.json`);
the trace `20260702-185048-55bb350d` is in `data/traces/`.

# Sources

| Source | Status |
|--------|--------|
| `https://rss.arxiv.org/rss/cs.IR` | ok |

# Consumption

Chunks feed the training worker's pair builder and the retrieval index.
Filter on `token_estimate` and `strategy` in `chunks.jsonl`. See the
[data pipeline](/platform/data-pipeline.md) concept for stage details.
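
The filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: it assumes `chunks.jsonl` holds one JSON object per line and that `token_estimate` is numeric. The `filter_chunks` helper, the `text` field, the token bounds, and the inline sample records are all hypothetical; only the `token_estimate` and `strategy` field names come from this card.

```python
import json
from typing import Iterable, Iterator


def filter_chunks(
    lines: Iterable[str],
    strategy: str = "sentence",
    min_tokens: int = 0,
    max_tokens: int = 1024,
) -> Iterator[dict]:
    """Yield chunk records matching `strategy` with token_estimate in range.

    Hypothetical helper: the schema beyond `strategy` and `token_estimate`
    is assumed, not documented by the dataset card.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        record = json.loads(line)
        if record.get("strategy") != strategy:
            continue
        tokens = record.get("token_estimate")
        if not isinstance(tokens, (int, float)):
            continue  # skip records without a usable estimate
        if min_tokens <= tokens <= max_tokens:
            yield record


# Made-up records standing in for chunks.jsonl; values are illustrative only.
sample = [
    '{"strategy": "sentence", "token_estimate": 480, "text": "a"}',
    '{"strategy": "sentence", "token_estimate": 12, "text": "b"}',
    '{"strategy": "fixed", "token_estimate": 500, "text": "c"}',
]
kept = list(filter_chunks(sample, min_tokens=64, max_tokens=1024))
print(len(kept))
```

Because the function accepts any iterable of lines, an open file handle for `chunks.jsonl` can be passed in place of `sample`, which streams the file instead of loading it whole.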

Sample chunks (first 20, sanitized)

cs.IR updates on arXiv.org http://rss.arxiv.org/rss/cs.IR cs.IR updates on the arXiv.org e-print archive. http://www.rssboard.org/rss-specification en-us Thu, 02 Jul 2026 04:00:09 +0000 rss-help@arxiv.org Thu, 02 Jul 2026 00:00:00 -0400 Sunday Saturday From "Strings" to "Things" for Personal Knowledge Graphs: Evaluating LLM Triple Extraction for Recommendation Systems https://arxiv.org/abs/2607.00

fd6e271a030f1a34

The Answer and an Approach to Bridging Vocabulary Gaps https://arxiv.org/abs/2607.00004 arXiv:2607.00004v1 Announce Type: new Abstract: While advanced foundation models like ModernBERT significantly outperform older architectures in dense retrieval, they surprisingly lag behind the aging BERT-base baseline in learned sparse retrieval (LSR). We identify the root cause as the \textit{Vocabulary Gap}

fd6e271a030f1a34

We've released our code and models.\footnote{https://anonymous.4open.science/r/vocab-transfer/. All details included.} oai:arXiv.org:2607.00004v1 cs.IR cs.AI cs.LG Thu, 02 Jul 2026 00:00:00 -0400 new http://creativecommons.org/licenses/by/4.0/ 10.1145/3805712.3809724 Zhichao Geng, Yang Yang Topological Void Analysis A Mathematical Framework for Systematic Technical Innovation Discovery in Knowledg

fd6e271a030f1a34

Applied to ~140k indexed documents, TVA generates 2,128 invention candidates across 96 targets; 90% survive automated quality filtering, yielding 191 REVISE and 1 APPROVE verdict from four-specialist adversarial review (0.05% end-to-end). Two case studies demonstrate the framework surfaces non-obvious connective tissue rather than merely obvious related pairs. oai:arXiv.org:2607.00005v1 cs.IR cs.A

fd6e271a030f1a34

Our code is available at https://github.com/MLAI-Yonsei/BaRA-Agent. oai:arXiv.org:2607.00007v1 cs.IR cs.AI Thu, 02 Jul 2026 00:00:00 -0400 new http://creativecommons.org/licenses/by/4.0/ Soojeong Lee, Joseph Lee, Yongseong Cho, Sunjae Kim, Youngwoo Moon, Kyungwoo Song SchemaRAG: Dynamic Large Schema Reduction for LLM-driven Structured Information Extraction https://arxiv.org/abs/2607.00008 arXiv:2

fd6e271a030f1a34

However, there are two key obstacles in the CRS domain: evaluation and access to training data. Evaluating CRSs through real human studies is more critical than for traditional recommender systems, yet such studies are both costly and time-consuming. Moreover, CRS interaction data are often difficult to obtain for model training due to privacy concerns. Large language model (LLM)-based user simula

fd6e271a030f1a34

SkillSelect-Serve represents raw skills as structured Skill Services with functional descriptions, dependencies, context cost, risk, and QoS-related attributes. A local Micro-Agent Requirement Planner converts natural-language tasks into structured service requirements, while a shared discovery backbone retrieves candidate services from a large registry. The framework then performs dual-granularit

fd6e271a030f1a34

PRA-RAG samples multiple combinations of retrieved texts and utilizes geometric structures in the embedding space to identify a robust subset, from which a stable aggregated representation is derived. We provide theoretical bounds on the maximum impact of poisoned retrieved content and establish a quantitative measure of RAG's robustness. Experiments across multiple benchmarks and RAG architecture

fd6e271a030f1a34
