arXiv cs.IR daily listing (RSS)
1 docs · 35 chunks · created Thu, 02 Jul 2026 23:50:49 GMT
Provenance
Source: https://rss.arxiv.org/rss/cs.IR · extractor heuristic · chunking sentence
OKF dataset card
---
type: Dataset
title: arXiv cs.IR daily listing (RSS)
description: "Point-in-time collection: 1 docs, 35 chunks."
tags: [dataset, datalab]
timestamp: "2026-07-02T23:50:49Z"
datasetId: 20260702-185049-arxiv-cs-ir-daily-listing--rss
---

# Provenance

Point-in-time collection run `20260702-185049-arxiv-cs-ir-daily-listing--rss`: 1 document, 35 chunks (sentence strategy, ~16206 tokens). Raw artifacts live at `data/datalab/20260702-185049-arxiv-cs-ir-daily-listing--rss/` (docs.jsonl, chunks.jsonl, manifest.json); trace `20260702-185048-55bb350d` in `data/traces/`.

# Sources

| Source | Status |
|--------|--------|
| `https://rss.arxiv.org/rss/cs.IR` | ok |

# Consumption

Chunks feed the training worker's pair builder and the retrieval index. Filter on `token_estimate` and `strategy` in `chunks.jsonl`. See the [data pipeline](/platform/data-pipeline.md) concept for stage details.
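The filtering step mentioned under Consumption could look roughly like the sketch below. It assumes each line of `chunks.jsonl` is one JSON object carrying the `strategy` and `token_estimate` fields named in the card. The function name `filter_chunks` and the default bounds of 32 and 1024 tokens are hypothetical, chosen only for illustration; they are not values defined by the pipeline.

```python
import json
from typing import Iterable, Iterator


def filter_chunks(
    lines: Iterable[str],
    strategy: str = "sentence",
    min_tokens: int = 32,     # hypothetical lower bound, not a pipeline default
    max_tokens: int = 1024,   # hypothetical upper bound, not a pipeline default
) -> Iterator[dict]:
    """Yield chunk records with a matching `strategy` and an in-range `token_estimate`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        record = json.loads(line)
        if record.get("strategy") != strategy:
            continue
        tokens = record.get("token_estimate")
        if not isinstance(tokens, (int, float)):
            continue  # skip records without a numeric estimate
        if min_tokens <= tokens <= max_tokens:
            yield record
```

Because the function accepts any iterable of lines, an open file handle on `chunks.jsonl` can be passed directly, and records are streamed rather than loaded all at once.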
Sample chunks (first 20, sanitized)
cs.IR updates on arXiv.org http://rss.arxiv.org/rss/cs.IR cs.IR updates on the arXiv.org e-print archive. http://www.rssboard.org/rss-specification en-us Thu, 02 Jul 2026 04:00:09 +0000 rss-help@arxiv.org Thu, 02 Jul 2026 00:00:00 -0400 Sunday Saturday From "Strings" to "Things" for Personal Knowledge Graphs: Evaluating LLM Triple Extraction for Recommendation Systems https://arxiv.org/abs/2607.00
The Answer and an Approach to Bridging Vocabulary Gaps https://arxiv.org/abs/2607.00004 arXiv:2607.00004v1 Announce Type: new Abstract: While advanced foundation models like ModernBERT significantly outperform older architectures in dense retrieval, they surprisingly lag behind the aging BERT-base baseline in learned sparse retrieval (LSR). We identify the root cause as the \textit{Vocabulary Gap}
We've released our code and models.\footnote{https://anonymous.4open.science/r/vocab-transfer/. All details included.} oai:arXiv.org:2607.00004v1 cs.IR cs.AI cs.LG Thu, 02 Jul 2026 00:00:00 -0400 new http://creativecommons.org/licenses/by/4.0/ 10.1145/3805712.3809724 Zhichao Geng, Yang Yang Topological Void Analysis A Mathematical Framework for Systematic Technical Innovation Discovery in Knowledg
Applied to ~140k indexed documents, TVA generates 2,128 invention candidates across 96 targets; 90% survive automated quality filtering, yielding 191 REVISE and 1 APPROVE verdict from four-specialist adversarial review (0.05% end-to-end). Two case studies demonstrate the framework surfaces non-obvious connective tissue rather than merely obvious related pairs. oai:arXiv.org:2607.00005v1 cs.IR cs.A
Our code is available at https://github.com/MLAI-Yonsei/BaRA-Agent. oai:arXiv.org:2607.00007v1 cs.IR cs.AI Thu, 02 Jul 2026 00:00:00 -0400 new http://creativecommons.org/licenses/by/4.0/ Soojeong Lee, Joseph Lee, Yongseong Cho, Sunjae Kim, Youngwoo Moon, Kyungwoo Song SchemaRAG: Dynamic Large Schema Reduction for LLM-driven Structured Information Extraction https://arxiv.org/abs/2607.00008 arXiv:2
However, there are two key obstacles in the CRS domain: evaluation and access to training data. Evaluating CRSs through real human studies is more critical than for traditional recommender systems, yet such studies are both costly and time-consuming. Moreover, CRS interaction data are often difficult to obtain for model training due to privacy concerns. Large language model (LLM)-based user simula
SkillSelect-Serve represents raw skills as structured Skill Services with functional descriptions, dependencies, context cost, risk, and QoS-related attributes. A local Micro-Agent Requirement Planner converts natural-language tasks into structured service requirements, while a shared discovery backbone retrieves candidate services from a large registry. The framework then performs dual-granularit
PRA-RAG samples multiple combinations of retrieved texts and utilizes geometric structures in the embedding space to identify a robust subset, from which a stable aggregated representation is derived. We provide theoretical bounds on the maximum impact of poisoned retrieved content and establish a quantitative measure of RAG's robustness. Experiments across multiple benchmarks and RAG architecture