Commit Graph

5 Commits

Author SHA1 Message Date
wasrusgen
c97b8dce3c parsers: skip sponsored/ad URLs (cpc/sponsored=1) — they expire in 2-3 hours
A user reported that clicking prices in the matrix led to the 'Произошла ошибка!' ("An error occurred!") message on the OZON home page.
Cause: the parsers captured /product/?sponsored=1&cpc=Jtiito95... links, which died after a few hours.

Fix:
- ozon.py: skip href with 'sponsored=1', '/promo/', 'cpc='. Strip query string from final URL.
- yamarket.py: skip 'sponsored=1', 'cpc=', 'advUuid' (Я.Маркет sponsored marker)
- citilink.py: strip query string from final URL (defensive)

Now matrix links go to canonical product pages that don't expire.
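
A minimal sketch of the filtering described above, assuming the parsers work on plain href strings. The helper names (is_sponsored, canonical_url, clean_links) are hypothetical and not taken from ozon.py or yamarket.py; only the marker strings come from this commit message.

```python
from urllib.parse import urlsplit

# Ad markers listed in the commit message.
OZON_AD_MARKERS = ("sponsored=1", "/promo/", "cpc=")
YAMARKET_AD_MARKERS = ("sponsored=1", "cpc=", "advUuid")


def is_sponsored(href, markers):
    """Return True if the link contains any of the ad markers."""
    return any(marker in href for marker in markers)


def canonical_url(href):
    """Drop the query string, keeping scheme, host and path."""
    return urlsplit(href)._replace(query="").geturl()


def clean_links(hrefs, markers):
    """Skip sponsored links and strip the query string from the rest."""
    return [canonical_url(h) for h in hrefs if not is_sponsored(h, markers)]
```

With these helpers, a sponsored link is dropped before it reaches the matrix, and a link such as /product/a-1/?from=search is stored as /product/a-1/.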
2026-05-11 17:20:59 +03:00
wasrusgen
b27cf02aa2 yamarket: clean React JSON noise + extract title from URL slug
Я.Маркет renders a SnippetConstructor widget with its JSON state INSIDE the a tag.
As a result, link.get_text() returns junk like {'widgets':{...}}.

Fix:
- copy.copy(card), then remove <script>/<noscript>/<noframes>/<template>
- The title is now taken from the URL slug as the first priority (always clean)
- _slug_to_title: transliteration and capitalization
  'bosch-kgn39ul30u-dvukhkamernyy-kholodilnik-no-frost-seryy-metallik' →
  'Bosch KGN39UL30U Двухкамерный Холодильник NoFrost Серый Металлик'
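
To illustrate why the tag removal matters, here is a rough standard-library sketch that collects only the visible text of a card and ignores anything inside script/noscript/noframes/template. It uses html.parser instead of the copy.copy(card) approach from this commit, and the names VisibleTextExtractor and visible_text are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose contents should never end up in the title text.
SKIPPED_TAGS = {"script", "noscript", "noframes", "template"}


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes that are not nested inside a skipped tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def visible_text(html):
    """Return the card text without embedded JSON state or fallbacks."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

For a link whose markup contains a script tag with {"widgets":{}} followed by a span with the product name, visible_text returns only the product name.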
2026-05-11 16:30:34 +03:00
wasrusgen
839e775151 yamarket: rewrite for /card/{slug}/{id} URL pattern (Я.Маркет 2026)
- Old /product--{id} URLs deprecated
- Walks up from a[href*='/card/'] to nearest article/zone-div
- Extracts title from link text or h2/h3/itemprop=name
- Price: min from card text (with sanity bounds 100..10M)
- Image: filters out yastatic / _next placeholders
- Rating: '4.7★' or '4.7 N оценок' pattern
- Reviews: 'N отзывов' / 'N оценок'
- Stores count: 'от N магазинов / предложений'
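
The text patterns above could be approximated with regular expressions along these lines. This is a sketch, not the code in yamarket.py: the regexes, the function names, and the assumption that prices are followed by a ₽ sign are all illustrative. Only the bounds (100..10M) and the Russian phrases come from the commit message.

```python
import re

# A price is digits (optionally grouped by spaces) followed by the ruble sign.
PRICE_RE = re.compile(r"(\d{1,3}(?:\s\d{3})+|\d+)\s*₽")
# '4.7★' or '4.7 N оценок'
RATING_RE = re.compile(r"(\d[.,]\d)\s*★|(\d[.,]\d)\s+\d+\s+оцен")
# 'N отзывов' / 'N оценок'
REVIEWS_RE = re.compile(r"(\d+)\s+(?:отзыв|оцен)")
# 'от N магазинов' / 'от N предложений'
STORES_RE = re.compile(r"от\s+(\d+)\s+(?:магазин|предложен)")


def extract_min_price(card_text, low=100, high=10_000_000):
    """Return the smallest price within the sanity bounds, or None."""
    prices = []
    for match in PRICE_RE.finditer(card_text):
        value = int(re.sub(r"\s", "", match.group(1)))
        if low <= value <= high:
            prices.append(value)
    return min(prices) if prices else None


def extract_rating(card_text):
    """Return the rating as a float, or None if no pattern matches."""
    match = RATING_RE.search(card_text)
    if not match:
        return None
    return float((match.group(1) or match.group(2)).replace(",", "."))


def extract_count(pattern, card_text):
    """Return the first integer captured by the pattern, or None."""
    match = pattern.search(card_text)
    return int(match.group(1)) if match else None
```

The sanity bounds keep small numbers such as delivery fees from being picked as the minimum price.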
2026-05-11 16:26:28 +03:00
wasrusgen
d5f290bd0a backend: Playwright + Chromium for JS-rendered sites (Я.Маркет, OZON fallback)
DOCKERFILE:
- + Chromium system deps (libnss3, libxkbcommon0, libgbm1, libgtk-3-0, etc.)
- + RUN python -m playwright install chromium (~150MB)
- + ENV PLAYWRIGHT_BROWSERS_PATH

REQUIREMENTS:
- + playwright >= 1.45

PARSERS:
- new playwright_engine.py — singleton browser, isolated context per request,
  blocks images/fonts/CSS to save memory, waits for selector + JS hydration
- yamarket.py — rewritten to use Playwright (Я.Маркет is React SPA)
- ozon.py — Playwright fallback when composer-api returns challenge (403)
- wb.py — exponential backoff on 429, still uses direct HTTP (JSON API, no JS needed)
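
A rough sketch of what an engine like playwright_engine.py might look like, using the Playwright async API. The function names, the default timeout, and the choice of wait_until="domcontentloaded" are assumptions; the sketch waits only for a selector and does not model the extra hydration wait mentioned above.

```python
import asyncio

# Resource types aborted to save memory (images, fonts, CSS).
BLOCKED_RESOURCE_TYPES = frozenset({"image", "font", "stylesheet"})

_playwright = None
_browser = None
_lock = None


def should_block(resource_type):
    """Return True for resource types that are not needed for parsing."""
    return resource_type in BLOCKED_RESOURCE_TYPES


async def get_browser():
    """Launch Chromium once and reuse it for all requests (singleton)."""
    global _playwright, _browser, _lock
    # Imported lazily so this module loads even without Playwright installed.
    from playwright.async_api import async_playwright

    if _lock is None:
        _lock = asyncio.Lock()
    async with _lock:
        if _browser is None:
            _playwright = await async_playwright().start()
            _browser = await _playwright.chromium.launch(headless=True)
    return _browser


async def _route_handler(route):
    """Abort blocked resource types, let everything else through."""
    if should_block(route.request.resource_type):
        await route.abort()
    else:
        await route.continue_()


async def fetch_rendered_html(url, wait_selector, timeout_ms=15000):
    """Render a page in a fresh context and return its HTML."""
    browser = await get_browser()
    # A new context per request isolates cookies and storage.
    context = await browser.new_context()
    try:
        await context.route("**/*", _route_handler)
        page = await context.new_page()
        await page.goto(url, wait_until="domcontentloaded", timeout=timeout_ms)
        await page.wait_for_selector(wait_selector, timeout=timeout_ms)
        return await page.content()
    finally:
        await context.close()
```

Closing the context in a finally block keeps the shared browser from accumulating pages when a selector wait times out.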

STRATEGY (Hybrid Path C):
- Я.Маркет: Playwright (rendering JS)
- OZON: composer-api first, Playwright fallback
- WB: direct HTTP with backoff (JSON API, fast)
- DNS: kept but lower priority (Qrator hard to crack)
- No more proxy needed for primary path
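
The backoff used for WB can be sketched as follows, assuming the HTTP call is wrapped in a callable that raises on a 429 response. RateLimited, fetch_with_backoff, and the retry and delay defaults are hypothetical, not values taken from wb.py.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the (hypothetical) fetch callable on HTTP 429."""


def fetch_with_backoff(fetch, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(), doubling the wait after every 429 response."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except RateLimited:
            if attempt == retries:
                raise  # out of retries, surface the error to the caller
            # 1x, 2x, 4x, ... the base delay, plus up to 10% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The sleep parameter is injectable so that the retry schedule can be checked without actually waiting.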

DEPLOY: removed PROXY_STATIC_LIST from .env, expect ~5min for first build (Chromium download)
2026-05-11 13:25:05 +03:00
wasrusgen
82425dbd88 backend: Proxy6 pool + parsers WB / OZON / Я.Маркет / DNS
PROXY POOL (app/proxy_pool.py):
- Loads active proxies from Proxy6.net API every 10 min
- Random rotation per request via proxied_client(timeout, headers)
- Graceful fallback to direct HTTP if PROXY6_TOKEN not set
- Config: PROXY6_TOKEN env var
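
A simplified sketch of the pool behaviour listed above: periodic refresh, random rotation, and a direct-HTTP fallback when no token is configured. The ProxyPool class and its loader callable are hypothetical; the actual Proxy6.net API call is deliberately left out and must be supplied by the caller.

```python
import random
import time


class ProxyPool:
    """Caches a proxy list and hands out a random entry per request."""

    def __init__(self, token, loader, refresh_seconds=600, clock=time.monotonic):
        self._token = token
        self._loader = loader  # callable(token) -> iterable of proxy URLs
        self._refresh_seconds = refresh_seconds  # 10 minutes by default
        self._clock = clock
        self._proxies = []
        self._loaded_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._refresh_seconds:
            self._proxies = list(self._loader(self._token))
            self._loaded_at = now

    def pick(self):
        """Return a random proxy URL, or None to signal a direct connection."""
        if not self._token:
            return None
        self._refresh_if_stale()
        return random.choice(self._proxies) if self._proxies else None
```

A caller would treat a None result as "connect directly", which matches the graceful fallback when PROXY6_TOKEN is not set.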

PARSERS (app/parsers/):
- dns.py — refactored to use proxy_pool with retry+rotation on Qrator block
- wb.py — Wildberries JSON API (search.wb.ru), retries on 429
- ozon.py — OZON composer-api JSON (widgetStates extraction)
- yamarket.py — Я.Маркет HTML + embedded JSON parser
- __init__.py — enrich_one() fans out to all sources, aggregates min/max prices, max rating, sum reviews
- enrich_models() — batch enrich for AI by_category output
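
The aggregation step of enrich_one() might look roughly like this. It assumes each source returns a dict with optional price, rating and reviews keys; that shape and the name aggregate_offers are assumptions, and the fan-out to the individual parsers is omitted.

```python
def aggregate_offers(offers):
    """Combine per-source results: min/max price, max rating, summed reviews."""
    prices = [o["price"] for o in offers if o.get("price") is not None]
    ratings = [o["rating"] for o in offers if o.get("rating") is not None]
    return {
        "min_price": min(prices) if prices else None,
        "max_price": max(prices) if prices else None,
        "max_rating": max(ratings) if ratings else None,
        # Missing or None review counts contribute zero to the total.
        "total_reviews": sum(o.get("reviews") or 0 for o in offers),
    }
```

Sources that failed or returned nothing simply contribute no values, so a single blocked marketplace does not break the aggregate.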

NEW DIAGNOSTIC ENDPOINTS (main.py):
- GET /api/parse_wb?q=...&limit=N
- GET /api/parse_ozon?q=...&limit=N
- GET /api/parse_yamarket?q=...&limit=N
- GET /api/parse_all?q=... — fan-out + aggregate
- GET /api/proxy_status — pool diagnostics (count, token configured, age)

PODBOR (main.py):
- _enrich_ai_with_dns -> _enrich_ai_marketplaces (uses all sources)

DEPLOY: needs PROXY6_TOKEN in /opt/zov-tech/deploy/.env on VPS, then docker compose build + up -d backend
2026-05-11 12:18:04 +03:00