zov-tech/backend-py/app/parsers
wasrusgen 1a57374020 parsers: better image extraction — real product photos in report cards
CITILINK:
- Now reads data-src / data-original / srcset / src in priority order
- srcset → picks largest size variant (last in comma-list)
- Filters out only placeholders: URLs under _next/static/images and URLs containing 'placeholder'
- Accepts cs.citilink.ru / c.citilink.ru / images.citilink.ru product photos
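
The Citilink rules above can be sketched roughly as follows. This is a minimal illustration, not the code in citilink.py: the names `pick_citilink_image`, `largest_from_srcset` and `is_placeholder` are hypothetical, and the input is assumed to be a plain dict of an `<img>` tag's attributes.

```python
from typing import Mapping, Optional

# Attribute order taken from the commit message: lazy-load attrs first, src last.
ATTR_PRIORITY = ("data-src", "data-original", "srcset", "src")


def largest_from_srcset(srcset: str) -> Optional[str]:
    """Return the URL of the last srcset candidate (assumed to be the largest).

    Simplification: splits on commas, so URLs that themselves contain
    commas are not handled.
    """
    candidates = [part.strip() for part in srcset.split(",") if part.strip()]
    if not candidates:
        return None
    # Each candidate looks like "<url> <descriptor>"; keep only the URL.
    return candidates[-1].split()[0]


def is_placeholder(url: str) -> bool:
    """True for the two placeholder patterns named in the commit message."""
    lowered = url.lower()
    return "_next/static/images" in lowered or "placeholder" in lowered


def pick_citilink_image(attrs: Mapping[str, str]) -> Optional[str]:
    """Return the first non-placeholder image URL in attribute priority order."""
    for name in ATTR_PRIORITY:
        value = (attrs.get(name) or "").strip()
        if not value:
            continue
        url = largest_from_srcset(value) if name == "srcset" else value
        if url and not is_placeholder(url):
            return url
    return None
```

With this shape, a tag whose `src` is a placeholder but whose `srcset` lists real photos resolves to the last `srcset` entry, because `srcset` is checked before `src`.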

YANDEX.MARKET:
- Collects all img attrs (data-src, data-original, srcset, data-srcset, src)
- Prefers avatars.mds.yandex.net (real product CDN), skips yastatic (icons/logos)
- Auto-appends /300x300 suffix to avatars.mds URLs without size
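
A rough sketch of the Yandex.Market selection described above. The helper names are hypothetical, and two details are assumptions rather than facts from the commit: a URL "without size" is taken to mean one that does not end in `/<width>x<height>`, and the first non-yastatic URL is used as a fallback when no avatars.mds.yandex.net URL is present.

```python
import re
from typing import List, Mapping, Optional

# Attributes listed in the commit message, in the order given there.
IMG_ATTRS = ("data-src", "data-original", "srcset", "data-srcset", "src")

# Assumption: a sized avatars URL ends in "/<width>x<height>".
SIZE_SUFFIX = re.compile(r"/\d+x\d+$")


def collect_urls(attrs: Mapping[str, str]) -> List[str]:
    """Gather every candidate URL from the known image attributes."""
    urls: List[str] = []
    for name in IMG_ATTRS:
        value = (attrs.get(name) or "").strip()
        if not value:
            continue
        if name.endswith("srcset"):
            # srcset entries look like "<url> <descriptor>"; keep the URLs.
            urls.extend(p.split()[0] for p in value.split(",") if p.strip())
        else:
            urls.append(value)
    return urls


def pick_yamarket_image(attrs: Mapping[str, str]) -> Optional[str]:
    """Prefer avatars.mds.yandex.net, skip yastatic, add a default size."""
    urls = [u for u in collect_urls(attrs) if "yastatic" not in u]
    for url in urls:
        if "avatars.mds.yandex.net" in url:
            url = url.rstrip("/")
            return url if SIZE_SUFFIX.search(url) else url + "/300x300"
    # Assumed fallback: any remaining non-yastatic URL, else nothing.
    return urls[0] if urls else None
```

URLs that already carry a size are returned unchanged, so the `/300x300` suffix is never appended twice.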

ENRICH_ONE (aggregator):
- Image picked by source priority: yamarket > wb > ozon > citilink > dns
- Yamarket photos are cleanest (avatars.mds.yandex.net)
- WB has product photos via basket-XX.wbbasket.ru
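
The source-priority pick in the aggregator could look something like this. It is only a sketch: `pick_image` and the dict-of-URLs input are hypothetical, and only the priority order comes from the commit message.

```python
from typing import Mapping, Optional

# Priority order from the commit message: yamarket > wb > ozon > citilink > dns.
SOURCE_PRIORITY = ("yamarket", "wb", "ozon", "citilink", "dns")


def pick_image(images_by_source: Mapping[str, Optional[str]]) -> Optional[str]:
    """Return the image URL from the highest-priority source that has one."""
    for source in SOURCE_PRIORITY:
        url = images_by_source.get(source)
        if url:  # skips both missing keys and empty/None values
            return url
    return None
```

Keeping the order in a single tuple means a change in priority is a one-line edit.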
2026-05-11 23:43:25 +03:00
__init__.py parsers: better image extraction — real product photos in report cards 2026-05-11 23:43:25 +03:00
citilink.py parsers: better image extraction — real product photos in report cards 2026-05-11 23:43:25 +03:00
dns.py dns+ozon: 4 retries with proxy rotation (residential pool has dirty IPs) 2026-05-11 16:37:28 +03:00
ozon.py parsers: skip sponsored/ad URLs (cpc/sponsored=1) — they expire in 2-3 hours 2026-05-11 17:20:59 +03:00
playwright_engine.py playwright_engine: route through proxy_pool — random residential IP per request 2026-05-11 16:05:36 +03:00
wb.py wb: relevance filter — discard anti-bot trash products (dresses/shoes in fridge search) 2026-05-11 23:02:37 +03:00
yamarket.py parsers: better image extraction — real product photos in report cards 2026-05-11 23:43:25 +03:00