Sber

Data Engineer for VLM Training Data (GigaChat Vision)

Not specified
Office / on-site
mid
about 3 hours ago

Overview

Join Sber as a Data Engineer working on training data for Vision-Language Models (VLM), building data pipelines and serving the ML team's data needs. Sber is a leading financial institution in Russia, providing a wide range of banking and financial services.

What you'll do

  • Gather and structure the ML team's data needs for training, fine-tuning, evaluating, and improving the VLM.
  • Propose and implement ideas for data cleaning, filtering, deduplication, categorization, and generation pipelines.
  • Stay current with modern practices for building datasets for Vision-Language Models: image-text pairs, synthetic data, filtering, quality scoring, data mixture design, dataset versioning.
  • Own the infrastructure for data storage and preparation, including:
      • importing data from various sources: production, Common Crawl, open-source datasets, generated data;
      • validating and controlling data quality;
      • storing and versioning datasets;
      • exporting data in formats suitable for model training.
  • Design and implement data processing pipelines at scale, up to tens of billions of images.
  • Develop pipelines for generating synthetic data for training and improving the VLM.
  • Collect statistics on data, and build reports and visualizations for analyzing the composition, quality, and coverage of datasets.
  • Ensure reproducibility, observability, and reliability of data processes.
  • Work closely with ML engineers, researchers, and the infrastructure team.