The data scientist will apply analytical, statistical, and programming skills to collect, process, and analyze large data sets related to CAS products and scientific domains such as chemistry and the life sciences. They will work with other scientists and stakeholders to understand business problems, develop data-driven solutions, and communicate insights using data visualization techniques. They will help formulate the strategy for data aggregation methods. They will also develop and implement a robust strategy for building deep data science capabilities, with a reliable team of experts, support roles, partnerships, and tools, to deliver complex analytics and AI/ML projects.
About Company
ACS International India Pvt Ltd. (ACSI India) is a wholly owned subsidiary of ACS International, Ltd USA and a part of the American Chemical Society. ACSI India represents products and services provided by ACS divisions, including CAS (SciFinder, Biofinder Discovery Platforms and STN®) to the world's most important scientific companies, government organizations, global patent offices and academic institutions to promote research and discovery.
Position Responsibilities
- Create innovative solutions using generative AI techniques, such as developing and fine-tuning large language models (LLMs), building text generation systems, and building chatbots.
- Devise and utilize algorithms and models to mine big-data stores; perform data and error analysis to improve models; clean and validate data for uniformity and accuracy
- Improve LLM performance on tasks such as question answering, information extraction, long-form text generation, translation, and summarization, so that outputs are factually accurate and complete (fewer hallucinations and omissions), especially in science tasks.
- Apply a combination of NLP, machine learning, deep learning, statistical methods, and classical algorithms in areas such as forecasting, prediction, attribution, recommendation, user experience measurement, journey orchestration, and personalization.
- Create a fundamental representation of the problem and design suitable experimental methodologies, including algorithm selection and text extraction using NLP/NER.
- Apply a range of optimization techniques, including EDA, data mining, and other data-driven optimization methods.
- Implement and use MLOps to automate (CI/CD/CT) and streamline ML workflows for deploying, monitoring, and managing ML models in production at scale (desirable).
- Develop visuals and reports that effectively convey model outcomes, benefits, and insights to scientists and stakeholders, using visualization tools such as Power BI and Tableau.
The Ideal Candidate Will Have
The ideal candidate should be an expert in advanced analytics and AI/ML techniques, data mining and modeling techniques, and programming, and should possess the knowledge (intuition and theory, including mathematics) needed for real-world business applications. They should be passionate about solving important problems in the science landscape through data aggregation and analytics. They should contribute intellectually to AI operationalization.
Position Requirements
- Bachelor's degree or higher in computer science, statistics, mathematics, engineering, or related fields
- 5-8 years of post-degree experience working with diverse data sets
- Skills in generative AI, large language models (LLMs), retrieval-augmented generation (RAG), prompt engineering, and fine-tuning LLMs using innovative approaches (e.g., chain-of-thought prompting and self-consistency, chunking and conditioning based on local similarity, dialogue resolution, and reinforcement learning from human feedback). The ideal candidate must also be able to create frameworks for evaluating LLM output based on human-expert-in-the-loop review and quantitative metrics.
- Experience building applications for public cloud environments (AWS preferred).
- Proficiency in programming languages such as Java/Scala/JavaScript/TypeScript/Python/R.
- Proficiency in Linux/Unix environments.
- Experience with database technologies (relational, NoSQL, RDF/triple store, vector database).
- Self-motivated and proactive, with excellent communication skills
- Experience in data visualization using tools like Power BI is desirable