Working with a Real-World Problem in Data Science
Working with a real-world dataset is not as easy as it looks while you are learning. Working with data from Kaggle or Zindi is much easier than collecting the data yourself.
When you work on a real-world problem, you don’t always have a dataset ready. The first step is to gather your data. Data comes in different formats and from different sources, so there are several techniques for collecting it.
Data Collection
Data collection is the most important part of data science because it plays a major role in determining how well the analysis goes. Data comes in different formats, such as CSV, TSV, XLSX, and HTML.
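To make the point about formats concrete, here is a minimal sketch that reads the same records from CSV and TSV text using Python's standard `csv` module. The sample records and the `read_delimited` helper are made up for illustration; real data would usually come from files rather than strings.

```python
import csv
import io


def read_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Hypothetical sample data: the same two records in two formats.
csv_text = "name,age\nAda,36\nLinus,29\n"
tsv_text = "name\tage\nAda\t36\nLinus\t29\n"

csv_rows = read_delimited(csv_text)
tsv_rows = read_delimited(tsv_text, delimiter="\t")

# Once parsed, the original format no longer matters.
print(csv_rows == tsv_rows)
```

Note that every value comes back as a string, so numeric columns such as `age` still need converting before analysis.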
Data Collection Techniques
- Interviews
- Questionnaires and Surveys
- Observations
- Focus Groups
- Ethnographies, Oral History, and Case Studies
- Documents and Records
- Web Scraping
You can read more about these data collection techniques here: https://cyfar.org/data-collection-techniques
Data Cleaning
Once you have your data, the next step is to clean it. Data cleaning is the process of identifying and fixing or removing problems in the data. It can include removing unwanted observations, removing outliers, filling in missing values, creating calculated columns, and stripping unwanted symbols.
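The cleaning steps listed above can be sketched in plain Python. Everything here is an assumption made for illustration: the sample records, the `clean` function, the rule that an age above 120 is an outlier, and the choice to fill missing ages with the median.

```python
from statistics import median

# Hypothetical raw records with a duplicate, a missing age,
# an implausible age, and currency symbols in the income column.
raw = [
    {"name": "Ada", "age": "36", "income": "$4,000"},
    {"name": "Ada", "age": "36", "income": "$4,000"},   # duplicate row
    {"name": "Linus", "age": "", "income": "$3,500"},   # missing age
    {"name": "Grace", "age": "41", "income": "$5,200"},
    {"name": "Bob", "age": "410", "income": "$3,900"},  # outlier age
]


def clean(records, max_age=120):
    # 1. Remove exact duplicates while preserving order.
    seen, rows = set(), []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            rows.append(dict(record))

    # 2. Strip symbols and convert strings to numbers.
    for row in rows:
        row["income"] = float(row["income"].replace("$", "").replace(",", ""))
        row["age"] = int(row["age"]) if row["age"] else None

    # 3. Drop outliers (assumed rule: age above max_age is invalid).
    rows = [r for r in rows if r["age"] is None or r["age"] <= max_age]

    # 4. Fill missing ages with the median of the known ages.
    fill_value = median(r["age"] for r in rows if r["age"] is not None)
    for row in rows:
        if row["age"] is None:
            row["age"] = fill_value
        # 5. Add a calculated column (income is assumed to be monthly).
        row["annual_income"] = row["income"] * 12

    return rows


cleaned = clean(raw)
for row in cleaned:
    print(row)
```

The order of the steps matters: outliers are dropped before the median is computed so that an implausible value does not skew the fill value.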
Define your question
In data analysis, questions should be measurable, clear, and concise. They should be designed to qualify or disqualify a potential solution to a problem. In the advertising industry, for example, you might ask ‘Does age affect the rate at which people subscribe to this service?’ or ‘How does gender affect the type of advert people would like to see?’. Asking such questions helps you understand the problem you are solving. The answers can help you target people who are likely to use a particular product or subscribe to a particular channel.
Set Clear Measurement Priorities
This comes down to two decisions:
- Decide what to measure
- Decide how to measure
One of the key challenges in performance management is selecting what to measure. The priority here is to focus on quantifiable factors that are clearly linked to the drivers of success in the business.
Analyze your data
Data can be manipulated in a number of ways, such as plotting it, creating pivot tables, or grouping it by a particular category. Tools like pandas, Excel, Tableau, and Power BI are very useful for data analysis.
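As a rough illustration of grouping by a category, the sketch below computes a subscription rate per age group, which ties back to the advertising question about age and subscriptions. The records and the `subscription_rate_by_group` helper are hypothetical, and plain Python is used so the example runs without extra installs.

```python
from collections import defaultdict

# Hypothetical observations: (age_group, subscribed)
records = [
    ("18-29", True), ("18-29", False), ("18-29", True), ("18-29", True),
    ("30-44", False), ("30-44", True), ("30-44", False), ("30-44", False),
]


def subscription_rate_by_group(rows):
    """Return the share of subscribers within each group."""
    totals = defaultdict(int)
    subscribed = defaultdict(int)
    for group, did_subscribe in rows:
        totals[group] += 1
        subscribed[group] += int(did_subscribe)
    return {group: subscribed[group] / totals[group] for group in totals}


rates = subscription_rate_by_group(records)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%}")
```

A group-by in pandas or a pivot table in Excel performs the same aggregation; the loop simply makes the counting explicit.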
Interpret Results
After analyzing the data, the next step is to interpret the results. This is where you draw conclusions, such as whether a hypothesis is supported or rejected.
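One common way to decide whether a difference between two groups is more than chance is a two-proportion z-test. The sketch below is only one possible approach, not necessarily the right test for every dataset. The counts, the 0.05 significance level, and the function name are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two proportions."""
    p_a, p_b = success_a / n_a, success_b / n_b
    # Pooled proportion under the null hypothesis that both rates are equal.
    pooled = (success_a + success_b) / (n_a + n_b)
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: subscribers out of people shown the advert, per group.
z, p_value = two_proportion_z_test(120, 400, 90, 400)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"z = {z:.2f}, p = {p_value:.4f}, reject null: {reject_null}")
```

A small p-value suggests the two groups really do subscribe at different rates. A large one means the data does not show a difference, which is weaker than proving the rates are the same.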
The Conclusion
As you can see, data is not always readily available. When you collect and share it yourself, be careful with privacy and licenses: encrypt all personal data before releasing it to the public, read a website’s robots.txt before scraping it, and remove all access tokens and keys before sharing your code or data publicly.
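Checking robots.txt can be automated with Python's standard `urllib.robotparser` module. In this minimal sketch the robots.txt content, the `my-scraper` user agent, and the example.com URLs are all made up, and the rules are parsed from a string so that no network access is needed. For a real site you would point the parser at the live file with `set_url(...)` and `read()` instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) scraper may fetch each URL.
can_fetch_article = parser.can_fetch("my-scraper", "https://example.com/articles/1")
can_fetch_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

print(can_fetch_article, can_fetch_private)
```

Passing this check only means the site owner has not asked crawlers to stay away from that path. The site's terms of use and the license on the data still apply.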
Thanks for Reading.
Cheers!