Data cleaning experience

When you are used to experimenting on one specific dataset, simple steps such as tokenization and preprocessing are usually sufficient. But in the age of ever-larger data, data cleaning is becoming increasingly important and complex. After reading the English article Parsing Raw Data by Philip J. Guo, I found it well worth learning from and translating into Chinese.


Scientific researchers, engineers, and business analysts all work with data, and data analysis is a core part of their jobs. Data analysis is not just for "big data" practitioners; even the data on your laptop's hard drive is worth analyzing.

The first step in data analysis is cleaning the data. Raw data may come from a variety of different sources, including:

Web server logs

The output of a scientific instrument

Exported results from online questionnaires

Government data from the 1970s

Reports prepared by business consultants

What these sources have in common is this: you can never predict what bizarre format they will arrive in. The data is handed to you, and it has to be processed, but it may often be:

Incomplete (some fields are missing from some records)

Inconsistent (field names and structures differ between records)

Corrupted (some records may be damaged for various reasons)

Therefore, you must often write your own programs to clean this raw data and transform it into an easy-to-analyze format, a process often called data wrangling. What follows are some tips on how to clean data effectively; everything presented here can be implemented in any programming language.

Using assertions

This is the most important lesson: use assertions to uncover bugs in your code. Write down your assumptions about the data format in the form of assertions, and if you find data that contradicts an assertion, modify the assertion.

Are the records in order? If so, assert it! Does each record have 7 fields? If so, assert it! Is every field an odd number between 0 and 26? If so, assert it! In short, assert everything you can!

In an ideal world, all records should be neatly formatted and follow some kind of concise internal structure. But that's not the case in practice.

Write assertions until your eyes bleed, and even when they do bleed, keep writing them.

Programs that clean data are sure to crash a lot. That's good, because every crash means some bad data has violated your assumptions again. Improve your assertions repeatedly until the program can run through all the data successfully. But keep them as strict as possible, never too lenient, or they may not catch what you want them to catch. The worst outcome is not a program that fails to run, but one that silently produces results that are not what you want.
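The advice above can be sketched in a few lines. This is a minimal illustration, not code from the article: the 7-field, comma-separated format and the "odd number between 0 and 26" rule are the hypothetical assumptions used in the examples above.

```python
# Assertion-heavy cleaning: every format assumption is written down as
# an assert, so a violation crashes loudly instead of silently
# producing bad output.

def clean_record(line):
    fields = line.strip().split(",")
    assert len(fields) == 7, f"expected 7 fields, got {len(fields)}: {line!r}"
    value = int(fields[3])
    assert 0 <= value <= 26 and value % 2 == 1, f"bad value {value} in {line!r}"
    return fields

record = clean_record("a,b,c,13,e,f,g")
```

When an assertion fires, the message carries the offending record, so you can decide whether to fix the code or relax the assertion.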

Don't silently skip records

Some records in the raw data are incomplete or corrupted, and the cleaning program has to skip them. Silently skipping these records is not the best approach, because you won't know what data is missing. Therefore, it is better to:

Print out a warning message, so you can go back later and find out what went wrong.

Record how many records were skipped in total and how many were successfully cleaned. Doing so gives you a general sense of the quality of the raw data: if only 0.5% were skipped, that's not too bad; but if 35% were skipped, it's time to look at what's wrong with the data or the code.
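Both points can be combined in one loop. This is a sketch under the same illustrative 7-field assumption as before; the warning wording is my own.

```python
import sys

# Skip bad records loudly: warn about each one and report totals at the
# end, so the quality of the raw data is visible at a glance.

def clean_all(lines):
    cleaned, skipped = [], 0
    for i, line in enumerate(lines, start=1):
        fields = line.strip().split(",")
        if len(fields) != 7:
            skipped += 1
            print(f"warning: skipping record {i}: {line!r}", file=sys.stderr)
            continue
        cleaned.append(fields)
    print(f"cleaned {len(cleaned)} records, skipped {skipped}", file=sys.stderr)
    return cleaned, skipped

good, skipped = clean_all(["a,b,c,d,e,f,g", "broken line", "1,2,3,4,5,6,7"])
```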

Use a Set or Counter to store the categories of a variable and how often each category occurs

Often some fields in the data are of an enumeration type. For example, blood types can only be A, B, AB, or O. While it is nice to use assertions to limit blood types to one of these 4 values, you can't use an assertion if a category contains many possible values, especially values you might not expect. This is where a data structure like a Counter works better. With it you can:

Print a message to alert you whenever you encounter a new value you didn't expect for a category.

Check the counts after cleaning. For example, if someone mistakenly entered their blood type as C, that will be easy to spot in retrospect.
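Using the article's blood-type example, a sketch with Python's `collections.Counter` might look like this; the set of expected values comes from the text, the function name is my own.

```python
from collections import Counter

# Track every value a field takes and how often, warning on surprises.
EXPECTED_BLOOD_TYPES = {"A", "B", "AB", "O"}
seen = Counter()

def record_blood_type(value):
    if value not in EXPECTED_BLOOD_TYPES:
        print(f"warning: unexpected blood type {value!r}")
    seen[value] += 1

for v in ["A", "O", "C", "A"]:
    record_blood_type(v)
```

After cleaning, inspecting `seen` makes the stray `'C'` (and its frequency) stand out immediately.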

Breakpoint Cleaning

If you have a lot of raw data, cleaning it all at once can take a long time: possibly 5 minutes, 10 minutes, an hour, or even days. In practice, the cleaning program often crashes suddenly partway through.

Suppose you have a million records and your cleaning program crashes on record 325392 due to some exception. You fix the bug and clean again, but now the program has to re-clean records 1 through 325391, which is wasted work. Instead, you can:

1. Have your cleaning program print out which record it is currently cleaning, so that if it crashes, you know which record it was processing.
2. Have your program support resuming from a checkpoint, so that when cleaning is re-run, it can start directly from record 325392.

The re-run may crash again on a later record; just fix the bug again and resume from the record that crashed.

When all records have been cleaned, clean them all once more from scratch, because later bug fixes may change how earlier records would have been cleaned; a final full pass guarantees consistency. In general, though, checkpoints can save a lot of time, especially while debugging.
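A minimal checkpoint scheme can be sketched as follows; the checkpoint file name and the per-record cleaning step are illustrative placeholders, not details from the article.

```python
import os

# Resumable cleaning: after each record is cleaned, persist its index,
# so a re-run after a crash picks up where the last run stopped.

CHECKPOINT = "cleaning.checkpoint"

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return int(f.read())
    return 0  # no checkpoint yet: start from the first record

def save_checkpoint(index):
    with open(CHECKPOINT, "w") as f:
        f.write(str(index))

def clean(records):
    start = load_checkpoint()
    for i in range(start, len(records)):
        print(f"cleaning record {i}")  # so a crash tells you where it died
        records[i].strip()             # stand-in for the real cleaning step
        save_checkpoint(i + 1)         # re-runs resume after this record
```

Once everything passes, delete the checkpoint file and run one final full pass, as described above.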

Testing on a portion of the data

Do not try to clean all the data at once. When you first start writing the cleaning code and debugging it, test on a small subset, then expand that subset and test again. The goal is for your cleaning program to finish on the test set very quickly, say within a few seconds, which saves you time during repeated testing.

Note, however, that a subset used for testing often fails to cover some of the oddball records, since oddballs are always relatively rare.
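One simple way to cap the input size is `itertools.islice`; the limit of 100 here is an illustrative knob, not a value from the article.

```python
from itertools import islice

# Debug on a small subset first: islice stops after `limit` items, so
# even a huge (or infinite) input source finishes in seconds.

def clean_subset(lines, limit=100):
    return [line.strip().split(",") for line in islice(lines, limit)]

sample = clean_subset((f"{i},x,y" for i in range(1_000_000)), limit=100)
```

Because `islice` works lazily on any iterable, the same code later runs unchanged on the full dataset by raising (or dropping) the limit.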

Print the cleaning log to a file

When running the cleaning program, print the cleaning logs and error messages to a file, so you can easily inspect them with a text editor.

Optional: store the raw data alongside the cleaned data

This lesson is useful when you don't have to worry about storage space: save the original record as an extra field in the cleaned data. Then, if you find a record that doesn't look right after cleaning, you can see directly what the original data looked like, which makes debugging easy.

The downside is that this takes double the storage space and can make certain cleaning operations slower, so apply it only where efficiency allows.
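The idea is simply to carry the raw line along as one more field of the cleaned record; the field names here, including `_raw`, are illustrative.

```python
# Keep the original line inside the cleaned record, so a suspicious
# cleaned value can be traced straight back to its source.

def clean_with_raw(line):
    name, blood_type = line.strip().split(",")
    return {
        "name": name,
        "blood_type": blood_type,
        "_raw": line,  # doubles storage, but invaluable for debugging
    }

rec = clean_with_raw("alice,A\n")
```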

Validation of cleaned data

Remember to write a validation program to verify that the clean data you get after cleaning matches the format you expect. You can't control the format of the raw data, but you can control the format of the clean data. So, always make sure that the clean data is in the format that you expect.

This is actually very important, because once you finish cleaning the data, the next step is to work directly on the clean data; if you don't have to, you will never touch the raw data again. Therefore, make sure the data is clean enough before you start the analysis. Otherwise, you may get wrong analysis results, and at that point it will be hard to spot mistakes made long ago during the data cleaning process.
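A validation pass can be a small standalone program run after cleaning; the expected schema here (a name plus a known blood type) is an illustrative assumption reusing the earlier blood-type example.

```python
# Validate the cleaned output against the format you control: every
# record must have exactly the expected keys and a legal blood type.

VALID_BLOOD_TYPES = {"A", "B", "AB", "O"}

def validate(cleaned_records):
    for i, rec in enumerate(cleaned_records):
        assert set(rec) == {"name", "blood_type"}, f"bad keys in record {i}"
        assert rec["blood_type"] in VALID_BLOOD_TYPES, \
            f"bad blood type in record {i}"

validate([{"name": "alice", "blood_type": "A"},
          {"name": "bob", "blood_type": "O"}])
```

If validation passes silently, the clean data matches the format you expect; if it crashes, you know exactly which record to investigate before any analysis begins.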
