Next up is a wrapper around the database operations. Because the crawler runs multiple processes, each with multiple threads, every worker needs to know which URLs have already been crawled and which still need to be crawled. Let's give each URL one of three states:
outstanding: URLs waiting to be crawled
processing: URLs currently being crawled
complete: URLs that have been crawled
So: every URL starts out as OUTSTANDING; when a worker starts crawling it, the status changes to PROCESSING; when the crawl finishes, the status changes to COMPLETE; and if the crawl fails, the status is reset to OUTSTANDING.
To handle the case where a process is terminated while it is still crawling a URL, we set a timeout: if a URL has been in the PROCESSING state for longer than this value, its status is reset to OUTSTANDING so another worker can pick it up.
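The state transitions and the timeout reset described above can be sketched roughly as follows. This is only an in-memory illustration, not the actual database wrapper: the class name `UrlQueue`, its method names, the 300-second default timeout, and the injectable `clock` parameter are all assumptions made for the example. A `threading.Lock` only protects threads inside one process; across processes, a real version would have to rely on the database's own atomic update instead.

```python
import threading
import time

OUTSTANDING, PROCESSING, COMPLETE = "outstanding", "processing", "complete"


class UrlQueue:
    """Hypothetical in-memory stand-in for the database-backed URL queue."""

    def __init__(self, timeout=300, clock=time.monotonic):
        self.timeout = timeout  # seconds a URL may stay in PROCESSING (assumed value)
        self.clock = clock      # injectable so the timeout can be exercised in tests
        self.lock = threading.Lock()
        self.records = {}       # url -> {"status": ..., "started": ...}

    def push(self, url):
        """Register a new URL as OUTSTANDING; already-known URLs are left alone."""
        with self.lock:
            self.records.setdefault(url, {"status": OUTSTANDING, "started": None})

    def pop(self):
        """Claim one OUTSTANDING URL, mark it PROCESSING, and return it."""
        with self.lock:
            self._reset_stale()
            for url, rec in self.records.items():
                if rec["status"] == OUTSTANDING:
                    rec["status"] = PROCESSING
                    rec["started"] = self.clock()
                    return url
        return None  # nothing is waiting to be crawled

    def complete(self, url):
        """Mark a URL as successfully crawled."""
        with self.lock:
            self.records[url]["status"] = COMPLETE

    def fail(self, url):
        """Put a failed URL back so it can be retried."""
        with self.lock:
            self.records[url] = {"status": OUTSTANDING, "started": None}

    def _reset_stale(self):
        """Reset URLs stuck in PROCESSING past the timeout. Caller holds the lock."""
        now = self.clock()
        for rec in self.records.values():
            if rec["status"] == PROCESSING and now - rec["started"] > self.timeout:
                rec["status"] = OUTSTANDING
                rec["started"] = None
```

A worker would call `pop()` to claim a URL, then `complete(url)` on success or `fail(url)` on error. If the worker dies in between, the URL simply stays in PROCESSING until a later `pop()` notices that the timeout has passed and hands it out again.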