Support Portal for ProcessRobot and WinAutomation

Answered

How to remove duplicates during web data extraction?

How can I extract elements from the attached web page without taking duplicate items? I want to extract all the districts from the 'company registered district' column, but only one value per district, with the repetitions removed. Is that possible?

[Attachment: List.PNG (63.8 KB)]



1. Create a temporary Excel file of the extracted web page data.

2. Since you can execute a SQL SELECT against an Excel file, run a SELECT DISTINCT query against that file to pull out your extracted data without repetitions (see the sketch below).
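For illustration, here is a minimal sketch of that query run outside WinAutomation (inside WinAutomation the same SELECT statement would go into its SQL actions after opening a connection to the workbook). The file path, sheet name, and column header below are placeholders, and it assumes the Microsoft Excel ODBC driver is installed:

```python
import pyodbc

# Placeholders: adjust the workbook path, sheet name ([Sheet1$]) and
# column header to match the temporary Excel file from step 1.
conn = pyodbc.connect(
    r"Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};"
    r"DBQ=C:\Temp\extracted.xlsx;"
)

# SELECT DISTINCT collapses repeated district values to one row each.
rows = conn.cursor().execute(
    "SELECT DISTINCT [company registered district] FROM [Sheet1$]"
).fetchall()

districts = [row[0] for row in rows]
print(len(districts), "unique districts")
```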

Sorry for my ignorance, I am totally new at this. Can you elaborate on your solution? The process can write to Excel just fine when there is less data, but this particular list has 66,885 entries and it fails with a 'failed to write in Excel' error.
There should not be a row limit like that with v8; the Softomotive team here can confirm this. It could be an issue with your Excel version or an older version of WinAutomation. Which versions of both are you using? I suggest you open a ticket to report your write-limit error.
I am using WinAutomation 8 and Office 2010. I guess it is a limit issue. Thanks for your help.
Gave it more thought: a regular expression can be put on the advanced CSS selector of the 'Extract Data from Web Page' action. Now there should probably be a regex that drops all duplicates except the first. I'll have to mull this one over.

I used ^(.*)(\r?\n\1)+$ but no luck. I don't know anything about regex; I just googled and found it.

My list

1. Bagerhat

2. bandarban

3. bagerhat 

Turned into

1.

2.

3.
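For the record, that pattern only collapses runs of adjacent, byte-identical lines, so the list has to be sorted first and each match has to be replaced with the captured group (\1) rather than deleted; case differences like 'Bagerhat' vs 'bagerhat' (and the trailing space in the sample) also break the backreference unless matching ignores case. A minimal Python sketch of the idea, with made-up sample data:

```python
import re

lines = ["Bagerhat", "bandarban", "bagerhat"]  # made-up sample

# Sort (case-insensitively) so duplicates become adjacent, then let the
# backreference \1 collapse each run of repeated lines into one line.
text = "\n".join(sorted(lines, key=str.lower))
deduped = re.sub(r"(?mi)^(.*)(?:\r?\n\1)+$", r"\1", text)
print(deduped.splitlines())  # -> ['Bagerhat', 'bandarban']
```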


ADMIN

Hello there,


Could you please give us a little bit more information as to how you are extracting the data?


If after extracting the data, you end up with a list which holds all the districts, then you can simply use the "Remove Duplicate Items from List" action to remove all the duplicates.
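(For intuition only: an order-preserving de-duplication like that action performs amounts to something like the sketch below. The case and whitespace folding is an assumption here, so that 'Bagerhat' and 'bagerhat ' count as the same district.)

```python
def remove_duplicate_items(items):
    """Keep the first occurrence of each value, preserving order."""
    seen = set()
    unique = []
    for item in items:
        key = item.strip().lower()  # assumption: fold case/whitespace
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

print(remove_duplicate_items(["Bagerhat", "bandarban", "bagerhat "]))
# -> ['Bagerhat', 'bandarban']
```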


I hope the above makes sense.


2 people like this
Thank you for your reply. I tried the 'Remove Duplicate Items from List' action, and that works fine for a small amount of data, say 1,000-1,500 items, but this list contains more than 67,000 values. Also, I forgot to mention that when I run the process on a fully loaded page, Firefox shows 'a script is slowing down your browser' with two options, 'stop' and 'wait'; whichever option I choose, I get an error from WinAutomation.
ADMIN
Answer

Hello there,


In our testing, removing duplicates from a list with 70,000 values took some time to complete (7 minutes); nevertheless, it finished successfully with no errors. You could just let the process run until it completes.


Having said the above, the most reliable and fastest way to do this would be to run an SQL query against an Excel file, as Joseph mentioned in a previous reply.

https://www.w3schools.com/sql/sql_distinct.asp


Regarding the issue with the script slowing down your browser, please note that you could use the 'Get Details of Element on Web Page' action to get the whole table into a text variable, and then simply parse that text with regular expressions to extract the data you need.
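As a rough illustration of that parsing step, assuming the element text comes back as one table row per line with tab-separated cells (the real layout depends on the page, and the sample rows here are made up):

```python
import re

# Made-up sample: one row per line, tab-separated, district in the last cell.
table_text = (
    "ACME Ltd\t2017\tBagerhat\n"
    "Beta Co\t2018\tbandarban\n"
    "Gamma Inc\t2019\tbagerhat"
)

# Grab the last tab-free field on each line, then de-duplicate.
districts = re.findall(r"([^\t\r\n]+)$", table_text, flags=re.M)
print(sorted({d.strip().lower() for d in districts}))
# -> ['bagerhat', 'bandarban']
```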


I hope the above makes sense.


1 person likes this

Thanks, it completed successfully. Letting the process run did the trick; it took me 9 minutes. Not clicking anything on the 'script is slowing down your browser' warning was all I needed. Then I used the 'Remove Duplicate Items from List' variable action. A regular expression would probably be better/faster, but this works just fine.
