Reply
 
LinkBack Thread Tools Search this Thread Display Modes
  #81   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 1,182
Default Read (and parse) file on the web

Maybe i was not too clear.
Case one:
Using a browser, log to https://www.nsncenter.com/ and give it a
search term: 5960&REGULATOR&"ELECTRON TUBE" in the NSN box, and click
on the WebFLIS Search green button.
Then use the browser "File" pulldown, select Save Page As and
modify the extension to .TXT
The resulting file is a bit different than what one sees in other
methods.
Case two:
Choose a method of getting the search results; a given search term
will always produce the same results (ie: reproducible), and small
changes of the search term may give different results - and THOSE
DIFFERENCES are some of what i am talking about.
Case three:
Choose a given search term, and compare results between various
methods; DIFFERENCES may be huge, also some of what i am talking
about.

In case three, with your program, whatever is happening gives a
radically different result. And that result is VERY useful.

For some unknown reason, your program/macro refuses to run, and
gives the following error message: "Can't find project or library".

Would you be so kind as to modify the search term in your program
to 5960&REGULATOR&"ELECTRON TUBE" and run it? and please send the
results


Robert,
I'm fully aware of what you're saying. That's just how search engines
work! The most important statement I made to you is...

"You need to decide what URL string you want to run with and set
both the URL cell and gsUrl1 strings to that."

...because that string determines the resulting page source structure
for the search. ANY URL string other than what is hard-coded in gsUrl1
will not likely work with my code because that code is designed to work
with the structure of the resulting page source it uses. That's why
there are 2 methods of parsing the same webpage; AutoParse and Sheet1
are both coded differently to reflect the fact that...

even with the same URL string, both methods have to be structured
to work with the underlying document.

So if you change the search criteria then expect a different page
source for which you'll need to find out how the structure needs to be
parsed in terms of what line numbers the data you want rests on *in
relation to* the string the code searches for to establish 'lStart' and
'lEnd' in their respective "ParseParentPage" and "ParseChildPage"
procedures. For example, when you added "ELECTRON TUBE" to the URL
string I had to revise the hard-code start position of the 1st loop in
both procedures to match the new page source...

Sheet1.ParseParentPage
<snip
vDoc = Split(sTxt, vbNewLine)
For n = 120 To UBound(vDoc)
If InStr(vDoc(n), sLinkTxt) 0 Then _
lStart = n: lEnd = n + 60: Exit For
Next 'n

m_OPenClose.ParseParentPage
<snip
vDoc = Split(StripTags(ReadTextFile(sTxt)), vbNewLine)
For n = 7900 To UBound(vDoc)
If InStr(vDoc(n), sLinkTxt) 0 Then _
lStart = n: lEnd = n + 160: Exit For
Next 'n

So then, entering 5960&REGULATOR&"ELECTRON TUBE" into the search box
returns this URL string...

NSNSearch?q=5960%26REGULATOR%26%22ELECTRON+TUBE%22

appended to...

https://www.nsncenter.com/

...in the address bar. This returns a browser page source containing
8147 lines. Using SaveAs "txt" creates a 411 line file with all the
html tags stripped out of this page source. *This is why I show both
methods* of parsing: *They code differently to match the page source
'structure' they are parsing*!

The browser source finds the 1st item link on line 7920;
(This is what AutoParse works with via URLDownloadToFile)
Sheet1.txtPgSrc finds that same link on line 129;
(This is what clicking btnViewSrc returns as
WebBrowser1.Document.body.innerHTML)
The SaveAs "txt" file finds that same item link on line 296;

If you click btnViewSrc and then in the VBE Immediate Window:
Typing "sheet1.txtPgSrc.Text=striptags(sheet1.txtPgSrc.Te xt)"
then typing "?sheet1.txtPgSrc.LineCount"
returns 274 lines where the same item link is on line 127.
(This is essentially same as using SaveAs "txt" without the
ignored HTML tags "<https://...")

Same webpage, different page sources, so different code is needed
to parse each according to how the data sits within the text!

So whatever search criteria you stumble onto as being the best to run
with, revising my code to work with the URL string is simple as 1-2-3:

1 Hard-code the base string; gsUrl2$ = "https://www.nsncenter.com/"

2 Hard-code the search string appended to gsUrl2 as gsUrl1$;
"NSNSearch?q=5960%20regulator%20%22ELECTRON%20TUBE %22&PageNumber=";

Optionally, just go to gsUrl2 and enter your search criteria, then
copy the AddressBar contents and append "&PageNumber=" to it!

3 Revise existing code to work with its page source for both
parent/child pages in m_OpenClose and Sheet1.

The 2 methods my project demos are what I find to be most reliable and
consistent in terms of structure for parsing. Problem today is the NSN
search engine returns different page source than it did when I sent you
the last revision. Perfect time for you to get your hands dirty
revising the code to work with the new page source. Though, it
shouldn't matter what the source is if we have consistent html tags for
the data.

You get your hands dirty; I'll write generic code so it doesn't matter
what the page source structure is...

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

  #82   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 1,182
Default Read (and parse) file on the web

Note that *now* NSN is returning multiple part numbers by multiple
suppliers, making the logic of approach entirely new. This effectively
puts us back to square 1!<Ugh!

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

  #83   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 1,182
Default Read (and parse) file on the web - revised

I was able to redo this for the new source page structure, without too
much effort...

https://app.box.com/s/23yqum8auvzx17h04u4f

Note that it now captures multiple items listed on the same child page.
Also, the item# is now the MCRL# so the headings have been changed to:

Item#,Part#,Cage,Source

If you don't get correct/expected results in future, redo the code for
the new source page. *Still better to access the database directly*!<g

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

  #84   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 1,182
Default Read (and parse) file on the web

I understand now what you are doing from your very nice
description.
BUT.
For some unknown reason, your UNMODIFIED program/macro refuses to
run, and gives the following error message: "Can't find project or
library".
It obviously works on your (Win7SP1?) system, but not mine.

Would you be so kind as to modify the search term in your program
and run it there?
Term: 5960&REGULATOR&"ELECTRON TUBE"


It displays a different pg1, but the code works just the same. If you
follow how the code executes, the links list *must* be loaded in
gvLinks *before* you can run GetItemInfo. So on Sheet1...

Load a main webpage
View Source
Get Item Links

...to load gvLinks. Then click on a link in WebBrowser1, View Source,
then Get Item Info.

Right click 'Back' to choose another link, View Source, then Get Item
Info. And so on for each link on the parent page. If you switch parent
page then start the whole process again!

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus

  #85   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 1,182
Default Read (and parse) file on the web

GS wrote:
Note that *now* NSN is returning multiple part numbers by multiple
suppliers, making the logic of approach entirely new. This
effectively
puts us back to square 1!<Ugh!

When manually doing this for the first 30 pages, i saw some that
had multiple MCRN line items for a given NSN; the first line item was
always useful and more to the point reliably related.
One case was so bad that it was really a magnetron, definitely NOT
a regulator.
So, i wrote my Excel program to use only the first one.

I cannot use URLDownloadToFile; it fails for two reasons: 1)
virtual spaces (%20) kill it; 2) https: kills it.
No fixes seen in the forums; one person seems to have it working.


My AutoParse sub uses URLDownloadToFile exclusively. It passes the URL
string with the appropriate %tags (%20, %20%22, %22) and works just
fine!

So, i use:
Workbooks.Open Filename:=openFYL$ 'YES..opens as R/O
With Worksheets(1)
' .Copy ''do not need; saves BOOKn space
.SaveAs sav$
End With

Does not work in Win2K; works in WinXP and in Win7.

But i get different results than you do, using what seems to be
the exact same search term (and variants as well).


I will oen the project on my XP machine and see what happens there...

--
Garry

Free usenet access at http://www.eternal-september.org
Classic VB Users Regroup!
comp.lang.basic.visual.misc
microsoft.public.vb.general.discussion

---
This email has been checked for viruses by Avast antivirus software.
https://www.avast.com/antivirus



  #86   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 93
Default Read (and parse) file on the web

GS wrote:
Maybe i was not too clear. Case one: Using a browser, log to
https://www.nsncenter.com/ and give it a search term:
5960&REGULATOR&"ELECTRON TUBE" in the NSN box, and click on the
WebFLIS Search green button. Then use the browser "File" pulldown,
select Save Page As and modify the extension to .TXT The resulting
file is a bit different than what one sees in other methods. Case
two: Choose a method of getting the search results; a given search
term will always produce the same results (ie: reproducible), and
small changes of the search term may give different results - and
THOSE DIFFERENCES are some of what i am talking about. Case three:
Choose a given search term, and compare results between various
methods; DIFFERENCES may be huge, also some of what i am talking
about.

In case three, with your program, whatever is happening gives a
radically different result. And that result is VERY useful.

For some unknown reason, your program/macro refuses to run, and
gives the following error message: "Can't find project or
library".

Would you be so kind as to modify the search term in your program
to 5960&REGULATOR&"ELECTRON TUBE" and run it? and please send the
results


Robert, I'm fully aware of what you're saying. That's just how
search engines work! The most important statement I made to you
is...

"You need to decide what URL string you want to run with and set
both the URL cell and gsUrl1 strings to that."

..because that string determines the resulting page source structure
for the search. ANY URL string other than what is hard-coded in
gsUrl1 will not likely work with my code because that code is
designed to work with the structure of the resulting page source it
uses. That's why there are 2 methods of parsing the same webpage;
AutoParse and Sheet1 are both coded differently to reflect the fact
that...

even with the same URL string, both methods have to be structured to
work with the underlying document.

So if you change the search criteria then expect a different page
source for which you'll need to find out how the structure needs to
be parsed in terms of what line numbers the data you want rests on
*in relation to* the string the code searches for to establish
'lStart' and 'lEnd' in their respective "ParseParentPage" and
"ParseChildPage" procedures. For example, when you added "ELECTRON
TUBE" to the URL string I had to revise the hard-code start position
of the 1st loop in both procedures to match the new page source...

Sheet1.ParseParentPage <snip vDoc = Split(sTxt, vbNewLine) For n =
120 To UBound(vDoc) If InStr(vDoc(n), sLinkTxt) 0 Then _ lStart =
n: lEnd = n + 60: Exit For Next 'n

m_OPenClose.ParseParentPage <snip vDoc =
Split(StripTags(ReadTextFile(sTxt)), vbNewLine) For n = 7900 To
UBound(vDoc) If InStr(vDoc(n), sLinkTxt) 0 Then _ lStart = n: lEnd
= n + 160: Exit For Next 'n

So then, entering 5960&REGULATOR&"ELECTRON TUBE" into the search box
returns this URL string...

NSNSearch?q=5960%26REGULATOR%26%22ELECTRON+TUBE%22

appended to...

https://www.nsncenter.com/

..in the address bar. This returns a browser page source containing
8147 lines. Using SaveAs "txt" creates a 411 line file with all the
html tags stripped out of this page source. *This is why I show both
methods* of parsing: *They code differently to match the page source
'structure' they are parsing*!

The browser source finds the 1st item link on line 7920; (This is
what AutoParse works with via URLDownloadToFile) Sheet1.txtPgSrc
finds that same link on line 129; (This is what clicking btnViewSrc
returns as WebBrowser1.Document.body.innerHTML) The SaveAs "txt"
file finds that same item link on line 296;

If you click btnViewSrc and then in the VBE Immediate Window: Typing
"sheet1.txtPgSrc.Text=striptags(sheet1.txtPgSrc.Te xt)" then typing
"?sheet1.txtPgSrc.LineCount" returns 274 lines where the same item
link is on line 127. (This is essentially same as using SaveAs "txt"
without the ignored HTML tags "<https://...")

Same webpage, different page sources, so different code is needed to
parse each according to how the data sits within the text!

So whatever search criteria you stumble onto as being the best to
run with, revising my code to work with the URL string is simple as
1-2-3:

1 Hard-code the base string; gsUrl2$ = "https://www.nsncenter.com/"

2 Hard-code the search string appended to gsUrl2 as gsUrl1$;
"NSNSearch?q=5960%20regulator%20%22ELECTRON%20TUBE %22&PageNumber=";

Optionally, just go to gsUrl2 and enter your search criteria, then
copy the AddressBar contents and append "&PageNumber=" to it!

3 Revise existing code to work with its page source for both
parent/child pages in m_OpenClose and Sheet1.

The 2 methods my project demos are what I find to be most reliable
and consistent in terms of structure for parsing. Problem today is
the NSN search engine returns different page source than it did when
I sent you the last revision. Perfect time for you to get your hands
dirty revising the code to work with the new page source. Though, it
shouldn't matter what the source is if we have consistent html tags
for the data.

You get your hands dirty; I'll write generic code so it doesn't
matter what the page source structure is...

I understand now what you are doing from your very nice description.
BUT.
For some unknown reason, your UNMODIFIED program/macro refuses to
run, and gives the following error message: "Can't find project or library".
It obviously works on your (Win7SP1?) system, but not mine.

Would you be so kind as to modify the search term in your program and
run it there?
Term: 5960&REGULATOR&"ELECTRON TUBE"

  #87   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 93
Default Read (and parse) file on the web

GS wrote:
Note that *now* NSN is returning multiple part numbers by multiple
suppliers, making the logic of approach entirely new. This effectively
puts us back to square 1!<Ugh!

When manually doing this for the first 30 pages, i saw some that had
multiple MCRN line items for a given NSN; the first line item was always
useful and more to the point reliably related.
One case was so bad that it was really a magnetron, definitely NOT a
regulator.
So, i wrote my Excel program to use only the first one.

I cannot use URLDownloadToFile; it fails for two reasons: 1) virtual
spaces (%20) kill it; 2) https: kills it.
No fixes seen in the forums; one person seems to have it working.

So, i use:
Workbooks.Open Filename:=openFYL$ 'YES..opens as R/O
With Worksheets(1)
' .Copy ''do not need; saves BOOKn space
.SaveAs sav$
End With

Does not work in Win2K; works in WinXP and in Win7.

But i get different results than you do, using what seems to be the
exact same search term (and variants as well).


  #88   Report Post  
Posted to microsoft.public.excel.programming,microsoft.public.excel
external usenet poster
 
Posts: 93
Default Read (and parse) file on the web

GS wrote:
It displays a different pg1, but the code works just the same. If you
follow how the code executes, the links list *must* be loaded in gvLinks
*before* you can run GetItemInfo. So on Sheet1...

Load a main webpage
View Source
Get Item Links

..to load gvLinks. Then click on a link in WebBrowser1, View Source,
then Get Item Info.

Right click 'Back' to choose another link, View Source, then Get Item
Info. And so on for each link on the parent page. If you switch parent
page then start the whole process again!

--
Garry

Will give that a try.
Meanwhile i deduced a few thing via search variants.
1) A space between terms (5060 REGULATOR 63060) is a default logical OR,
so results NOT what one would normally think.
2) Case matters; since a lower case term is very unlikely to match
anything, the "About" count can be in the region of 10^8.
3) Removing spaces (using & instead of AND) can also give a huge result.
4) Not sure about use of & in place of AND yet (untested).
With a little jujitsu of terms, i have one search that says About 24
and the actual is 28; not bad; most of the results are useful.
But futzing them around and use of well-placed spaces, the number is
large, with multiple pages being identical to other pages (AT RANDUMB);
some of the results would clearly lie with that "short" search - but
ain't there (and vice-versa)!
I have done 500 pages in one case to squeeze out two devices not seen
anywhere else, except one with your scheme which got FOUR never seen
elsewhere.

Messy.
Reply
Thread Tools Search this Thread
Search this Thread:

Advanced Search
Display Modes

Posting Rules

Smilies are On
[IMG] code is On
HTML code is Off
Trackbacks are On
Pingbacks are On
Refbacks are On


Similar Threads
Thread Thread Starter Forum Replies Last Post
EOF Parse Text file Bam Excel Programming 2 September 24th 08 04:13 AM
Parse a txt file and save as csv? Frank Pytel Excel Programming 4 September 14th 08 09:23 PM
parse from txt file geebee Excel Programming 3 August 19th 08 07:55 PM
Parse File Location Mike Excel Worksheet Functions 5 October 3rd 07 04:05 PM
REQ: Simplest way to parse (read) HTML formatted data in via Excel VBA (or VB6) Steve[_29_] Excel Programming 3 August 25th 03 10:43 PM


All times are GMT +1. The time now is 12:53 AM.

Powered by vBulletin® Copyright ©2000 - 2024, Jelsoft Enterprises Ltd.
Copyright ©2004-2024 ExcelBanter.
The comments are property of their posters.
 

About Us

"It's about Microsoft Excel"