I'll eventually have the input file filled with 350 million items. Right now there is only one:
$ more input
3308191
The following program reads the number from the file named 'input' and builds a URL from it. I then have lynx dump the page into a file called 'out' and grep the entire thing for the Product Number, Product ID, SKU, UPC, and weight.
m-net% more parse.pl
#!/usr/bin/perl -w
my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
my $temp;
@product = grep(/Product ID/, @shit);
@id = grep(/Item ID/, @shit);
@sku = grep(/SKU/, @shit);
@upc = grep(/UPC/, @shit); #this part doesn't grep UPC correctly. I get some extra data after UPC.
@weight = grep(/Weight/, @shit);
why are you writing out the output of lynx JUST TO READ IT BACK IN AGAIN? this is the most absurd part of this program.
you have the text in $temp. you know how to use backticks but why do you do the file write and reading back in? if you assigned the backticks to an array you would get the same thing as in @shit without the wasted effort.
also calling it @shit is not a good thing.
c> @product = grep(/Product ID/, @shit);
c> @id = grep(/Item ID/, @shit);
c> @sku = grep(/SKU/, @shit);
c> @upc = grep(/UPC/, @shit); #this part doesn't grep UPC correctly. I
c> get some extra data after UPC.
that is a problem with the format of the html page. html isn't line oriented and you are grepping over lines. the proper way to deal with html is with a parser. or in special very well defined cases with regexes to actually grab what you want from the text. whole html lines are almost never what you want.
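To make the point above concrete, here is a minimal sketch of why a line-based grep returns extra data and how a capture group avoids it. The sample line and its "label: value" layout are assumptions for illustration only; the real page layout is not shown in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line from a text dump: several fields share one line.
my @lines = ("SKU: AB-12   UPC: 012345678905   Weight: 2.5 lbs\n");

# grep keeps every element that matches, so the whole line comes back,
# including the SKU and weight that sit next to the UPC.
my @upc_lines = grep { /UPC/ } @lines;

# A capture group pulls out only the value that follows the label.
my ($upc) = $lines[0] =~ /UPC:\s*(\d+)/;

print "grep result: $upc_lines[0]";
print "captured:    $upc\n";
```

If the real markup does not follow a fixed layout like this, a proper HTML parser is the safer route.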
> why are you calling out to a program when perl can load web pages just
> fine with LWP? did you even look for web stuff on cpan?

Would using LWP speed up the code? By the way, this code is meant to run on a server with restricted access, i.e., I can't install stuff from CPAN on that server.
> Huh? Are you saying I don't need the 'out' file?
Maybe something like this?
% more parse.pl
#!/usr/bin/perl -w
my (@shit, $read, $build, @product, @id, @sku, @upc, @weight);
my @temp;
c> On May 15, 1:37 pm, Uri Guttman <u...@stemsystems.com> wrote:
>> >>>>> "c" == chadda <cha...@lonemerchant.com> writes:
>>
>> i have to know if you could write this mess any slower? you are doing
>> everything possible to slow you down.

c> I know I shouldn't criticize free help, but you seem to have some anger
c> management issues.
nope. i have bad code anger issues. i deal with this in code reviews all the time. i just don't get how people come up with wacky and slow ways to do things. i have seen worse code that read in files, parsed them, wrote them out (untouched) and read them in again.
>>
c> open(IN, '<', 'input') || die "cant open: $!";
c> $read = <IN>;
c> chomp($read);
c> $build = "http://www.doba.com/members/catalog/".$read.".html";
c> $temp = `lynx -accept_all_cookies -dump $build`;
>>
>> why are you calling out to a program when perl can load web pages just
>> fine with LWP? did you even look for web stuff on cpan?
>>
c> Would using LWP speed up the code? By the way, this code is meant to
c> run on a server with restricted access. Ie, I can't install stuff from
c> cpan on that server.
if you have access to load scripts you can load pure perl modules too. this is an FAQ.
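As a rough illustration of the FAQ answer: Perl will load a module from any directory on @INC, so a pure-Perl module copied into a directory you can write to works without a system-wide install. The sketch below fakes that with a temporary directory and a made-up module, Local::Greeter, so that it runs anywhere. Both names are hypothetical stand-ins, not anything from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Path qw(make_path);

# Hypothetical private module directory. On a real server this would be a
# directory under your own account where the unpacked .pm files are copied.
my $private_lib = tempdir(CLEANUP => 1);
make_path("$private_lib/Local");

# Write a tiny pure-Perl module there (a stand-in for a CPAN module).
open my $fh, '>', "$private_lib/Local/Greeter.pm"
    or die "can't write module: $!";
print {$fh} "package Local::Greeter;\n",
            "sub hello { return 'loaded from private dir' }\n",
            "1;\n";
close $fh or die "can't close module: $!";

# Put the private directory at the front of the module search path.
# This is the run-time equivalent of a compile-time 'use lib'.
unshift @INC, $private_lib;
require Local::Greeter;

my $msg = Local::Greeter::hello();
print "$msg\n";
```

In a real script the directory is normally fixed, so a plain `use lib '/path/to/your/modules';` near the top (or setting the PERL5LIB environment variable) does the same job.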
c> open(OUTFILE, '>out');
c> print OUTFILE $temp;
c> close OUTFILE;
>>
c> open(OUT, '<', 'out') || die "cant open: $!";
c> @shit = <OUT>;
>>
>> why are you writing out the output of lynx JUST TO READ IT BACK IN
>> AGAIN? this is the most absurd part of this program.
>>
>> you have the text in $temp. you know how to use backticks but why do you
>> do the file write and reading back in? if you assigned the backticks to
>> an array you would get the same thing as in @shit without the wasted
>> effort.
>>
>> also calling it @shit is not a good thing.
>>
c> Huh? Are you saying I don't need the 'out' file?
yes. why do you think you need that file? you call backticks and get the html page in $temp. why do you think you need a file to process that data? you already have it inside perl.
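A minimal sketch of that suggestion: in list context, backticks return the command's output already split into lines, so no temporary file is needed. To keep the example runnable without network access, a printf command stands in for the lynx call, and its sample output is invented. A Unix-like shell is assumed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-in for the lynx call from the thread. In the real script the
# command would be: lynx -accept_all_cookies -dump $build
# Assigning backticks to an array yields one element per output line.
my @lines = `printf 'Product ID: 123\\nUPC: 456\\nWeight: 2 lbs\\n'`;
die "command failed with status $?" if $? != 0;

chomp @lines;

# The greps from the original script can now run directly on @lines.
my @upc = grep { /UPC/ } @lines;
print "$_\n" for @upc;
```

This removes both the write to 'out' and the read back from it. The array simply takes the place of the one filled from the file.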
c> However, I don't know how to use LWP. Again, would the code run faster
c> if I used LWP?
better but forking off lynx is still slow. LWP should be much faster. if you want speed (and with the data size you have, you want it), use LWP.
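For reference, a hedged sketch of what fetching with LWP::Simple might look like. The helper names build_url, fetch_lines and process_ids are made up for this example. Two caveats: get() returns the raw HTML rather than the rendered text that lynx -dump produces, so patterns written against the dump would need adjusting; and LWP::Simple does no cookie handling, so if the site really needs the cookies that -accept_all_cookies implies, LWP::UserAgent with a cookie jar would be the thing to look at instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Build the catalog URL the same way the original script does.
sub build_url {
    my ($id) = @_;
    return "http://www.doba.com/members/catalog/$id.html";
}

# Fetch one page in-process, with no fork and no temporary file.
# Returns the page split into lines, or an empty list on failure.
sub fetch_lines {
    my ($id) = @_;
    my $html = get(build_url($id));    # get() returns undef on failure
    return unless defined $html;
    return split /\n/, $html;
}

# Read IDs one per line so a very large input file is never held in memory.
sub process_ids {
    my ($path) = @_;
    open my $in, '<', $path or die "can't open $path: $!";
    while (my $id = <$in>) {
        chomp $id;
        next unless length $id;
        my @lines = fetch_lines($id) or next;
        print "$id: fetched ", scalar(@lines), " lines\n";
    }
    close $in;
}

# Usage: perl fetch.pl input
process_ids($ARGV[0]) if @ARGV;
```

With hundreds of millions of IDs, network latency will dominate either way, but this at least avoids starting a new lynx process for every page.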
depending on how fast you need it (cpu usage will spike with the greps you have) you can also change all that to parse out what you want with regexes. (again, that assumes a known fixed html page layout which you seem to have).
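One way the regex approach could look, assuming (and this is only an assumption) that the dumped text uses a fixed "Label: value" layout. A single global match in list context returns label/value pairs, which can be assigned straight into a hash instead of running five separate greps over the same lines. The sample text is invented.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented sample of dumped page text. The real layout may differ.
my $text = <<'END';
   Product ID: 3308191   Item ID: 778
   SKU: AB-12   UPC: 012345678905
   Weight: 2.5 lbs
END

# With /g in list context, the two capture groups come back as a flat
# list of (label, value, label, value, ...), which fills the hash.
my %field = $text =~ /(Product ID|Item ID|SKU|UPC|Weight):\s*(\S+)/g;

print "$_ => $field{$_}\n" for sort keys %field;
```

Each value stops at the first whitespace, so anything that trails a field on the same line is left out.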
cha...@lonemerchant.com wrote:
> On May 15, 1:37 pm, Uri Guttman <u...@stemsystems.com> wrote:
> chadda <cha...@lonemerchant.com> writes:
> > i have to know if you could write this mess any slower? you are doing
> > everything possible to slow you down.
> I know I shouldn't criticize free help, but you seem to have some anger
> management issues.
He seems to constantly come across this way. I really wish he could see things from other points of view. ...
All the OP needs is LWP::Simple and HTML::TableExtract.
In fact, I wrote a whole script that took only 0.8 seconds to download and parse a single page (of course, with more IDs in the file, the only real limit on speed is network latency and transfer speed), but I have decided not to post it as I do not know what his intentions are.
As for you, pick a posting id and stick with it.
PLONKETY PLONK!
Sinan
--
A. Sinan Unur <1...@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
> On May 15, 3:16 pm, "Gordon Etly" <ge...@bentsys-INVALID.com> wrote:
>> cha...@lonemerchant.com wrote:
>> > On May 15, 1:37 pm, Uri Guttman <u...@stemsystems.com> wrote:
>> > chadda <cha...@lonemerchant.com> writes:
>> > > i have to know if you could write this mess any slower? you are doing
>> > > everything possible to slow you down.
>> > I know I shouldn't criticize free help, but you seem to have some
>> > anger management issues.
TimeThis : Command Line : p list
TimeThis : Start Time : Thu May 15 18:19:28 2008
TimeThis : End Time : Thu May 15 18:19:29 2008
TimeThis : Elapsed Time : 00:00:01.062
Comparing this to the overhead of an empty script:
C:\Temp> cat t.pl
#!/usr/bin/perl

use strict;
use warnings;
C:\Temp> timethis t
TimeThis : Command Line : t
TimeThis : Start Time : Thu May 15 18:20:3